EchoSR: Efficient Context Harnessing for Lightweight Image Super-Resolution

Binhao Wang; Hanli Zhao; Kaihao Zhang; Shihao Zhao; Tao Wang; Wanglong Lu

arxiv: 2605.17470 · v2 · pith:XQTSFI4Xnew · submitted 2026-05-17 · 💻 cs.CV · cs.MM· eess.IV

EchoSR: Efficient Context Harnessing for Lightweight Image Super-Resolution

Hanli Zhao , Binhao Wang , Shihao Zhao , Tao Wang , Kaihao Zhang , Wanglong Lu This is my paper

Pith reviewed 2026-05-20 14:25 UTC · model grok-4.3

classification 💻 cs.CV cs.MMeess.IV

keywords lightweight super-resolutioncontext fusionmulti-scale modelingimage upscalingefficient neural networkshierarchical contextcomputer vision

0 comments

The pith

EchoSR splits feature processing into local, multi-scale, and global stages with overlapping fusion to deliver higher-quality lightweight super-resolution at roughly twice the speed.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents EchoSR as a way to improve image super-resolution when computing resources are limited. It separates the work into three distinct stages that each focus on a different kind of context: nearby pixels, features at many different sizes, and the overall scene layout. These stages are then joined by a cross-scale overlapping fusion step that mixes the information without adding much extra work. Tests on standard image benchmarks show the method produces sharper results than earlier lightweight approaches while running about twice as fast. Readers would care if this makes detailed image enlargement practical on phones or other small hardware.

Core claim

EchoSR decouples feature learning into disentangled local, multi-scale, and global modeling stages through an efficient context-harnessing strategy, and further promotes seamless cross-scale integration via a cross-scale overlapping fusion mechanism, consistently outperforming state-of-the-art lightweight super-resolution methods across multiple benchmarks while achieving approximately 2x faster speed.

What carries the argument

Disentangled local, multi-scale, and global modeling stages together with a cross-scale overlapping fusion mechanism that unifies multi-scale receptive field modeling and hierarchical context fusion.

If this is right

Lightweight super-resolution models can reach higher reconstruction accuracy without large increases in computation.
The separation into local, multi-scale, and global stages followed by fusion supports efficient handling of context at different ranges.
Faster inference makes real-time upscaling feasible in settings with tight power or memory limits.
The same design choices produce gains on multiple common test sets for single-image super-resolution.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The stage-separation idea could be tried in other efficiency-focused tasks such as image denoising or low-light enhancement.
Adding a temporal stage to the same disentanglement pattern might adapt the method for video super-resolution.
Checking performance on uncurated phone-camera photos could show whether benchmark gains carry over to everyday use.
If the fusion step proves general, it might reduce the need for hand-tuned scale-specific layers in other vision networks.

Load-bearing premise

The proposed disentangled stages and cross-scale overlapping fusion will combine into coherent results that deliver the claimed quality and speed gains without hidden extra costs or extra tuning.

What would settle it

Side-by-side timing and quality measurements on the same hardware and datasets where EchoSR fails to run approximately twice as fast or fails to exceed the PSNR and SSIM scores of prior top lightweight methods.

Figures

Figures reproduced from arXiv: 2605.17470 by Binhao Wang, Hanli Zhao, Kaihao Zhang, Shihao Zhao, Tao Wang, Wanglong Lu.

**Figure 2.** Figure 2: Comparisons on the Urban100 test set at ×2 scale with input resolution of 1024 × 1024. The area of each circle indicates peak memory usage during inference. EchoSR demonstrates the best balance among performance, memory consumption, and inference latency. achieves spatial structural rectification by aligning features from different spatial hierarchies. This mechanism ensures a gradual and coherent transiti… view at source ↗

**Figure 3.** Figure 3: Overview of our EchoSR architecture for lightweight image super-resolution. CHB extracts local, multi-scale, and global features in parallel, while [PITH_FULL_IMAGE:figures/full_fig_p005_3.png] view at source ↗

**Figure 4.** Figure 4: Visual comparison of EchoSR (ours) and SOTA methods on the Urban100 benchmark for [PITH_FULL_IMAGE:figures/full_fig_p009_4.png] view at source ↗

**Figure 5.** Figure 5: Visual comparisons of EchoSR (ours) and SOTA methods on the Urban100 benchmark for [PITH_FULL_IMAGE:figures/full_fig_p010_5.png] view at source ↗

**Figure 6.** Figure 6: Visual comparison of pixel-wise error maps on the Urban100 ( [PITH_FULL_IMAGE:figures/full_fig_p010_6.png] view at source ↗

**Figure 7.** Figure 7: Visual comparisons of EchoSR (ours) and SOTA methods on the Manga109 benchmark for [PITH_FULL_IMAGE:figures/full_fig_p010_7.png] view at source ↗

**Figure 8.** Figure 8: Peak GPU memory usage (left) and average inference latency (right) of SR methods at different input resolutions under the [PITH_FULL_IMAGE:figures/full_fig_p012_8.png] view at source ↗

**Figure 9.** Figure 9: Comparisons on the Urban100 test set at ×2 scale with input resolution of 256 × 256. The area of each circle indicates the computational complexity (MACs). EchoSR (ours) demonstrates a superior trade-off among reconstruction performance, computational efficiency, and inference speed. lation [8] and channel rearrangement [35], [41], which can come at the cost of neglecting the interaction of information flo… view at source ↗

**Figure 10.** Figure 10: Visualization of the ERF across different SR models, including Transformer-based (SwinIR, HiT-SIR), Mamba-based (MambaIR, MaIR), and CNN [PITH_FULL_IMAGE:figures/full_fig_p013_10.png] view at source ↗

**Figure 11.** Figure 11: Visual comparisons of EchoSR-lite (ours) and SOTA tiny methods on the Urban100 benchmark for [PITH_FULL_IMAGE:figures/full_fig_p013_11.png] view at source ↗

**Figure 12.** Figure 12: Visual comparisons of EchoSR (ours) and SOTA methods on the RealSR dataset for [PITH_FULL_IMAGE:figures/full_fig_p014_12.png] view at source ↗

**Figure 13.** Figure 13: Visualization of feature maps in our MRFE module. We showcase outputs from different branches within MRFE. The identity mapping branch [PITH_FULL_IMAGE:figures/full_fig_p015_13.png] view at source ↗

**Figure 15.** Figure 15: Visualization of COFB. After processing by the COFB module, the [PITH_FULL_IMAGE:figures/full_fig_p016_15.png] view at source ↗

**Figure 16.** Figure 16: Visualization of the ERF for the three parallel branches in the [PITH_FULL_IMAGE:figures/full_fig_p017_16.png] view at source ↗

**Figure 17.** Figure 17: Quantitative analysis of ERF area ratio relative change across various cumulative contribution score thresholds ( [PITH_FULL_IMAGE:figures/full_fig_p018_17.png] view at source ↗

read the original abstract

Image super-resolution (SR) aims to reconstruct high-quality, high-resolution (HR) images from low-resolution (LR) inputs and plays a critical role in various downstream applications. Despite recent advancements, balancing reconstruction fidelity and computational efficiency remains a fundamental challenge, particularly in resource-constrained scenarios. While existing lightweight methods attempt to expand receptive fields, many of them either incur substantial computational overhead, naively scale up kernel sizes, or lack mechanisms for coherent multi-scale integration, limiting their overall effectiveness and scalability. To address these limitations, we propose EchoSR, an efficient context-harnessing framework for lightweight image super-resolution, which unifies multi-scale receptive field modeling and hierarchical context fusion. EchoSR decouples feature learning into disentangled local, multi-scale, and global modeling stages through an efficient context-harnessing strategy, and further promotes seamless cross-scale integration via a cross-scale overlapping fusion mechanism. Extensive experiments have shown that EchoSR consistently outperforms state-of-the-art lightweight super-resolution methods across multiple benchmarks, while also achieving a faster speed $(\sim 2\times)$. The source code is available at https://github.com/funnyWang-Echoes/EchoSR.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

EchoSR brings a disentangled local/multi-scale/global modeling setup plus cross-scale overlapping fusion to lightweight SR and claims solid quality gains at roughly 2x speed, but the efficiency numbers need close checking.

read the letter

The main thing here is that EchoSR splits feature processing into separate local, multi-scale, and global stages and then uses an overlapping fusion step to combine them across scales. The authors say this gives better PSNR and SSIM than recent lightweight models while running about twice as fast, and they have released the code at the GitHub link in the abstract. That combination of disentangled stages and the specific fusion looks like the actual new piece relative to the baselines they cite such as IMDN, RFDN, and CARN. The paper does a reasonable job explaining why prior approaches either add too much cost or fail to integrate context cleanly, and the public code is a plus for anyone who wants to inspect or reproduce the work. The experiments are described as extensive and consistent across benchmarks, which is the kind of evidence that matters for this area. The soft spots sit mostly on the efficiency side. The stress-test note is fair: the overlapping fusion could add memory traffic or synchronization cost that is not fully isolated in the reported numbers, and it is not obvious from the abstract whether the 2x speedup survives when everything is measured under identical PyTorch settings and input sizes. The abstract itself gives no concrete PSNR, SSIM, runtime, or ablation figures, so the gap between the claim and the verifiable support is real until the full results and component breakdowns are examined. If those sections show clean ablations that hold the parameter and FLOP budgets fixed, the concern shrinks. This paper is for people working on deployable super-resolution for mobile or edge hardware rather than for theorists chasing new mathematical insights. Readers who follow practical efficiency tweaks in computer vision will get usable architecture ideas and benchmark comparisons from it. The design is coherent on its own terms and engages the existing lightweight SR literature without obvious internal contradictions, so it deserves a serious referee even if the speed claims will probably draw questions. I would send it to peer review instead of a desk reject.

Referee Report

2 major / 2 minor

Summary. The paper proposes EchoSR, a lightweight image super-resolution framework that decouples feature learning into disentangled local, multi-scale, and global modeling stages via an efficient context-harnessing strategy and introduces a cross-scale overlapping fusion mechanism for hierarchical context integration. It claims that this unification enables consistent outperformance over state-of-the-art lightweight SR methods (e.g., IMDN, RFDN, CARN) across multiple benchmarks while delivering approximately 2x faster inference speed, with source code released.

Significance. If the empirical claims hold under rigorous verification, the work would advance lightweight SR by addressing receptive-field expansion without naive kernel scaling or excessive overhead, offering a practical unification of multi-scale modeling and fusion that could benefit real-time applications on edge devices. The public code release supports reproducibility, which strengthens the contribution relative to purely empirical papers lacking such artifacts.

major comments (2)

[§4] §4 (Experiments) and associated tables: the headline claim of ~2x faster speed and superior PSNR/SSIM is load-bearing for the central contribution, yet the manuscript provides no ablation that removes only the cross-scale overlapping fusion block while holding stage channel counts and other parameters fixed; without this, it is impossible to isolate whether fusion overhead negates the reported latency gains under standardized PyTorch/CUDA timing at fixed input resolutions.
[§3.2] §3.2 (Cross-scale overlapping fusion): the mechanism description asserts coherent integration without substantial computational overhead, but contains no FLOPs or memory-traffic bound on the overlapping feature exchange; this directly risks the efficiency claim when compared to prior lightweight baselines at identical parameter/FLOP budgets.

minor comments (2)

[Figure 2] Figure 2 (architecture diagram): the flow between local/multi-scale/global branches and the fusion module would benefit from explicit arrow labels indicating tensor shapes or channel counts to clarify the disentanglement.
[§4.1] §4.1 (Datasets and metrics): specify the exact training/validation splits and whether results are averaged over multiple random seeds with standard deviations, as the abstract asserts 'consistent' outperformance.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and outline the revisions planned to strengthen the empirical validation of our efficiency claims.

read point-by-point responses

Referee: §4 (Experiments) and associated tables: the headline claim of ~2x faster speed and superior PSNR/SSIM is load-bearing for the central contribution, yet the manuscript provides no ablation that removes only the cross-scale overlapping fusion block while holding stage channel counts and other parameters fixed; without this, it is impossible to isolate whether fusion overhead negates the reported latency gains under standardized PyTorch/CUDA timing at fixed input resolutions.

Authors: We agree that an ablation isolating the cross-scale overlapping fusion block (with all other stage channel counts and hyperparameters held fixed) would provide clearer evidence for the source of the reported latency gains. In the revised manuscript we will add this experiment to §4. The variant without the fusion block will be evaluated on the same benchmarks, hardware, and standardized PyTorch/CUDA timing protocol used for the main results, allowing direct quantification of any overhead introduced by the fusion mechanism. revision: yes
Referee: §3.2 (Cross-scale overlapping fusion): the mechanism description asserts coherent integration without substantial computational overhead, but contains no FLOPs or memory-traffic bound on the overlapping feature exchange; this directly risks the efficiency claim when compared to prior lightweight baselines at identical parameter/FLOP budgets.

Authors: We acknowledge that explicit FLOPs and memory-traffic bounds for the overlapping feature exchange would better support the efficiency assertions. In the revised §3.2 we will insert a dedicated complexity analysis that derives the additional FLOPs and memory traffic of the cross-scale overlapping fusion and compares these quantities to the overall model budget as well as to the corresponding costs in the cited lightweight baselines (IMDN, RFDN, CARN) at matched parameter and FLOP counts. Empirical measurements on the same hardware will also be reported. revision: yes

Circularity Check

0 steps flagged

No circularity in EchoSR empirical architecture proposal

full rationale

The paper introduces EchoSR as an empirical neural architecture for lightweight super-resolution, with claims resting on benchmark experiments and speed measurements rather than any closed-form derivation or prediction. No equations, fitted parameters renamed as outputs, or self-citation chains are present in the provided text that would reduce the central claims to inputs by construction. The design choices (disentangled stages and fusion) are presented as engineering decisions validated externally via comparisons to IMDN, RFDN, etc., making the work self-contained against independent benchmarks.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 1 invented entities

The central claim rests on standard deep-learning assumptions plus several design choices introduced in the paper. No machine-checked proofs or parameter-free derivations are mentioned.

free parameters (1)

stage channel counts and fusion kernel sizes
Design hyperparameters that define the local, multi-scale, and global branches and the overlapping fusion; these are chosen to balance efficiency and performance.

axioms (1)

domain assumption Disentangling feature learning into independent local, multi-scale, and global stages plus cross-scale overlapping fusion yields coherent integration without substantial overhead.
Invoked in the abstract description of the framework as the basis for the efficiency and performance claims.

invented entities (1)

EchoSR context-harnessing modules (local/multi-scale/global branches and cross-scale overlapping fusion) no independent evidence
purpose: To unify multi-scale receptive field modeling and hierarchical context fusion in a lightweight manner.
New architectural components introduced by the paper; no independent evidence outside the claimed experiments is provided in the abstract.

pith-pipeline@v0.9.0 · 5749 in / 1575 out tokens · 36356 ms · 2026-05-20T14:25:08.664770+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

EchoSR decouples feature learning into disentangled local, multi-scale, and global modeling stages... cross-scale overlapping fusion mechanism

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

50 extracted references · 50 canonical work pages · 1 internal anchor

[1]

S. Liu, W. Li, D. He, G. Wang, Y . Huang, Ssefusion: Salient semantic enhancement for multimodal medical image fusion with mamba and dy- namic spiking neural networks, Information Fusion 119 (2025) 103031

work page 2025
[2]

J. Qu, D. Huang, Y . Shi, J. Liu, W. Tang, Entropy-aware dynamic path selection network for multi-modality medical image fusion, Information Fusion 123 (2025) 103312

work page 2025
[3]

D. K. Jain, X. Zhao, C. Gan, P. K. Shukla, A. Jain, S. Sharma, Fusion- driven deep feature network for enhanced object detection and tracking in video surveillance systems, Information Fusion 109 (2024) 102429

work page 2024
[4]

Zhang, T

W. Zhang, T. Li, Y . Zhang, G. Pei, X. Jiang, Y . Yao, Ltformer: A light-weight transformer-based self-supervised matching network for heterogeneous remote sensing images, Information Fusion 109 (2024) 102425

work page 2024
[5]

J. Liu, R. Xu, Y . Duan, T. Guo, G. Shi, F. Luo, Mdgf-cd: Land-cover change detection with multi-level diffformer feature grouping fusion for vhr remote sensing images, Information Fusion 120 (2025) 103110

work page 2025
[6]

W. Lu, J. Wang, X. Jin, X. Jiang, H. Zhao, Facemug: A multimodal generative and fusion framework for local facial editing, IEEE Trans. Vis. Comput. Gr. (2024) 1–15

work page 2024
[7]

W. Lu, J. Wang, T. Wang, K. Zhang, X. Jiang, H. Zhao, Visual style prompt learning using diffusion models for blind face restoration, Pattern Recognit. 161 (2025) 111312

work page 2025
[8]

Y . Wang, T. Su, Y . Li, J. Cao, G. Wang, X. Liu, Ddistill-sr: Reparameter- ized dynamic distillation network for lightweight image super-resolution, IEEE Trans. Multim. 25 (2023) 7222–7234

work page 2023
[9]

B. Lim, S. Son, H. Kim, S. Nah, K. M. Lee, Enhanced deep residual net- works for single image super-resolution, in: Proc. IEEE Conf. Comput. Vis. Pattern Recognit. Workshops, 2017, pp. 1132–1140

work page 2017
[10]

Z. Hui, X. Gao, Y . Yang, X. Wang, Lightweight image super-resolution with information multi-distillation network, in: ACM Int. Conf. Multi- media, 2019, pp. 2024–2032

work page 2019
[11]

Liang, J

J. Liang, J. Cao, G. Sun, K. Zhang, L. V . Gool, R. Timofte, Swinir: Image restoration using swin transformer, in: Proc. IEEE Int. Conf. Comput. Vis. Workshops, 2021, pp. 1833–1844

work page 2021
[12]

Z. Chen, Y . Zhang, J. Gu, L. Kong, X. Yang, F. Yu, Dual aggrega- tion transformer for image super-resolution, in: Proc. IEEE Int. Conf. Comput. Vis., 2023, pp. 12278–12287

work page 2023
[13]

H. Choi, J. Lee, J. Yang, N-gram in swin transformers for efficient lightweight image super-resolution, in: Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2023, pp. 2071–2081

work page 2023
[14]

Y . Zhou, Z. Li, C. Guo, S. Bai, M. Cheng, Q. Hou, Srformer: Permuted self-attention for single image super-resolution, in: Proc. IEEE Int. Conf. Comput. Vis., 2023, pp. 12734–12745

work page 2023
[15]

A. Gu, T. Dao, Mamba: Linear-time sequence modeling with selective state spaces, in: First Conference on Language Modeling, 2024

work page 2024
[16]

H. Guo, J. Li, T. Dai, Z. Ouyang, X. Ren, S. Xia, Mambair: A simple baseline for image restoration with state-space model, in: Proc. Eur. Conf. Comput. Vis., V ol. 15076, 2024, pp. 222–241

work page 2024
[17]

B. Li, H. Zhao, W. Wang, P. Hu, Y . Gou, X. Peng, Mair: A locality- and continuity-preserving mamba for image restoration, in: Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2025

work page 2025
[18]

H. Feng, L. Wang, Y . Li, A. Du, LKASR: large kernel attention for lightweight image super-resolution, Knowl. Based Syst. 252 (2022) 109376

work page 2022
[19]

Y . Wang, Y . Li, G. Wang, X. Liu, Multi-scale attention network for single image super-resolution, in: Proc. IEEE Conf. Comput. Vis. Pattern Recognit. Workshops, 2024, pp. 5950–5960

work page 2024
[20]

X. Ding, X. Zhang, J. Han, G. Ding, Scaling up your kernels to 31×31: Revisiting large kernel design in cnns, in: Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2022, pp. 11953–11965

work page 2022
[21]

W. Yu, M. Luo, P. Zhou, C. Si, Y . Zhou, X. Wang, J. Feng, S. Yan, Metaformer is actually what you need for vision, in: Proc. IEEE Conf. Comput. Vis. Pattern Recognit., IEEE, 2022, pp. 10809–10819

work page 2022
[22]

J. Kim, J. K. Lee, K. M. Lee, Accurate image super-resolution using very deep convolutional networks, in: Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2016, pp. 1646–1654

work page 2016
[23]

Zamfir, Z

E. Zamfir, Z. Wu, N. Mehta, Y . Zhang, R. Timofte, See more details: Efficient image super-resolution by experts mining, in: Proc. Int. Conf. Mach. Learn., 2024

work page 2024
[24]

Zhang, H

X. Zhang, H. Zeng, S. Guo, L. Zhang, Efficient long-range attention network for image super-resolution, in: Proc. Eur. Conf. Comput. Vis., V ol. 13677, 2022, pp. 649–667

work page 2022
[25]

Zhang, Y

X. Zhang, Y . Zhang, F. Yu, Hit-sr: Hierarchical transformer for efficient image super-resolution, in: Proc. Eur. Conf. Comput. Vis., V ol. 15098, 2024, pp. 483–500

work page 2024
[26]

A. Gu, T. Dao, Mamba: Linear-time sequence modeling with selective state spaces, CoRR abs/2312.00752 (2023)

work page internal anchor Pith review Pith/arXiv arXiv 2023
[27]

M. Guo, C. Lu, Z. Liu, M. Cheng, S. Hu, Visual attention network, Comput. Vis. Media 9 (4) (2023) 733–752

work page 2023
[28]

S. Liu, T. Chen, X. Chen, X. Chen, Q. Xiao, B. Wu, T. K ¨arkk¨ainen, M. Pechenizkiy, D. C. Mocanu, Z. Wang, More convnets in the 2020s: Scaling up kernels beyond 51x51 using sparsity, in: Proc. Int. Conf. Learn. Represent., 2023

work page 2023
[29]

W. Yu, P. Zhou, S. Yan, X. Wang, Inceptionnext: When inception meets convnext, in: Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2024, pp. 5672–5683

work page 2024
[30]

X. Ding, Y . Zhang, Y . Ge, S. Zhao, L. Song, X. Yue, Y . Shan, Unireplknet: A universal perception large-kernel convnet for audio, video, point cloud, time-ignoreseries and image recognition, in: Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2024, pp. 5513–5524

work page 2024
[31]

G. Wu, J. Jiang, J. Jiang, X. Liu, Transforming image super-resolution: A convformer-based efficient approach, IEEE Trans. Image Process. 33 (2024) 6071–6082

work page 2024
[32]

Z. Liu, H. Mao, C. Wu, C. Feichtenhofer, T. Darrell, S. Xie, A convnet for the 2020s, in: Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2022, pp. 11966–11976

work page 2022
[33]

M. Tan, Q. V . Le, Mixconv: Mixed depthwise convolutional kernels, in: Proc. Brit. Mach. Vis. Conf., BMV A Press, 2019, p. 74

work page 2019
[34]

C. Dong, C. C. Loy, K. He, X. Tang, Learning a deep convolutional network for image super-resolution, in: Proc. Eur. Conf. Comput. Vis., V ol. 8692, 2014, pp. 184–199

work page 2014
[35]

L. Sun, J. Pan, J. Tang, Shufflemixer: An efficient convnet for image super-resolution, in: Proc. Adv. Neural Inf. Process. Syst., 2022

work page 2022
[36]

Behjati, P

P. Behjati, P. Rodr ´ıguez, C. Fern´andez, I. Hupont, A. Mehri, J. Gonz`alez, Single image super-resolution based on directional variance attention network, Pattern Recognit. 133 (2023) 108997

work page 2023
[37]

H. Wang, X. Chen, B. Ni, Y . Liu, J. Liu, Omni aggregation networks for lightweight image super-resolution, in: Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2023, pp. 22378–22387

work page 2023
[38]

A. Li, L. Zhang, Y . Liu, C. Zhu, Exploring frequency-inspired opti- mization in transformer for efficient single image super-resolution, IEEE Trans. Pattern Anal. Mach. Intell. 47 (4) (2025) 3141–3158

work page 2025
[39]

Timofte, E

R. Timofte, E. Agustsson, L. V . Gool, M. Yang, NTIRE 2017 challenge on single image super-resolution: Methods and results, in: Proc. IEEE Conf. Comput. Vis. Pattern Recognit. Workshops, 2017, pp. 1110–1121

work page 2017
[40]

B. Lim, S. Son, H. Kim, S. Nah, K. M. Lee, Enhanced deep residual net- works for single image super-resolution, in: Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2017, pp. 1132–1140

work page 2017
[41]

L. Sun, J. Dong, J. Tang, J. Pan, Spatially-adaptive feature modulation for efficient image super-resolution, in: Proc. IEEE Int. Conf. Comput. Vis., 2023, pp. 13144–13153

work page 2023
[42]

S. Li, Z. Wang, Z. Liu, C. Tan, H. Lin, D. Wu, Z. Chen, J. Zheng, S. Z. Li, Moganet: Multi-order gated aggregation network, in: Proc. Int. Conf. Learn. Represent., 2024

work page 2024
[43]

Y . Wang, T. Zhang, Osffnet: Omni-stage feature fusion network for lightweight image super-resolution, in: Proc. AAAI Conf. Artif. Intell., 2024, pp. 5660–5668

work page 2024
[44]

F. Li, R. Cong, J. Wu, H. Bai, M. Wang, Y . Zhao, Srconvnet: A transformer-style convnet for lightweight image super-resolution, Int. J. Comput. Vis. 133 (1) (2025) 173–189

work page 2025
[45]

W. Luo, Y . Li, R. Urtasun, R. S. Zemel, Understanding the effective receptive field in deep convolutional neural networks, in: Adv. Neural Inform. Process. Syst., 2016, pp. 4898–4906

work page 2016
[46]

Y . Blau, T. Michaeli, The perception-distortion tradeoff, in: Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2018, pp. 6228–6237

work page 2018
[47]

Zheng, L

M. Zheng, L. Sun, J. Dong, J. Pan, Smfanet: A lightweight self- modulation feature aggregation network for efficient image super- resolution, in: Proc. Eur. Conf. Comput. Vis., V ol. 15108, 2024, pp. 359–375

work page 2024
[48]

X. Wang, L. Xie, C. Dong, Y . Shan, Real-esrgan: Training real-world blind super-resolution with pure synthetic data, in: Proc. IEEE Int. Conf. Comput. Vis. Workshops, 2021, pp. 1905–1914

work page 2021
[49]

Vaswani, N

A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, I. Polosukhin, Attention is all you need, in: in Proc. Adv. Neural Inf. Process. Syst., 2017, pp. 5998–6008

work page 2017
[50]

W. Wang, E. Xie, X. Li, D. Fan, K. Song, D. Liang, T. Lu, P. Luo, L. Shao, PVT v2: Improved baselines with pyramid vision transformer, Comput. Vis. Media 8 (3) (2022) 415–424

work page 2022

[1] [1]

S. Liu, W. Li, D. He, G. Wang, Y . Huang, Ssefusion: Salient semantic enhancement for multimodal medical image fusion with mamba and dy- namic spiking neural networks, Information Fusion 119 (2025) 103031

work page 2025

[2] [2]

J. Qu, D. Huang, Y . Shi, J. Liu, W. Tang, Entropy-aware dynamic path selection network for multi-modality medical image fusion, Information Fusion 123 (2025) 103312

work page 2025

[3] [3]

D. K. Jain, X. Zhao, C. Gan, P. K. Shukla, A. Jain, S. Sharma, Fusion- driven deep feature network for enhanced object detection and tracking in video surveillance systems, Information Fusion 109 (2024) 102429

work page 2024

[4] [4]

Zhang, T

W. Zhang, T. Li, Y . Zhang, G. Pei, X. Jiang, Y . Yao, Ltformer: A light-weight transformer-based self-supervised matching network for heterogeneous remote sensing images, Information Fusion 109 (2024) 102425

work page 2024

[5] [5]

J. Liu, R. Xu, Y . Duan, T. Guo, G. Shi, F. Luo, Mdgf-cd: Land-cover change detection with multi-level diffformer feature grouping fusion for vhr remote sensing images, Information Fusion 120 (2025) 103110

work page 2025

[6] [6]

W. Lu, J. Wang, X. Jin, X. Jiang, H. Zhao, Facemug: A multimodal generative and fusion framework for local facial editing, IEEE Trans. Vis. Comput. Gr. (2024) 1–15

work page 2024

[7] [7]

W. Lu, J. Wang, T. Wang, K. Zhang, X. Jiang, H. Zhao, Visual style prompt learning using diffusion models for blind face restoration, Pattern Recognit. 161 (2025) 111312

work page 2025

[8] [8]

Y . Wang, T. Su, Y . Li, J. Cao, G. Wang, X. Liu, Ddistill-sr: Reparameter- ized dynamic distillation network for lightweight image super-resolution, IEEE Trans. Multim. 25 (2023) 7222–7234

work page 2023

[9] [9]

B. Lim, S. Son, H. Kim, S. Nah, K. M. Lee, Enhanced deep residual net- works for single image super-resolution, in: Proc. IEEE Conf. Comput. Vis. Pattern Recognit. Workshops, 2017, pp. 1132–1140

work page 2017

[10] [10]

Z. Hui, X. Gao, Y . Yang, X. Wang, Lightweight image super-resolution with information multi-distillation network, in: ACM Int. Conf. Multi- media, 2019, pp. 2024–2032

work page 2019

[11] [11]

Liang, J

J. Liang, J. Cao, G. Sun, K. Zhang, L. V . Gool, R. Timofte, Swinir: Image restoration using swin transformer, in: Proc. IEEE Int. Conf. Comput. Vis. Workshops, 2021, pp. 1833–1844

work page 2021

[12] [12]

Z. Chen, Y . Zhang, J. Gu, L. Kong, X. Yang, F. Yu, Dual aggrega- tion transformer for image super-resolution, in: Proc. IEEE Int. Conf. Comput. Vis., 2023, pp. 12278–12287

work page 2023

[13] [13]

H. Choi, J. Lee, J. Yang, N-gram in swin transformers for efficient lightweight image super-resolution, in: Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2023, pp. 2071–2081

work page 2023

[14] [14]

Y . Zhou, Z. Li, C. Guo, S. Bai, M. Cheng, Q. Hou, Srformer: Permuted self-attention for single image super-resolution, in: Proc. IEEE Int. Conf. Comput. Vis., 2023, pp. 12734–12745

work page 2023

[15] [15]

A. Gu, T. Dao, Mamba: Linear-time sequence modeling with selective state spaces, in: First Conference on Language Modeling, 2024

work page 2024

[16] [16]

H. Guo, J. Li, T. Dai, Z. Ouyang, X. Ren, S. Xia, Mambair: A simple baseline for image restoration with state-space model, in: Proc. Eur. Conf. Comput. Vis., V ol. 15076, 2024, pp. 222–241

work page 2024

[17] [17]

B. Li, H. Zhao, W. Wang, P. Hu, Y . Gou, X. Peng, Mair: A locality- and continuity-preserving mamba for image restoration, in: Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2025

work page 2025

[18] [18]

H. Feng, L. Wang, Y . Li, A. Du, LKASR: large kernel attention for lightweight image super-resolution, Knowl. Based Syst. 252 (2022) 109376

work page 2022

[19] [19]

Y . Wang, Y . Li, G. Wang, X. Liu, Multi-scale attention network for single image super-resolution, in: Proc. IEEE Conf. Comput. Vis. Pattern Recognit. Workshops, 2024, pp. 5950–5960

work page 2024

[20] [20]

X. Ding, X. Zhang, J. Han, G. Ding, Scaling up your kernels to 31×31: Revisiting large kernel design in cnns, in: Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2022, pp. 11953–11965

work page 2022

[21] [21]

W. Yu, M. Luo, P. Zhou, C. Si, Y . Zhou, X. Wang, J. Feng, S. Yan, Metaformer is actually what you need for vision, in: Proc. IEEE Conf. Comput. Vis. Pattern Recognit., IEEE, 2022, pp. 10809–10819

work page 2022

[22] [22]

J. Kim, J. K. Lee, K. M. Lee, Accurate image super-resolution using very deep convolutional networks, in: Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2016, pp. 1646–1654

work page 2016

[23] [23]

Zamfir, Z

E. Zamfir, Z. Wu, N. Mehta, Y . Zhang, R. Timofte, See more details: Efficient image super-resolution by experts mining, in: Proc. Int. Conf. Mach. Learn., 2024

work page 2024

[24] [24]

Zhang, H

X. Zhang, H. Zeng, S. Guo, L. Zhang, Efficient long-range attention network for image super-resolution, in: Proc. Eur. Conf. Comput. Vis., V ol. 13677, 2022, pp. 649–667

work page 2022

[25] [25]

Zhang, Y

X. Zhang, Y . Zhang, F. Yu, Hit-sr: Hierarchical transformer for efficient image super-resolution, in: Proc. Eur. Conf. Comput. Vis., V ol. 15098, 2024, pp. 483–500

work page 2024

[26] [26]

A. Gu, T. Dao, Mamba: Linear-time sequence modeling with selective state spaces, CoRR abs/2312.00752 (2023)

work page internal anchor Pith review Pith/arXiv arXiv 2023

[27] [27]

M. Guo, C. Lu, Z. Liu, M. Cheng, S. Hu, Visual attention network, Comput. Vis. Media 9 (4) (2023) 733–752

work page 2023

[28] [28]

S. Liu, T. Chen, X. Chen, X. Chen, Q. Xiao, B. Wu, T. K ¨arkk¨ainen, M. Pechenizkiy, D. C. Mocanu, Z. Wang, More convnets in the 2020s: Scaling up kernels beyond 51x51 using sparsity, in: Proc. Int. Conf. Learn. Represent., 2023

work page 2023

[29] [29]

W. Yu, P. Zhou, S. Yan, X. Wang, Inceptionnext: When inception meets convnext, in: Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2024, pp. 5672–5683

work page 2024

[30] [30]

X. Ding, Y . Zhang, Y . Ge, S. Zhao, L. Song, X. Yue, Y . Shan, Unireplknet: A universal perception large-kernel convnet for audio, video, point cloud, time-ignoreseries and image recognition, in: Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2024, pp. 5513–5524

work page 2024

[31] [31]

G. Wu, J. Jiang, J. Jiang, X. Liu, Transforming image super-resolution: A convformer-based efficient approach, IEEE Trans. Image Process. 33 (2024) 6071–6082

work page 2024

[32] [32]

Z. Liu, H. Mao, C. Wu, C. Feichtenhofer, T. Darrell, S. Xie, A convnet for the 2020s, in: Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2022, pp. 11966–11976

work page 2022

[33] [33]

M. Tan, Q. V . Le, Mixconv: Mixed depthwise convolutional kernels, in: Proc. Brit. Mach. Vis. Conf., BMV A Press, 2019, p. 74

work page 2019

[34] [34]

C. Dong, C. C. Loy, K. He, X. Tang, Learning a deep convolutional network for image super-resolution, in: Proc. Eur. Conf. Comput. Vis., V ol. 8692, 2014, pp. 184–199

work page 2014

[35] [35]

L. Sun, J. Pan, J. Tang, Shufflemixer: An efficient convnet for image super-resolution, in: Proc. Adv. Neural Inf. Process. Syst., 2022

work page 2022

[36] [36]

Behjati, P

P. Behjati, P. Rodr ´ıguez, C. Fern´andez, I. Hupont, A. Mehri, J. Gonz`alez, Single image super-resolution based on directional variance attention network, Pattern Recognit. 133 (2023) 108997

work page 2023

[37] [37]

H. Wang, X. Chen, B. Ni, Y . Liu, J. Liu, Omni aggregation networks for lightweight image super-resolution, in: Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2023, pp. 22378–22387

work page 2023

[38] [38]

A. Li, L. Zhang, Y . Liu, C. Zhu, Exploring frequency-inspired opti- mization in transformer for efficient single image super-resolution, IEEE Trans. Pattern Anal. Mach. Intell. 47 (4) (2025) 3141–3158

work page 2025

[39] [39]

Timofte, E

R. Timofte, E. Agustsson, L. V . Gool, M. Yang, NTIRE 2017 challenge on single image super-resolution: Methods and results, in: Proc. IEEE Conf. Comput. Vis. Pattern Recognit. Workshops, 2017, pp. 1110–1121

work page 2017

[40] [40]

B. Lim, S. Son, H. Kim, S. Nah, K. M. Lee, Enhanced deep residual net- works for single image super-resolution, in: Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2017, pp. 1132–1140

work page 2017

[41] [41]

L. Sun, J. Dong, J. Tang, J. Pan, Spatially-adaptive feature modulation for efficient image super-resolution, in: Proc. IEEE Int. Conf. Comput. Vis., 2023, pp. 13144–13153

work page 2023

[42] [42]

S. Li, Z. Wang, Z. Liu, C. Tan, H. Lin, D. Wu, Z. Chen, J. Zheng, S. Z. Li, Moganet: Multi-order gated aggregation network, in: Proc. Int. Conf. Learn. Represent., 2024

work page 2024

[43] [43]

Y . Wang, T. Zhang, Osffnet: Omni-stage feature fusion network for lightweight image super-resolution, in: Proc. AAAI Conf. Artif. Intell., 2024, pp. 5660–5668

work page 2024

[44] [44]

F. Li, R. Cong, J. Wu, H. Bai, M. Wang, Y . Zhao, Srconvnet: A transformer-style convnet for lightweight image super-resolution, Int. J. Comput. Vis. 133 (1) (2025) 173–189

work page 2025

[45] [45]

W. Luo, Y . Li, R. Urtasun, R. S. Zemel, Understanding the effective receptive field in deep convolutional neural networks, in: Adv. Neural Inform. Process. Syst., 2016, pp. 4898–4906

work page 2016

[46] [46]

Y . Blau, T. Michaeli, The perception-distortion tradeoff, in: Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2018, pp. 6228–6237

work page 2018

[47] [47]

Zheng, L

M. Zheng, L. Sun, J. Dong, J. Pan, Smfanet: A lightweight self- modulation feature aggregation network for efficient image super- resolution, in: Proc. Eur. Conf. Comput. Vis., V ol. 15108, 2024, pp. 359–375

work page 2024

[48] [48]

X. Wang, L. Xie, C. Dong, Y . Shan, Real-esrgan: Training real-world blind super-resolution with pure synthetic data, in: Proc. IEEE Int. Conf. Comput. Vis. Workshops, 2021, pp. 1905–1914

work page 2021

[49] [49]

Vaswani, N

A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, I. Polosukhin, Attention is all you need, in: in Proc. Adv. Neural Inf. Process. Syst., 2017, pp. 5998–6008

work page 2017

[50] [50]

W. Wang, E. Xie, X. Li, D. Fan, K. Song, D. Liang, T. Lu, P. Luo, L. Shao, PVT v2: Improved baselines with pyramid vision transformer, Comput. Vis. Media 8 (3) (2022) 415–424

work page 2022