Selective Attention-Based Network for Robust Infrared Small Target Detection
Pith reviewed 2026-05-09 19:52 UTC · model grok-4.3
The pith
SANet improves infrared small target detection by fixing information bottlenecks and static skip connections in U-Net with dual-path semantic modules and selective attention fusion.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors claim that augmenting the U-Net framework with a Dual-path Semantic-aware Module (DSM) and a Selective Attention Fusion Module (SAFM) overcomes the information bottleneck in early convolutional stages and the lack of adaptability in static skip connections, enabling more robust discrimination between genuine infrared small targets and pseudo-target regions induced by complex backgrounds.
What carries the argument
Two modules carry it. The Dual-path Semantic-aware Module (DSM) pairs standard convolutions, which preserve local spatial detail, with pinwheel-shaped convolutions, which expand the receptive field along several directions, and then recalibrates the result with a Convolutional Block Attention Module (CBAM). The Selective Attention Fusion Module (SAFM) replaces static skip connections with spatially adaptive, learnable weighting for cross-scale feature fusion.
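As a concrete reading of the DSM idea, here is a minimal NumPy sketch: one "standard" separable 3x3 smoothing path plus directional 1-D paths standing in for two of the four pinwheel branches. The kernels, the averaging fusion, and the omission of CBAM are all illustrative assumptions, not the paper's learned design:

```python
import numpy as np

def conv1d_dir(x, k, axis):
    """Naive zero-padded 'same' correlation of a 2-D map x with a 1-D kernel k."""
    pad = len(k) // 2
    padded = np.pad(x, [(pad, pad) if a == axis else (0, 0) for a in range(2)])
    out = np.zeros(x.shape, dtype=float)
    for i, w in enumerate(k):
        sl = [slice(None), slice(None)]
        sl[axis] = slice(i, i + x.shape[axis])
        out += w * padded[tuple(sl)]
    return out

def dual_path(x):
    """Toy DSM: a standard separable 3x3 smoothing path plus directional
    1-D paths (two of the four pinwheel directions, for brevity), fused by
    simple averaging instead of learned attention."""
    k3 = np.ones(3) / 3.0
    local = conv1d_dir(conv1d_dir(x, k3, 0), k3, 1)      # separable 3x3 box
    directional = [
        conv1d_dir(x, np.array([0.25, 0.5, 0.25]), 0),   # vertical branch
        conv1d_dir(x, np.array([0.25, 0.5, 0.25]), 1),   # horizontal branch
    ]
    return (local + sum(directional)) / (1 + len(directional))
```

Because every kernel sums to one, a uniform input is preserved away from the borders; the real module would additionally gate channels and spatial positions via CBAM.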
If this is right
- Early convolutional layers retain more fine-grained spatial details of sub-pixel targets instead of losing them to bottlenecks.
- Skip connections become dynamic and context-sensitive, reducing confusion between genuine targets and structurally similar background elements.
- The network achieves higher robustness in low signal-to-clutter ratio conditions without requiring changes to the overall U-Net encoder-decoder structure.
- Feature fusion across scales adapts per location, improving precision in highly cluttered infrared scenes.
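The dynamic, per-location fusion in these bullets can be made concrete with a per-pixel gate. The gating form below, a sigmoid of a linear mix of encoder and decoder features with hypothetical parameters `w_gate` and `b_gate`, is an assumption for illustration; the paper does not disclose SAFM's exact equations:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def selective_fuse(enc, dec, w_gate, b_gate):
    """Toy SAFM-style fusion: a per-pixel gate decides, at every spatial
    location, how much of the encoder skip feature versus the decoder
    feature to keep. w_gate/b_gate stand in for learned parameters."""
    # the gate sees both inputs, so the mixing adapts per location
    gate = sigmoid(w_gate[0] * enc + w_gate[1] * dec + b_gate)
    return gate * enc + (1.0 - gate) * dec
```

A static skip connection would instead add or concatenate `enc` and `dec` with the same fixed weighting everywhere; here the mix varies with local content.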
Where Pith is reading between the lines
- The selective fusion idea could transfer to other small-object tasks in visible or multispectral imagery where scale variance and clutter create similar problems.
- Ablating the pinwheel convolution path separately from the standard path would isolate how much the directional receptive fields contribute versus the attention recalibration.
- The same modules might be tested on datasets with targets of varying aspect ratios to check if direction sensitivity provides consistent benefits.
Load-bearing premise
The DSM and SAFM modules will deliver superior discrimination between real targets and background clutter on diverse real-world infrared data without causing overfitting or compute costs that erase the gains.
What would settle it
A head-to-head test on standard IRSTD benchmarks: if SANet shows no gain in detection probability (Pd) and no reduction in false alarm rate (Fa) over attention-equipped U-Net baselines, the modules do not solve the stated bottlenecks.
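Such a test hinges on the standard IRSTD metrics. A toy version, using common definitions (assumed here, not taken from the paper) of detection probability and false alarm rate:

```python
import numpy as np

def pd_fa(pred_mask, gt_targets, img_shape):
    """Toy IRSTD metrics: Pd = fraction of ground-truth targets hit by at
    least one predicted pixel; Fa = predicted pixels lying on no target,
    normalized by the image size. Targets are given as pixel coordinates."""
    hit = 0
    target_pix = np.zeros(img_shape, dtype=bool)
    for (r, c) in gt_targets:
        target_pix[r, c] = True
        if pred_mask[r, c]:
            hit += 1
    pd = hit / len(gt_targets) if gt_targets else 0.0
    fa = np.logical_and(pred_mask, ~target_pix).sum() / pred_mask.size
    return pd, fa
```

Real evaluations treat targets as connected components rather than single pixels, but the trade-off being measured is the same: Pd up, Fa down.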
Original abstract
Infrared small target detection (IRSTD) plays a pivotal role in a broad spectrum of mission-critical applications, including maritime surveillance, military search and rescue, early warning systems, and precision-guided strikes, all of which demand the precise identification of dim, sub-pixel targets amid highly cluttered infrared backgrounds. Despite significant progress driven by deep learning methods, fundamental challenges persist: infrared small targets occupy extremely limited spatial extents (often only a few pixels), exhibit low signal-to-clutter ratios, and are easily confused with structurally complex backgrounds that frequently induce false alarms. Existing encoder-decoder architectures suffer from two key limitations: an information bottleneck in early convolutional stages that undermines fine-grained target perception, and static skip connections that lack the dynamic adaptability required to discriminate between genuine targets and pseudo-target regions. To address these challenges, we propose SANet, a Selective Attention-based Network built upon the classical U-Net framework and augmented with two novel components: (1) a Dual-path Semantic-aware Module (DSM) that integrates standard convolutions for local spatial detail preservation with pinwheel-shaped convolutions for expanded, direction-sensitive receptive fields, followed by a Convolutional Block Attention Module (CBAM) for fine-grained spatial-channel feature recalibration; and (2) a Selective Attention Fusion Module (SAFM) that replaces conventional static skip connections with a spatially adaptive, learnable weighting mechanism to perform context-aware, cross-scale feature fusion.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes SANet, a U-Net-based encoder-decoder architecture for infrared small target detection (IRSTD). It introduces two modules to address limitations in prior work: the Dual-path Semantic-aware Module (DSM), which combines standard convolutions with pinwheel-shaped convolutions and CBAM for local detail preservation and direction-sensitive receptive fields, and the Selective Attention Fusion Module (SAFM), which replaces static skip connections with a learnable, spatially adaptive weighting mechanism for context-aware cross-scale fusion.
Significance. If validated, the approach could improve robustness in detecting dim sub-pixel targets amid clutter by mitigating early-stage information loss and enabling dynamic feature selection, with relevance to surveillance and defense applications. The architectural focus on direction-sensitive fields and adaptive fusion targets specific IRSTD challenges, but the absence of supporting experiments limits assessment of net gains over baselines.
major comments (2)
- [Abstract] Abstract and method sections: the central claim that DSM and SAFM resolve the information bottleneck and static skip-connection limitations rests entirely on architectural description without any quantitative results (e.g., Pd, Fa, mIoU), ablation studies, or comparisons on IRSTD datasets, so the performance improvements cannot be evaluated.
- [Method] No equations, placement diagrams, or complexity analysis are supplied for DSM (pinwheel conv + CBAM) or SAFM, preventing verification that the added direction-sensitive fields and learnable weighting deliver discrimination gains rather than neutral or overfit behavior on real infrared data.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. The comments highlight important areas for strengthening the presentation of our contributions. We address each major comment below and have revised the manuscript to incorporate the requested details and supporting evidence.
Point-by-point responses
Referee: [Abstract] Abstract and method sections: the central claim that DSM and SAFM resolve the information bottleneck and static skip-connection limitations rests entirely on architectural description without any quantitative results (e.g., Pd, Fa, mIoU), ablation studies, or comparisons on IRSTD datasets, so the performance improvements cannot be evaluated.
Authors: We agree that the original abstract and method sections relied primarily on architectural motivation without embedding quantitative support. In the revised manuscript, we have updated the abstract to include a concise summary of empirical results on standard IRSTD datasets (e.g., gains in Pd and reductions in Fa relative to baselines). We have also added explicit cross-references in the method section to the Experiments section, which now details ablation studies, mIoU metrics, and comparisons against prior IRSTD methods. These changes allow readers to directly assess the claimed improvements. revision: yes
Referee: [Method] No equations, placement diagrams, or complexity analysis are supplied for DSM (pinwheel conv + CBAM) or SAFM, preventing verification that the added direction-sensitive fields and learnable weighting deliver discrimination gains rather than neutral or overfit behavior on real infrared data.
Authors: We acknowledge the need for these technical specifications to enable verification. The revised manuscript now includes: formal equations for the pinwheel convolution operation, the dual-path processing and CBAM recalibration within DSM, and the spatially adaptive weighting in SAFM; detailed placement diagrams of the overall U-Net architecture with module locations and modified skip connections; and a complexity analysis comparing parameter counts and FLOPs against the baseline U-Net and competing approaches. These additions, together with the expanded experimental results on real infrared data (including cross-dataset tests), support that the modules provide meaningful discrimination gains. revision: yes
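The promised complexity analysis reduces to closed-form counts per layer. A sketch with made-up channel sizes (not from the paper), comparing a standard 3x3 convolution against four thin directional branches of the kind a pinwheel module might use:

```python
def conv_params(c_in, c_out, k_h, k_w, bias=True):
    """Parameter count of one 2-D convolution layer."""
    return c_in * c_out * k_h * k_w + (c_out if bias else 0)

def conv_flops(c_in, c_out, k_h, k_w, h, w):
    """Multiply-accumulates for one conv producing an h x w output map."""
    return c_in * c_out * k_h * k_w * h * w

# Illustrative comparison (channel sizes are hypothetical):
# one 3x3 conv vs. four 1x3 directional branches with 8 output channels each.
std = conv_params(32, 32, 3, 3)
pinwheel = 4 * conv_params(32, 8, 1, 3)
```

Under these assumed sizes the directional decomposition is cheaper than the dense 3x3 layer, which is the kind of parameter/FLOP table the rebuttal promises against the baseline U-Net.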
Circularity Check
No circularity: architectural proposal is self-contained empirical design
full rationale
The paper introduces SANet as a U-Net variant augmented with two new modules (DSM and SAFM) whose descriptions consist of architectural choices (pinwheel convolutions, CBAM, learnable skip weighting) rather than any derivation chain, equations, or predictions. No self-citations, fitted parameters renamed as outputs, or ansatzes appear in the provided text; the central claims rest on the modules' intended behavior and will be assessed via future experiments on IR data. This is a standard non-circular empirical architecture paper.
Axiom & Free-Parameter Ledger
axioms (2)
- Domain assumption: convolutional layers with attention can preserve fine-grained spatial details better than standard encoder stages.
- Domain assumption: learnable weighting can outperform static skip connections for cross-scale fusion in this domain.