pith. machine review for the scientific record.

arxiv: 2605.05804 · v1 · submitted 2026-05-07 · 💻 cs.CV

Recognition: unknown

Na-IRSTD: Enhancing Infrared Small Target Detection via Native-Resolution Feature Selection and Fusion

Authors on Pith: no claims yet

Pith reviewed 2026-05-08 14:50 UTC · model grok-4.3

classification 💻 cs.CV
keywords infrared small target detection · native-resolution features · token reduction · feature selection · feature fusion · small object detection · computer vision · deep learning

The pith

A framework using full native resolution and selective token processing detects small dim targets in infrared images more accurately than downsampling methods.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tries to establish that conventional downsampling in infrared small target detection discards the very details needed to find dim objects against clutter, and that keeping native resolution while smartly selecting key patches solves this without exploding computation. It introduces a token reduction and selection step that focuses processing on likely target areas to retain low-level cues. If correct, this would mean better localization of tiny signals in noisy scenes. Readers would care because many real monitoring and tracking tasks depend on not missing those faint targets.

Core claim

The Na-IRSTD framework extracts and fuses features at the original image resolution, preserving the subtle target cues that downsampling loses. An accompanying token reduction and selection strategy identifies target patches with high accuracy and confidence, enhancing low-level details while keeping computation manageable. Together, these enable state-of-the-art results on four standard benchmarks.

What carries the argument

Native-resolution feature extraction and fusion paired with a token reduction and selection strategy that prioritizes target patches.
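The mechanism can be sketched in a few lines. The paper trains a scoring network to rank patches; here a simple local-contrast proxy stands in for it, and the function name, 8-pixel patch grid, and toy frame are all illustrative assumptions rather than the paper's implementation:

```python
def select_target_patches(img, patch=8, k=4):
    """Hedged sketch of the token reduction/selection idea: score every
    native-resolution patch and keep only the top-K most target-like
    tokens for full processing. A (max - mean) local-contrast proxy
    stands in for the paper's learned scoring network."""
    gh, gw = len(img) // patch, len(img[0]) // patch
    scores = {}
    for r in range(gh):
        for c in range(gw):
            vals = [img[r * patch + i][c * patch + j]
                    for i in range(patch) for j in range(patch)]
            # A small bright target yields a large max-minus-mean contrast.
            scores[(r, c)] = max(vals) - sum(vals) / len(vals)
    return sorted(scores, key=scores.get, reverse=True)[:k]

# Toy 32x32 frame: flat background with one dim two-pixel target.
frame = [[0.0] * 32 for _ in range(32)]
frame[5][21] = frame[5][22] = 0.3          # falls in patch (0, 2)
selected = select_target_patches(frame, patch=8, k=4)
print((0, 2) in selected)                  # True: the target patch is kept
```

Only the K selected patches would then receive native-resolution treatment, which is how the framework avoids dense processing of every token.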

If this is right

  • Small targets keep their subtle low-level cues because features stay at native resolution instead of being downsampled.
  • Computational load stays practical because only selected patches receive full native-resolution treatment rather than dense processing of every token.
  • Detection accuracy rises on benchmarks that feature complex background clutter.
  • The same selection mechanism proves robust when tested across multiple public infrared datasets.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar native-resolution selection could help small-lesion detection in medical imaging or faint-object search in astronomy where detail loss from downsampling is also costly.
  • The work suggests that hybrid full-detail plus selective architectures may become preferable to uniform downsampling whenever the signal-to-noise ratio is low and selection can be made reliable.
  • Integrating the token selection with hardware-aware constraints could open deployment paths for on-device infrared monitoring systems.

Load-bearing premise

The token reduction and selection strategy reliably identifies target patches at high confidence without missing dim targets or adding bias when backgrounds are cluttered.
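One way to see why this premise is load-bearing: any selector has to beat the uniform-random baseline, which on an N-patch grid keeping K tokens covers a single-patch target with probability K/N. A minimal sketch (the grid sizes and K values are illustrative, not the paper's):

```python
def chance_coverage(frame_side, patch, k):
    """Probability that keeping k patches uniformly at random retains
    the one patch containing a single-patch target: k / N for an
    N-patch grid. Illustrative numbers only."""
    n = (frame_side // patch) ** 2
    return k / n

# On a 64x64 frame with 8x8 patches (64 tokens), random selection keeps
# a dim target only k/64 of the time; a learned selector must beat this
# decisively for the native-resolution gains to be attributable to it.
for k in (4, 8, 16):
    print(k, chance_coverage(64, 8, k))    # 0.0625, 0.125, 0.25
```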

What would settle it

Running the model on a new infrared dataset containing many extremely faint targets in heavy clutter and finding that it misses more targets than a comparable downsampling baseline would show the native-resolution claim does not hold.

Figures

Figures reproduced from arXiv: 2605.05804 by Chi Zhang, Haojuan Yuan, Mingjin Zhang, Qian Xu, Qiming Zhang, Xi Li.

Figure 1
Figure 1: Visualization of feature maps across different downsampling stages. As the resolution decreases, small targets become indistinct and eventually vanish. In the native-resolution branch, the responses of small targets are progressively enhanced. By introducing this native-resolution branch, even extremely small targets can be preserved and successfully detected. view at source ↗
Figure 2
Figure 2: Overview of the proposed Na-IRSTD architecture. The upper pathway is the native-resolution branch and the lower pathway is the hierarchical backbone encoder. PDCE denotes the Patchwise Detail-Context Encoder, composed of PDE and GPM. PDE denotes Patchwise Detail Extraction, and GPM denotes Global Patch Mixer. For clarity, only one representative stage is expanded to illustrate the detailed operations, whil… view at source ↗
Figure 3
Figure 3: Statistical characteristics of IRSTD-Hard. (a) Target size cumulative distribution of different IRSTD datasets. (b) Scene distribution of IRSTD-Hard. view at source ↗
Figure 4
Figure 4: Visual results of different IRSTD methods. The boxes in red, yellow, and blue represent… view at source ↗
Figure 5
Figure 5: Visual comparison of feature maps in backbone encoder and native-resolution branch. view at source ↗
Figure 6
Figure 6: Visualization of patch-to-target relevance prediction. view at source ↗
Figure 7
Figure 7: Top-K patch coverage under two different training supervision strategies for the scoring network on NUDT-SIRST. Both curves are evaluated with the same coverage criterion at inference time. view at source ↗
Figure 8
Figure 8: Comparison of IoU and throughput on a single NVIDIA GeForce 4090. view at source ↗
read the original abstract

Infrared small target detection (IRSTD) faces the inherent challenge of precisely localizing dim targets amid complex background clutter. While progress has been made, existing methods usually follow conventional strategies to downsample features and discard small targets' details, resulting in suboptimal performance. In this paper, we present Na-IRSTD, a native-resolution feature extraction and fusion framework for IRSTD. This framework elegantly incorporates native-resolution features to preserve subtle target cues, overcoming the resolution limitations of existing infrared approaches and significantly improving the model's ability to localize small targets. We also introduce an effective token reduction and selection strategy, which selects target patches with high accuracy and confidence, boosting the low-level details of the feature while effectively reducing native-resolution patch tokens compared to dense processing, thereby avoiding imposing an unbearable computational burden. Extensive experiments demonstrate the robustness and effectiveness of our token reduction and selection strategy across multiple public datasets. Ultimately, our Na-IRSTD model achieves state-of-the-art performance on four benchmarks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper proposes Na-IRSTD, a native-resolution feature extraction and fusion framework for infrared small target detection (IRSTD). It preserves low-level target details by avoiding conventional downsampling and introduces a token reduction and selection strategy that purportedly identifies target patches with high accuracy and confidence while controlling computational cost. The authors report extensive experiments demonstrating the strategy's robustness and claim state-of-the-art performance on four public benchmarks.

Significance. If the token selection mechanism reliably retains dim (<3-pixel) targets in clutter without systematic false negatives or bias, the approach could meaningfully improve localization accuracy over downsampling-based baselines by retaining native-resolution cues. The multi-dataset SOTA claim, if substantiated with proper ablations, would represent a practical advance in IRSTD; however, the absence of direct validation of selection precision on low-SNR targets leaves the attribution of the performance gains uncertain.

major comments (1)
  1. [Method description and experiments] The central performance claim rests on the token reduction and selection strategy (abstract: 'selects target patches with high accuracy and confidence'). No quantitative evaluation of selection precision/recall against ground-truth target locations is provided, nor are there ablations replacing the selector with random or uniform sampling to isolate its contribution. If selection error exceeds a few percent on dim targets, the native-resolution advantage collapses and the reported SOTA gains cannot be attributed to the proposed mechanism.
minor comments (1)
  1. [Abstract] The abstract states 'extensive experiments' and 'SOTA performance' but supplies no numerical metrics, dataset names, or baseline comparisons; including at least the key mIoU or Pd/Fa numbers would strengthen the summary.
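The validation the major comment asks for could take roughly this shape: score the selected patches against a ground-truth mask, counting a selection as a true positive when its patch contains any target pixel. The function name, patch grid, and matching rule are assumptions for illustration, not the paper's protocol:

```python
def selection_precision_recall(selected, gt_mask, patch=8):
    """Precision/recall of a patch selector against a ground-truth mask.
    A selected patch counts as a true positive iff it contains at least
    one target pixel; this matching rule is an illustrative assumption."""
    gh, gw = len(gt_mask) // patch, len(gt_mask[0]) // patch
    has_target = {(r, c)
                  for r in range(gh) for c in range(gw)
                  if any(gt_mask[r * patch + i][c * patch + j]
                         for i in range(patch) for j in range(patch))}
    sel = set(selected)
    tp = len(sel & has_target)
    precision = tp / max(len(sel), 1)
    recall = tp / max(len(has_target), 1)
    return precision, recall

# One dim target in patch (0, 2); the selector kept (0, 2) plus a
# clutter patch (3, 3): precision 0.5, recall 1.0.
mask = [[0] * 32 for _ in range(32)]
mask[5][21] = 1
print(selection_precision_recall([(0, 2), (3, 3)], mask))   # (0.5, 1.0)
```

The same harness run with random or uniform patch sampling in place of the learned selector would give the ablation needed to isolate the selector's contribution.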

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the constructive feedback and the opportunity to improve the manuscript. We address the major comment on the evaluation of the token reduction and selection strategy below and will incorporate the suggested analyses in the revised version.

read point-by-point responses
  1. Referee: The central performance claim rests on the token reduction and selection strategy (abstract: 'selects target patches with high accuracy and confidence'). No quantitative evaluation of selection precision/recall against ground-truth target locations is provided, nor are there ablations replacing the selector with random or uniform sampling to isolate its contribution. If selection error exceeds a few percent on dim targets, the native-resolution advantage collapses and the reported SOTA gains cannot be attributed to the proposed mechanism.

    Authors: We agree that direct quantitative validation of the selector is necessary to fully attribute the reported gains. While the manuscript demonstrates overall robustness through multi-dataset experiments and qualitative results, explicit precision/recall metrics against ground-truth target locations (especially for dim, low-SNR targets) and controlled ablations against random/uniform sampling were not included. In the revision we will add a dedicated analysis section reporting selection precision and recall on the four benchmarks, with emphasis on targets smaller than 3 pixels, plus ablations that replace the proposed selector with random and uniform baselines while keeping all other components fixed. These additions will clarify the mechanism's contribution and address the attribution concern. revision: yes

Circularity Check

0 steps flagged

No circularity; empirical architecture with no derivations

full rationale

The paper presents Na-IRSTD as an empirical neural network framework for infrared small target detection. It introduces a native-resolution feature extraction and fusion approach plus a token reduction/selection strategy, validated through experiments on public datasets and SOTA claims on four benchmarks. No equations, first-principles derivations, or predictions appear that could reduce to inputs by construction. No self-definitional steps, fitted parameters renamed as predictions, or load-bearing self-citations are present in the provided text. The method is self-contained as an architectural proposal tested externally.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no explicit equations or implementation details, so free parameters, axioms, and invented entities cannot be enumerated beyond the generic assumption that standard deep-learning training hyperparameters exist.

pith-pipeline@v0.9.0 · 5481 in / 1076 out tokens · 34101 ms · 2026-05-08T14:50:56.297848+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

53 extracted references · 2 canonical work pages
