pith. machine review for the scientific record.

arxiv: 2604.17585 · v1 · submitted 2026-04-19 · 💻 cs.CV · cs.AI · cs.LG

DGSSM: Diffusion guided state-space models for multimodal salient object detection

Pith reviewed 2026-05-10 05:40 UTC · model grok-4.3

classification 💻 cs.CV · cs.AI · cs.LG
keywords salient object detection · diffusion models · state space models · Mamba · multimodal · RGB-D · RGB-T · boundary refinement

The pith

Diffusion-guided Mamba models treat multimodal salient object detection as iterative denoising to recover sharper object boundaries.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a framework that combines the structural priors of diffusion models with the efficient long-range reasoning of Mamba state-space models to detect salient objects in RGB, depth, and thermal images. It recasts detection as a progressive denoising process, incorporating multi-scale encoding, adaptive prompting, and iterative refinement to address the boundary inaccuracies common to purely convolutional, transformer, or Mamba approaches. Experiments across 13 benchmarks show consistent gains on multiple metrics alongside a compact model size. A sympathetic reader would care because precise boundary recovery in multimodal settings directly supports applications that need accurate object outlines without heavy computational cost.
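
To make the formulation concrete, here is a minimal sketch, in PyTorch-style Python, of what casting saliency detection as progressive denoising can look like. The loop structure, step count, and the `denoiser` interface are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch (not the authors' code): salient object detection as
# progressive denoising of a saliency map conditioned on multimodal features.
import torch

def progressive_denoise(denoiser, cond_feats, shape, num_steps=4):
    """Iteratively refine a noisy saliency map into a clean prediction.

    denoiser   -- assumed module: (noisy map, conditioning, step) -> cleaner map
    cond_feats -- fused multimodal encoder features (e.g. RGB + depth or thermal)
    shape      -- (B, 1, H, W) shape of the saliency map
    """
    s = torch.randn(shape)                  # start from pure noise
    for t in reversed(range(num_steps)):    # walk the schedule back to t = 0
        t_idx = torch.full((shape[0],), t)  # per-sample step index
        s = denoiser(s, cond_feats, t_idx)  # one conditioned denoising step
    return torch.sigmoid(s)                 # map logits to [0, 1] saliency
```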

Core claim

The authors formulate multimodal salient object detection as a progressive denoising process. They integrate diffusion structural priors with multi-scale state-space encoding, adaptive saliency prompting, and an iterative Mamba diffusion refinement mechanism, augmented by a boundary-aware refinement head and self-distillation, to achieve superior boundary accuracy and overall performance.

What carries the argument

The DGSSM framework, which integrates diffusion structural priors into Mamba-based state space modeling via progressive denoising, multi-scale encoding, and iterative refinement.

If this is right

  • Outperforms prior methods on RGB, RGB-D, and RGB-T benchmarks under standard evaluation metrics.
  • Maintains compact model size while delivering the performance gains.
  • The boundary-aware head and self-distillation improve spatial coherence and feature consistency.
  • The approach suggests diffusion-guided state space modeling as a generalizable paradigm for other multimodal dense prediction tasks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The denoising formulation could extend naturally to video sequences where temporal consistency might further stabilize boundaries across frames.
  • If the refinement mechanism proves robust, similar diffusion-Mamba hybrids might reduce reliance on large transformer backbones in other dense vision tasks.
  • Compact size combined with boundary gains could enable deployment on edge devices for real-time multimodal sensing.

Load-bearing premise

That the integration of diffusion priors with Mamba encoding and refinement steps improves boundary accuracy in multimodal settings without introducing new limitations such as instability or excessive compute.

What would settle it

A direct comparison on the 13 benchmarks showing no improvement in boundary-specific metrics like boundary F-measure or mean absolute error when the diffusion guidance and iterative Mamba refinement are removed.
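
Since the paper's exact metric definitions are not given in the excerpt, the following is a hedged sketch of how such an ablation could score boundary quality: mean absolute error plus a crude boundary F-measure computed over edge pixels of the binarized maps. The edge-extraction rule here is an assumption, not the benchmark protocol.

```python
import numpy as np

def mae(pred, gt):
    """Mean absolute error between saliency maps with values in [0, 1]."""
    return float(np.abs(pred - gt).mean())

def boundary_f_measure(pred, gt, thresh=0.5):
    """Crude boundary F-measure: F1 over edge pixels of the binarized maps.
    A stand-in for the paper's (unspecified) boundary metric."""
    def edges(m):
        b = m >= thresh
        e = np.zeros_like(b)
        e[1:, :] |= b[1:, :] != b[:-1, :]   # vertical neighbour differs
        e[:, 1:] |= b[:, 1:] != b[:, :-1]   # horizontal neighbour differs
        return e
    ep, eg = edges(pred), edges(gt)
    tp = np.logical_and(ep, eg).sum()
    precision = tp / max(ep.sum(), 1)
    recall = tp / max(eg.sum(), 1)
    return 2 * precision * recall / max(precision + recall, 1e-8)
```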

Figures

Figures reproduced from arXiv: 2604.17585 by Arijit Sur, Pinaki Mitra, Suklav Ghosh.

Figure 1
Figure 1: Accuracy-efficiency trade-off between FLOPs, performance (Fm), and parameters. Bubble area denotes model size. DGSSM achieves superior accuracy with lower computational cost. view at source ↗
Figure 2
Figure 2: Overall architecture of DGSSM. A diffusion structural prior guides a hierarchical state space encoder with adaptive saliency prompting and multi-scale selective scanning to capture global context and structural cues. The decoder produces a coarse saliency map. It is further refined by a boundary-aware head and an iterative Mamba diffusion refinement module, yielding accurate and boundary-preserving salien… view at source ↗
Figure 3
Figure 3: Qualitative comparison of our DGSSM against SOTA methods. view at source ↗
read the original abstract

Salient object detection (SOD) requires modeling both long-range contextual dependencies and fine-grained structural details, which remains challenging for convolutional, transformer-based, and Mamba-based state space models. While recent Mamba-based state space approaches enable efficient global reasoning, they often struggle to recover precise object boundaries. In contrast, diffusion models capture strong structural priors through iterative denoising, but their use in discriminative dense prediction is still limited due to computational cost and integration challenges. In this work, we propose DGSSM, a diffusion-guided state space (Mamba) framework that formulates multimodal salient object detection as a progressive denoising process. The framework integrates diffusion structural priors with multi-scale state space encoding, adaptive saliency prompting, and an iterative Mamba diffusion refinement mechanism to improve boundary accuracy. A boundary-aware refinement head and self-distillation strategy further enhance spatial coherence and feature consistency. Extensive experiments on 13 public benchmarks across RGB, RGB-D, and RGB-T settings demonstrate that DGSSM consistently outperforms state-of-the-art methods across multiple evaluation metrics while maintaining a compact model size. These results suggest that diffusion-guided state space modeling is an effective and generalizable paradigm for multimodal dense prediction tasks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

0 major / 2 minor

Summary. The manuscript proposes DGSSM, a diffusion-guided state-space (Mamba) framework for multimodal salient object detection (SOD). It formulates the task as a progressive denoising process that integrates diffusion structural priors with multi-scale state-space encoding, adaptive saliency prompting, an iterative Mamba diffusion refinement mechanism, a boundary-aware refinement head, and self-distillation. The central empirical claim is that this architecture consistently outperforms prior state-of-the-art methods across multiple metrics on 13 public benchmarks spanning RGB, RGB-D, and RGB-T settings while preserving a compact model size.

Significance. If the reported gains in boundary accuracy and cross-modal generalization hold under rigorous scrutiny, the work would demonstrate a practical and efficient way to inject generative structural priors into discriminative state-space backbones for dense prediction. The emphasis on compactness alongside performance improvements would be a notable strength for deployment-oriented multimodal vision tasks.

minor comments (2)
  1. The abstract and introduction refer to '13 public benchmarks' and 'multiple evaluation metrics' without enumerating the exact datasets or metrics in the provided summary; a dedicated table or section listing them would improve reproducibility.
  2. The description of the iterative Mamba diffusion refinement mechanism would benefit from an explicit algorithmic outline or pseudocode to clarify the number of refinement steps and how the diffusion schedule interacts with the state-space layers.
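
For illustration of what comment 2 is asking for, a refinement loop of the kind the paper describes might be outlined as below. The step count, the linear noise schedule, and the `mamba_refiner` interface are assumptions; the paper does not specify them in the material excerpted here.

```python
# Hypothetical outline (not taken from the paper) of an iterative
# Mamba-diffusion refinement loop over a coarse decoder output.
import torch

def iterative_refine(coarse_map, features, mamba_refiner, num_steps=3):
    """coarse_map: (B, 1, H, W) decoder output; features: encoder conditioning."""
    schedule = torch.linspace(0.5, 0.05, num_steps)  # assumed decaying noise levels
    s = coarse_map
    for t in range(num_steps):
        noise = torch.randn_like(s) * schedule[t]    # perturb the current estimate
        s = mamba_refiner(s + noise, features, t)    # one state-space denoising pass
    return s
```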

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive assessment of our work and the recommendation of minor revision. The provided summary accurately captures the core contributions of DGSSM, including its formulation as a progressive denoising process and the empirical results across 13 benchmarks. As the report contains no specific major comments, we have no points requiring rebuttal or revision at this time.

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper is an empirical architecture proposal for multimodal SOD that combines diffusion priors with Mamba-based state-space encoding, adaptive prompting, and refinement heads. All load-bearing claims rest on experimental results across 13 benchmarks rather than any closed mathematical derivation, self-referential definition of terms, or fitted-parameter prediction that reduces to the inputs by construction. No equations, uniqueness theorems, or ansatzes are invoked that loop back to the model's own fitted values or prior self-citations in a load-bearing way; the derivation chain is therefore self-contained and externally falsifiable via the reported metrics.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

With only the abstract available, specific free parameters, axioms, or invented entities cannot be identified. The framework introduces concepts such as 'adaptive saliency prompting' and a 'boundary-aware refinement head', but their details are not provided.

pith-pipeline@v0.9.0 · 5510 in / 1324 out tokens · 43381 ms · 2026-05-10T05:40:44.119707+00:00 · methodology

Reference graph

Works this paper leans on

65 extracted references · 5 canonical work pages
