pith. machine review for the scientific record.

arxiv: 2604.03526 · v1 · submitted 2026-04-04 · 💻 cs.CV · cs.AI

Recognition: unknown

Determined by User Needs: A Salient Object Detection Rationale Beyond Conventional Visual Stimuli


Pith reviewed 2026-05-13 18:59 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords salient object detection · UserSOD · user needs · proactive needs · visual attention · image segmentation · computer vision · saliency detection

The pith

Salient object detection should prioritize objects matching a user's proactive needs rather than visual prominence alone.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that standard salient object detection relies on passive visual stimuli, overlooking the fact that users often approach an image with a specific need already in mind. It introduces the UserSOD task to detect objects that align with those needs, such as locating a white apple when that is the stated requirement. This matters because ignoring needs produces results that fail to satisfy users and leads to incorrect outputs in applications such as ranking objects by viewing sequence. The core barrier the paper identifies is the absence of datasets for training and testing under this user-need-driven rationale.

Core claim

Existing SOD methods adopt a passive visual stimulus-based rationale where objects with the strongest visual stimuli are treated as salient. The paper advocates a User Salient Object Detection (UserSOD) task that instead detects salient objects aligned with users' proactive needs when such needs exist, using the example of a user seeking a white apple and therefore focusing on matching objects in the image. This shift is presented as necessary to satisfy users and enable proper development of downstream tasks such as fine-grained salient object ranking.

What carries the argument

The UserSOD task, which determines salient objects by matching them to a user's stated proactive need instead of ranking by visual stimulus strength.
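
A minimal sketch of that contrast, assuming a hypothetical joint vision-language embedding; the encoder stubs below are placeholders of our own, not components the paper specifies:

```python
import numpy as np

# Placeholder encoders standing in for any joint vision-language model;
# random vectors keep the sketch runnable.
rng = np.random.default_rng(0)

def encode_need(text):
    return rng.normal(size=512)                 # hypothetical need embedding

def encode_regions(image, boxes):
    return rng.normal(size=(len(boxes), 512))   # hypothetical region embeddings

def stimulus_salient(stimulus_scores):
    # Conventional SOD rationale: the strongest bottom-up stimulus wins.
    return int(np.argmax(stimulus_scores))

def user_salient(image, boxes, need):
    # UserSOD rationale: the region best matching the stated need wins.
    q = encode_need(need)
    R = encode_regions(image, boxes)
    sims = R @ q / (np.linalg.norm(R, axis=1) * np.linalg.norm(q))
    return int(np.argmax(sims))
```

Everything downstream of the argmax is identical; only the scoring signal changes, and that change is the whole of the proposed shift.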

If this is right

  • User satisfaction increases when detected objects match the user's pre-existing need rather than visual standout alone.
  • Salient object ranking tasks produce more accurate viewing-order analysis because ranking can incorporate need-driven focus sequences.
  • Downstream applications that depend on understanding user attention gain more reliable inputs from need-aligned detection.
  • New datasets become required to train and evaluate models that incorporate user needs as an input signal.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Explicit user need statements or interaction history could serve as input to guide detection in real-time systems.
  • UserSOD could combine with other vision tasks such as object search or recommendation to create more personalized image analysis pipelines.
  • Creating synthetic or crowdsourced datasets with paired user needs and images would allow direct comparison of need-based versus stimulus-based outputs.
  • The approach implies that saliency models may need to handle cases where no user need is provided by falling back to visual stimuli; a minimal dispatch sketch follows this list.
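
A minimal sketch of that fallback, with both detector paths as hypothetical stubs (the paper names no such functions):

```python
def stimulus_sod(image):
    # Hypothetical conventional SOD model (visual stimuli only).
    return "saliency mask from bottom-up stimuli"

def need_conditioned_sod(image, need):
    # Hypothetical need-conditioned UserSOD model.
    return f"saliency mask for objects matching {need!r}"

def detect_salient(image, user_need=None):
    # No declared need: degrade gracefully to the stimulus-driven path.
    if user_need is None or not user_need.strip():
        return stimulus_sod(image)
    return need_conditioned_sod(image, user_need)
```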

Load-bearing premise

Users' proactive needs exist in a form that can be reliably captured and used to override or guide visual-stimuli-based saliency detection.

What would settle it

An experiment that measures user satisfaction and downstream task accuracy when applying a model trained on user-need-aligned annotations versus a standard visual-stimuli SOD model, using a test set where users first declare a specific need before viewing each image.
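
One way to render that experiment as code, assuming each test sample pairs an image with a declared need and a need-aligned ground-truth mask; all names here are illustrative:

```python
import numpy as np

def iou(pred, gt):
    # Intersection-over-union between two boolean masks.
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return float(inter / union) if union else 1.0

def evaluate(model, samples):
    # samples: iterable of (image, declared_need, gt_mask), where the
    # need is declared before the image is shown, per the protocol above.
    # A stimulus-only baseline simply ignores its `need` argument.
    return float(np.mean([iou(model(image, need) > 0.5, gt)
                          for image, need, gt in samples]))
```

The question would be settled if the need-aligned model scores clearly higher on this test set while matching the baseline when no need is declared.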

Figures

Figures reproduced from arXiv: 2604.03526 by Chenglizhao Chen, Luming Li, Shuai Li, Shujian Zhang, Wenfeng Song.

Figure 1. Motivation demonstration of our new task. …
Figure 2. Application of salient object detection method to salient …
Figure 3. The pipeline of our method. …
Figure 4. Existing sets vs. our UserSOD set. Compared to existing sets, the UserSOD set contains samples comprising an image, corresponding user-need commands, and corresponding ground truths (GT), where i ∈ (1, +∞) and j ∈ (1, +∞) denote the number of masks and user-need commands in a single sample. …
Figure 5. The proposed User Need Digger (UND). Given an image I ∈ R^{H×W} from existing samples, UND first performs Phase 1 to locate each object's bounding box (BB_i, i ∈ (0, +∞)) and semantics (O_i) via object detectors OD_1 to OD_k. Based on BB_i and O_i, UND infers user-need commands (UNC_j, j ∈ (1, +∞)) and the mask (M_i) of each latent target, obtaining M_i by feeding I to an existing visual foundation model (VFM(·)). …
Figure 7. Visualization of similar features. …
Figure 8. Visual comparisons between our method and SOTA SOD and RIS methods. Our method not only achieves sharp contours for conventional SOD but also meets fine-grained user needs.
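
Reading only the Figure 5 caption, the UND loop can be sketched roughly as follows; the detector, need-inference, and VFM callables are placeholders for whatever components the paper actually uses:

```python
def user_need_digger(image, detectors, infer_need, vfm):
    """Rough reconstruction of the two-phase UND loop from Figure 5.

    detectors  : OD_1..OD_k, each returning (bbox, semantic_label) pairs
    infer_need : maps an object's semantics to a user-need command UNC_j
    vfm        : promptable visual foundation model producing mask M_i
    """
    samples = []
    for det in detectors:
        # Phase 1: locate each object's bounding box BB_i and semantics O_i.
        for bbox, label in det(image):
            # Phase 2: infer the user-need command and the target mask.
            unc = infer_need(label)
            mask = vfm(image, bbox)
            samples.append((image, unc, mask))
    return samples
```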
read the original abstract

Existing salient object detection (SOD) methods adopt a passive visual stimulus-based rationale--objects with the strongest visual stimuli are perceived as the user's primary focus (i.e., salient objects). They ignore the decisive role of users' proactive needs in segmenting salient objects--if a user has a need before seeing an image, the user's salient objects align with their needs, e.g., if a user's need is "white apple", when this user sees an image, the user's primary focus is on the "white apple" or "the most white apple-like" objects in the image. Such an oversight not only fails to satisfy users, but also limits the development of downstream tasks. For instance, in salient object ranking tasks, focusing solely on visual stimuli-based salient objects is insufficient for conducting an analysis of fine-grained relationships between users' viewing order (usually determined by user's needs) and scenes, which may result in wrong ranking results. Clearly, it is essential to detect salient objects based on user needs. Thus, we advocate a User Salient Object Detection (UserSOD) task, which focuses on detecting salient objects align with users' proactive needs when user have needs. The main challenge for this new task is the lack of datasets for model training and testing.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript argues that conventional salient object detection (SOD) relies on a passive, visual-stimulus-driven rationale that selects objects with the strongest bottom-up cues, ignoring users' proactive needs. It introduces the User Salient Object Detection (UserSOD) task, in which saliency is determined by alignment with pre-existing user needs (illustrated by the 'white apple' example). The authors claim this shift would improve user satisfaction and enable more accurate downstream analyses such as fine-grained salient object ranking, while identifying the absence of suitable datasets as the central obstacle to progress.

Significance. If a concrete input representation and fusion mechanism for user needs were supplied, the proposal could motivate a move from purely stimulus-driven to intent-aware saliency models, with potential benefits for personalized retrieval and human-AI interaction pipelines. At present the contribution is motivational rather than technical; no empirical support, formal task definition, or dataset protocol is provided, so the significance remains prospective.

major comments (3)
  1. [Abstract] The assertion that existing SOD methods 'fail to satisfy users' and 'limit the development of downstream tasks' is stated without any supporting user study, failure-case analysis, or citation to concrete ranking errors in the literature.
  2. [Abstract] The UserSOD definition ('detecting salient objects align with users' proactive needs when user have needs') supplies no machine-readable input format for needs (text embedding, prior map, user profile, etc.) nor any sketch of how such input would be fused with image features, leaving the task non-operationalizable.
  3. [Abstract] The claim that visual-stimuli-only ranking produces 'wrong ranking results' because it ignores viewing order determined by needs is not accompanied by a reference to an existing ranking method or a worked example showing the discrepancy.
minor comments (2)
  1. [Abstract] Grammar and phrasing: 'align with' should read 'aligned with'; 'when user have needs' should read 'when users have needs'.
  2. [Abstract] The statement that 'the main challenge ... is the lack of datasets' could usefully be expanded with at least a high-level annotation protocol or input specification to guide future data collection.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript advocating the UserSOD task. We address each major comment point by point below and indicate where revisions will be made to strengthen the presentation.

read point-by-point responses
  1. Referee: [Abstract] The assertion that existing SOD methods 'fail to satisfy users' and 'limit the development of downstream tasks' is stated without any supporting user study, failure-case analysis, or citation to concrete ranking errors in the literature.

    Authors: We agree that the abstract would benefit from additional grounding. As the manuscript is primarily a position paper introducing a new task rationale and highlighting the dataset gap as the central obstacle, it does not contain a dedicated user study. In the revision we will add a short discussion with illustrative failure cases drawn from the white-apple example and cite related literature on intent-aware vision models to better support the motivation. revision: yes

  2. Referee: [Abstract] The UserSOD definition ('detecting salient objects align with users' proactive needs when user have needs') supplies no machine-readable input format for needs (text embedding, prior map, user profile, etc.) nor any sketch of how such input would be fused with image features, leaving the task non-operationalizable.

    Authors: The current work deliberately focuses on the conceptual shift and the resulting dataset challenge rather than delivering a full technical pipeline. We acknowledge that a high-level sketch of input representation would improve clarity. In the revised manuscript we will add a brief paragraph outlining possible machine-readable formats (e.g., text embeddings of user needs) and high-level fusion strategies with visual features, while keeping the emphasis on the task definition itself. revision: partial

  3. Referee: [Abstract] The claim that visual-stimuli-only ranking produces 'wrong ranking results' because it ignores viewing order determined by needs is not accompanied by a reference to an existing ranking method or a worked example showing the discrepancy.

    Authors: We will revise the abstract and expand the main text with a concrete worked example that contrasts stimulus-driven ranking against need-driven viewing order, showing how the resulting fine-grained analysis can differ. We will also reference representative existing salient-object-ranking methods to place the claim in context. revision: yes
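
To make response 2 concrete, one plausible and purely illustrative fusion strategy is FiLM-style gating of visual features by a user-need text embedding; nothing in the manuscript commits to this architecture, and the dimensions are arbitrary:

```python
import torch
import torch.nn as nn

class NeedFusion(nn.Module):
    """Gate a visual feature map with a user-need text embedding.

    One of many plausible fusion strategies; purely illustrative.
    """
    def __init__(self, d_text: int = 512, d_vis: int = 256):
        super().__init__()
        self.gamma = nn.Linear(d_text, d_vis)
        self.beta = nn.Linear(d_text, d_vis)

    def forward(self, vis_feat: torch.Tensor, need_emb: torch.Tensor):
        # vis_feat: (B, C, H, W) image features; need_emb: (B, d_text).
        g = self.gamma(need_emb)[:, :, None, None]
        b = self.beta(need_emb)[:, :, None, None]
        return vis_feat * (1 + g) + b
```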

Circularity Check

0 steps flagged

No circularity: definitional proposal of new task without equations or self-referential reductions

full rationale

The manuscript advocates UserSOD as a new task motivated by the claim that conventional SOD is passive and ignores proactive user needs. No equations, parameter fits, or derivations appear in the provided text. The central statement simply defines the new task in terms of the identified gap ('detecting salient objects align with users' proactive needs when user have needs') without reducing it to any fitted input, self-citation chain, or renamed known result. The argument is therefore self-contained as a motivation for future dataset creation rather than a closed derivation.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The proposal rests on the domain assumption that proactive user needs are the decisive factor for saliency and that current visual-stimuli-based methods are insufficient. There are no free parameters or invented physical entities; the only new entity is the task itself.

axioms (1)
  • domain assumption · Users' proactive needs determine salient objects more accurately and usefully than visual stimuli alone.
    Invoked throughout the abstract as the justification for introducing UserSOD.
invented entities (1)
  • UserSOD task · no independent evidence
    purpose: To detect salient objects that align with users' proactive needs
    Newly defined task without accompanying dataset or validation.

pith-pipeline@v0.9.0 · 5586 in / 1267 out tokens · 43323 ms · 2026-05-13T18:59:00.705895+00:00 · methodology

discussion (0)

