pith. machine review for the scientific record.

arxiv: 2604.03526 · v1 · submitted 2026-04-04 · 💻 cs.CV · cs.AI

Recognition: unknown

Determined by User Needs: A Salient Object Detection Rationale Beyond Conventional Visual Stimuli


Pith reviewed 2026-05-13 18:59 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords salient object detection · UserSOD · user needs · proactive needs · visual attention · image segmentation · computer vision · saliency detection

The pith

Salient object detection should prioritize objects matching a user's proactive needs rather than visual prominence alone.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that standard salient object detection relies on passive visual stimuli, overlooking the fact that users often approach an image with a specific need already in mind. It introduces the UserSOD task to detect objects that align with those needs, such as locating a white apple when that is the stated requirement. This matters because ignoring needs produces results that fail to satisfy users and leads to incorrect outputs in applications such as ranking objects by viewing sequence. The core barrier the paper identifies is the absence of datasets for training and testing under this user-need-driven rationale.

Core claim

Existing SOD methods adopt a passive visual stimulus-based rationale where objects with the strongest visual stimuli are treated as salient. The paper advocates a User Salient Object Detection (UserSOD) task that instead detects salient objects aligned with users' proactive needs when such needs exist, using the example of a user seeking a white apple and therefore focusing on matching objects in the image. This shift is presented as necessary to satisfy users and enable proper development of downstream tasks such as fine-grained salient object ranking.

What carries the argument

The UserSOD task, which determines salient objects by matching them to a user's stated proactive need instead of ranking by visual stimulus strength.
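
A minimal sketch of that contrast, assuming a hypothetical joint vision-language embedding; the encoder stubs below are placeholders of our own, not components the paper specifies:

```python
import numpy as np

# Placeholder encoders standing in for any joint vision-language model;
# random vectors keep the sketch runnable.
rng = np.random.default_rng(0)

def encode_need(text):
    return rng.normal(size=512)                 # hypothetical need embedding

def encode_regions(image, boxes):
    return rng.normal(size=(len(boxes), 512))   # hypothetical region embeddings

def stimulus_salient(stimulus_scores):
    # Conventional SOD rationale: the strongest bottom-up stimulus wins.
    return int(np.argmax(stimulus_scores))

def user_salient(image, boxes, need):
    # UserSOD rationale: the region best matching the stated need wins.
    q = encode_need(need)
    R = encode_regions(image, boxes)
    sims = R @ q / (np.linalg.norm(R, axis=1) * np.linalg.norm(q))
    return int(np.argmax(sims))
```

Everything downstream of the argmax is identical; only the scoring signal changes, and that change is the whole of the proposed shift.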

If this is right

  • User satisfaction increases when detected objects match the user's pre-existing need rather than visual standout alone.
  • Salient object ranking tasks produce more accurate viewing-order analysis because ranking can incorporate need-driven focus sequences.
  • Downstream applications that depend on understanding user attention gain more reliable inputs from need-aligned detection.
  • New datasets become required to train and evaluate models that incorporate user needs as an input signal.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Explicit user need statements or interaction history could serve as input to guide detection in real-time systems.
  • UserSOD could combine with other vision tasks such as object search or recommendation to create more personalized image analysis pipelines.
  • Creating synthetic or crowdsourced datasets with paired user needs and images would allow direct comparison of need-based versus stimulus-based outputs.
  • The approach implies that saliency models may need to handle cases where no user need is provided by falling back to visual stimuli; a minimal dispatch sketch follows this list.
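
A minimal sketch of that fallback, with both detector paths as hypothetical stubs (the paper names no such functions):

```python
def stimulus_sod(image):
    # Hypothetical conventional SOD model (visual stimuli only).
    return "saliency mask from bottom-up stimuli"

def need_conditioned_sod(image, need):
    # Hypothetical need-conditioned UserSOD model.
    return f"saliency mask for objects matching {need!r}"

def detect_salient(image, user_need=None):
    # No declared need: degrade gracefully to the stimulus-driven path.
    if user_need is None or not user_need.strip():
        return stimulus_sod(image)
    return need_conditioned_sod(image, user_need)
```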

Load-bearing premise

Users' proactive needs exist in a form that can be reliably captured and used to override or guide visual-stimuli-based saliency detection.

What would settle it

An experiment that measures user satisfaction and downstream task accuracy when applying a model trained on user-need-aligned annotations versus a standard visual-stimuli SOD model, using a test set where users first declare a specific need before viewing each image.
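
One way to render that experiment as code, assuming each test sample pairs an image with a declared need and a need-aligned ground-truth mask; all names here are illustrative:

```python
import numpy as np

def iou(pred, gt):
    # Intersection-over-union between two boolean masks.
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return float(inter / union) if union else 1.0

def evaluate(model, samples):
    # samples: iterable of (image, declared_need, gt_mask), where the
    # need is declared before the image is shown, per the protocol above.
    # A stimulus-only baseline simply ignores its `need` argument.
    return float(np.mean([iou(model(image, need) > 0.5, gt)
                          for image, need, gt in samples]))
```

The question would be settled if the need-aligned model scores clearly higher on this test set while matching the baseline when no need is declared.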

Figures

Figures reproduced from arXiv: 2604.03526 by Chenglizhao Chen, Luming Li, Shuai Li, Shujian Zhang, Wenfeng Song.

Figure 1. Motivation demonstration of our new task. …
Figure 2. Application of salient object detection method to salient …
Figure 3. The pipeline of our method. …
Figure 4. Existing sets vs. our UserSOD set. Compared to existing sets, the UserSOD set contains samples comprising an image, corresponding user-need commands, and corresponding ground truths (GT), where i ∈ (1, +∞) and j ∈ (1, +∞) denote the number of masks and user-need commands in a single sample. …
Figure 5. The proposed User Need Digger (UND). Given an image I ∈ R^{H×W} from existing samples, UND first performs Phase 1 to locate each object's bounding box (BB_i, i ∈ (0, +∞)) and semantics (O_i) via object detectors OD_1 to OD_k. Based on BB_i and O_i, UND infers user-need commands (UNC_j, j ∈ (1, +∞)) and the mask (M_i) of each latent target, obtaining M_i by feeding I to an existing visual foundation model (VFM(·)). …
Figure 7. Visualization of similar features. …
Figure 8. Visual comparisons between our method and SOTA SOD and RIS methods. Our method not only achieves sharp contours for conventional SOD but also meets fine-grained user needs.
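
Reading only the Figure 5 caption, the UND loop can be sketched roughly as follows; the detector, need-inference, and VFM callables are placeholders for whatever components the paper actually uses:

```python
def user_need_digger(image, detectors, infer_need, vfm):
    """Rough reconstruction of the two-phase UND loop from Figure 5.

    detectors  : OD_1..OD_k, each returning (bbox, semantic_label) pairs
    infer_need : maps an object's semantics to a user-need command UNC_j
    vfm        : promptable visual foundation model producing mask M_i
    """
    samples = []
    for det in detectors:
        # Phase 1: locate each object's bounding box BB_i and semantics O_i.
        for bbox, label in det(image):
            # Phase 2: infer the user-need command and the target mask.
            unc = infer_need(label)
            mask = vfm(image, bbox)
            samples.append((image, unc, mask))
    return samples
```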
read the original abstract

Existing salient object detection (SOD) methods adopt a passive visual stimulus-based rationale--objects with the strongest visual stimuli are perceived as the user's primary focus (i.e., salient objects). They ignore the decisive role of users' proactive needs in segmenting salient objects--if a user has a need before seeing an image, the user's salient objects align with their needs, e.g., if a user's need is "white apple", when this user sees an image, the user's primary focus is on the "white apple" or "the most white apple-like" objects in the image. Such an oversight not only fails to satisfy users, but also limits the development of downstream tasks. For instance, in salient object ranking tasks, focusing solely on visual stimuli-based salient objects is insufficient for conducting an analysis of fine-grained relationships between users' viewing order (usually determined by user's needs) and scenes, which may result in wrong ranking results. Clearly, it is essential to detect salient objects based on user needs. Thus, we advocate a User Salient Object Detection (UserSOD) task, which focuses on detecting salient objects align with users' proactive needs when user have needs. The main challenge for this new task is the lack of datasets for model training and testing.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript argues that conventional salient object detection (SOD) relies on a passive, visual-stimulus-driven rationale that selects objects with the strongest bottom-up cues, ignoring users' proactive needs. It introduces the User Salient Object Detection (UserSOD) task, in which saliency is determined by alignment with pre-existing user needs (illustrated by the 'white apple' example). The authors claim this shift would improve user satisfaction and enable more accurate downstream analyses such as fine-grained salient object ranking, while identifying the absence of suitable datasets as the central obstacle to progress.

Significance. If a concrete input representation and fusion mechanism for user needs were supplied, the proposal could motivate a move from purely stimulus-driven to intent-aware saliency models, with potential benefits for personalized retrieval and human-AI interaction pipelines. At present the contribution is motivational rather than technical; no empirical support, formal task definition, or dataset protocol is provided, so the significance remains prospective.

major comments (3)
  1. [Abstract] The assertion that existing SOD methods 'fail to satisfy users' and 'limit the development of downstream tasks' is stated without any supporting user study, failure-case analysis, or citation to concrete ranking errors in the literature.
  2. [Abstract] The UserSOD definition ('detecting salient objects align with users' proactive needs when user have needs') supplies no machine-readable input format for needs (text embedding, prior map, user profile, etc.) nor any sketch of how such input would be fused with image features, leaving the task non-operationalizable.
  3. [Abstract] The claim that visual-stimuli-only ranking produces 'wrong ranking results' because it ignores viewing order determined by needs is not accompanied by a reference to an existing ranking method or a worked example showing the discrepancy.
minor comments (2)
  1. [Abstract] Grammar and phrasing: 'align with' should read 'aligned with'; 'when user have needs' should read 'when users have needs'.
  2. [Abstract] The statement that 'the main challenge ... is the lack of datasets' could usefully be expanded with at least a high-level annotation protocol or input specification to guide future data collection.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript advocating the UserSOD task. We address each major comment point by point below and indicate where revisions will be made to strengthen the presentation.

read point-by-point responses
  1. Referee: [Abstract] The assertion that existing SOD methods 'fail to satisfy users' and 'limit the development of downstream tasks' is stated without any supporting user study, failure-case analysis, or citation to concrete ranking errors in the literature.

    Authors: We agree that the abstract would benefit from additional grounding. As the manuscript is primarily a position paper introducing a new task rationale and highlighting the dataset gap as the central obstacle, it does not contain a dedicated user study. In the revision we will add a short discussion with illustrative failure cases drawn from the white-apple example and cite related literature on intent-aware vision models to better support the motivation. revision: yes

  2. Referee: [Abstract] The UserSOD definition ('detecting salient objects align with users' proactive needs when user have needs') supplies no machine-readable input format for needs (text embedding, prior map, user profile, etc.) nor any sketch of how such input would be fused with image features, leaving the task non-operationalizable.

    Authors: The current work deliberately focuses on the conceptual shift and the resulting dataset challenge rather than delivering a full technical pipeline. We acknowledge that a high-level sketch of input representation would improve clarity. In the revised manuscript we will add a brief paragraph outlining possible machine-readable formats (e.g., text embeddings of user needs) and high-level fusion strategies with visual features, while keeping the emphasis on the task definition itself. revision: partial

  3. Referee: [Abstract] The claim that visual-stimuli-only ranking produces 'wrong ranking results' because it ignores viewing order determined by needs is not accompanied by a reference to an existing ranking method or a worked example showing the discrepancy.

    Authors: We will revise the abstract and expand the main text with a concrete worked example that contrasts stimulus-driven ranking against need-driven viewing order, showing how the resulting fine-grained analysis can differ. We will also reference representative existing salient-object-ranking methods to place the claim in context. revision: yes
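
To make response 2 concrete, one plausible and purely illustrative fusion strategy is FiLM-style gating of visual features by a user-need text embedding; nothing in the manuscript commits to this architecture, and the dimensions are arbitrary:

```python
import torch
import torch.nn as nn

class NeedFusion(nn.Module):
    """Gate a visual feature map with a user-need text embedding.

    One of many plausible fusion strategies; purely illustrative.
    """
    def __init__(self, d_text: int = 512, d_vis: int = 256):
        super().__init__()
        self.gamma = nn.Linear(d_text, d_vis)
        self.beta = nn.Linear(d_text, d_vis)

    def forward(self, vis_feat: torch.Tensor, need_emb: torch.Tensor):
        # vis_feat: (B, C, H, W) image features; need_emb: (B, d_text).
        g = self.gamma(need_emb)[:, :, None, None]
        b = self.beta(need_emb)[:, :, None, None]
        return vis_feat * (1 + g) + b
```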

Circularity Check

0 steps flagged

No circularity: definitional proposal of new task without equations or self-referential reductions

full rationale

The manuscript advocates UserSOD as a new task motivated by the claim that conventional SOD is passive and ignores proactive user needs. No equations, parameter fits, or derivations appear in the provided text. The central statement simply defines the new task in terms of the identified gap ('detecting salient objects align with users' proactive needs when user have needs') without reducing it to any fitted input, self-citation chain, or renamed known result. The argument is therefore self-contained as a motivation for future dataset creation rather than a closed derivation.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The proposal rests on the domain assumption that proactive user needs are the decisive factor for saliency and that current visual-stimuli-based methods are insufficient. There are no free parameters or invented physical entities; the only new entity is the task itself.

axioms (1)
  • domain assumption · Users' proactive needs determine salient objects more accurately and usefully than visual stimuli alone.
    Invoked throughout the abstract as the justification for introducing UserSOD.
invented entities (1)
  • UserSOD task · no independent evidence
    purpose: To detect salient objects that align with users' proactive needs
    Newly defined task without accompanying dataset or validation.

pith-pipeline@v0.9.0 · 5586 in / 1267 out tokens · 43323 ms · 2026-05-13T18:59:00.705895+00:00 · methodology

discussion (0)

