BED-SAM2: Boundary-Enhanced-Depth SAM2 via Monocular Geometric Priors

Chandra Kambhamettu; Colin Kelly; Dara McNally; Kyle O'Donnell; Tyler Rust

arxiv: 2605.24893 · v1 · pith:MXTTCHSGnew · submitted 2026-05-24 · 💻 cs.CV


Pith reviewed 2026-06-30 12:00 UTC · model grok-4.3


keywords: SAM2 · monocular depth · boundary enhancement · camouflaged object detection · salient object detection · geometric priors · Hiera encoder


The pith

BED-SAM2 modifies the SAM2 Hiera encoder to encode monocular depth from RGB images for sharper object boundaries.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents BED-SAM2 as a direct extension of the SAM2 vision foundation model. It alters the Hiera encoder to accept and process monocular depth maps estimated from standard RGB inputs. These depth signals supply geometric information intended to improve boundary precision during segmentation. The resulting model is reported to reach competitive results on salient object detection and camouflaged object detection benchmarks after as few as five training epochs. Readers would care because the change is presented as lightweight yet effective for tasks where objects blend with backgrounds.

Core claim

BED-SAM2 modifies the SAM2 Hiera encoder architecture so that it directly encodes monocular depth information obtained from RGB images. The added depth channel supplies geometric cues that support more accurate delineation of object boundaries and extraction of camouflaged shapes. This yields competitive state-of-the-art performance on multiple salient and camouflaged object detection tasks while requiring as few as five training epochs.

What carries the argument

The modified SAM2 Hiera encoder that ingests monocular depth maps alongside RGB to inject geometric priors for boundary refinement.

If this is right

Object boundary accuracy improves in both salient and camouflaged detection settings.
Performance reaches competitive levels on standard benchmarks after minimal fine-tuning.
The same encoder change applies across multiple related detection tasks without task-specific redesign.
Geometric cues from depth reduce reliance on appearance alone for shape extraction.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same depth-injection pattern could be tested on other segmentation foundation models to check transferability.
If monocular depth proves consistently helpful, training pipelines for boundary-sensitive tasks might incorporate depth estimation as a standard preprocessing step.
Medical or aerial imagery domains where boundaries are critical could serve as natural next testbeds for the approach.

Load-bearing premise

Monocular depth estimates derived from RGB images supply reliable geometric cues that improve boundary detection without introducing errors or needing large architectural revisions.

What would settle it

A controlled ablation that removes the depth-encoding branch and shows equivalent or higher accuracy on the same detection benchmarks would falsify the claim that the depth channel is the source of the reported gains.

Figures

Figures reproduced from arXiv: 2605.24893 by Chandra Kambhamettu, Colin Kelly, Dara McNally, Kyle O'Donnell, Tyler Rust.

**Figure 1.** Cumulative structure map from monocular depth. Sobel filters are applied independently to the RGB, raw depth, inverse depth, …

**Figure 2.** Overview of the proposed BED-SAM2 architecture, adapted from SAM2-UNet […]
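The Figure 1 caption describes applying Sobel filters independently to several inputs (RGB, raw depth, inverse depth) and combining them into a cumulative structure map. The sketch below illustrates that idea in pure Python under stated assumptions: the caption is truncated, so the combination rule (summing per-channel gradient magnitudes and rescaling to [0, 1]), the edge-replication border handling, and all function names are hypothetical choices, not the authors' implementation.

```python
# Hypothetical sketch: per-channel Sobel gradient magnitude, summed into one
# "cumulative structure map". The summation rule is an assumption.

SOBEL_X = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]
SOBEL_Y = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]


def sobel_magnitude(img):
    """Gradient magnitude of a 2D list of floats, with edge replication."""
    h, w = len(img), len(img[0])

    def px(y, x):
        # Clamp coordinates so border pixels reuse their nearest neighbor.
        return img[min(max(y, 0), h - 1)][min(max(x, 0), w - 1)]

    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            gx = gy = 0.0
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    v = px(y + dy, x + dx)
                    gx += SOBEL_X[dy + 1][dx + 1] * v
                    gy += SOBEL_Y[dy + 1][dx + 1] * v
            out[y][x] = (gx * gx + gy * gy) ** 0.5
    return out


def inverse_depth(depth, eps=1e-6):
    """Element-wise 1 / (depth + eps); eps guards against division by zero."""
    return [[1.0 / (d + eps) for d in row] for row in depth]


def cumulative_structure_map(channels):
    """Sum per-channel Sobel magnitudes and rescale the result to [0, 1]."""
    h, w = len(channels[0]), len(channels[0][0])
    total = [[0.0] * w for _ in range(h)]
    for ch in channels:
        mag = sobel_magnitude(ch)
        for y in range(h):
            for x in range(w):
                total[y][x] += mag[y][x]
    peak = max(max(row) for row in total)
    if peak > 0:
        total = [[v / peak for v in row] for row in total]
    return total


# Toy scene: flat appearance (camouflage-like) but a depth step between
# columns 2 and 3, so only the depth channels contribute edges.
gray = [[0.5] * 6 for _ in range(4)]
depth = [[1.0] * 3 + [2.0] * 3 for _ in range(4)]
smap = cumulative_structure_map([gray, depth, inverse_depth(depth)])
```

In this toy case the grayscale channel has no gradient at all, so the structure map responds only along the depth discontinuity, which is the situation the paper targets with camouflaged objects.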

Original abstract

Building upon the SAM2 vision foundation model for downstream segmentation, this study introduces Boundary Enhanced Depth (BED)-SAM2. The SAM2 Hiera encoder architecture is modified to directly encode monocular depth information from RGB images, thereby providing geometric cues that enhance object boundary delineation and facilitate the extraction of camouflaged object shapes. BED-SAM2 demonstrates competitive state-of-the-art performance across multiple salient and camouflaged object detection tasks with as few as five training epochs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

BED-SAM2 adds monocular depth to SAM2's encoder to help with boundaries on camouflaged objects, but the depth is estimated from the same RGB image, so injected estimation error looks like the main risk.


The paper's move is to modify SAM2's Hiera encoder so it directly takes monocular depth estimates alongside the RGB input. The goal is to supply geometric cues that sharpen object boundaries, especially in camouflaged scenes. They report competitive results on salient and camouflaged object detection after only five training epochs.

The efficiency part is concrete and worth noting. If the numbers hold with standard metrics and reasonable baselines, that low-epoch performance could be useful for people fine-tuning foundation models.

The modification itself stays close to the original architecture, which avoids big changes. Credit for keeping the approach simple and for targeting a real pain point in segmentation.

The stress-test concern lands. Depth is estimated from the identical RGB image, and camouflaged or low-texture areas are exactly where monocular estimators produce the most errors. Nothing in the abstract shows a mechanism to isolate or correct those errors before they reach the encoder. Without ablations that isolate the depth contribution and demonstrate it adds signal rather than noise, the performance claim is hard to trust.

The full paper would need to show the depth estimator used, how the encoding is implemented, and controlled experiments that rule out the depth simply being redundant or harmful. If those sections are thin, the central assumption does not hold up.

This is for readers already working with SAM2 or similar segmentation models who want a quick tweak for boundary cases. It is not broad enough or novel enough on its own to change the field.

I would send it for peer review if the experiments include proper ablations and multiple datasets, because the claim is testable. Otherwise it is a minor variant that does not need referee time.

Referee Report

1 major / 1 minor

Summary. The manuscript proposes BED-SAM2, a modification of the SAM2 vision foundation model in which the Hiera encoder is altered to directly encode monocular depth maps estimated from the input RGB images. This is intended to supply geometric priors that improve object boundary delineation, with particular emphasis on camouflaged object detection. The paper reports that the resulting model achieves competitive state-of-the-art performance on multiple salient and camouflaged object detection benchmarks after only five training epochs.

Significance. If the performance claims are substantiated by rigorous experiments, the work would demonstrate a lightweight way to inject monocular geometric information into a large vision foundation model without major architectural overhaul, potentially benefiting downstream segmentation tasks that rely on boundary accuracy.

major comments (1)

[Method description] (no section number supplied in the available text): The central assumption that directly encoding monocular depth supplies reliable boundary-enhancing cues is not accompanied by any analysis of depth-estimation error rates or their propagation through the Hiera encoder. In camouflaged or low-texture regions, precisely the regimes highlighted in the abstract, monocular depth estimators are known to produce large errors; without explicit mitigation or an ablation showing that these errors do not degrade the encoder features, the reported gains could be illusory.

minor comments (1)

[Abstract] The abstract asserts 'competitive state-of-the-art performance' and 'as few as five training epochs' but supplies no quantitative metrics, baselines, datasets, or statistical significance tests, making the claim impossible to evaluate from the provided text.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on the method description. We respond to the major comment below.


Referee: the central assumption that directly encoding monocular depth supplies reliable boundary-enhancing cues is not accompanied by any analysis of depth-estimation error rates or their propagation through the Hiera encoder. In camouflaged or low-texture regions—precisely the regimes highlighted in the abstract—monocular depth estimators are known to produce large errors; without explicit mitigation or ablation showing that these errors do not degrade the encoder features, the reported gains could be illusory.

Authors: We agree that the manuscript lacks an explicit analysis of depth-estimation error rates and their propagation through the Hiera encoder. In the revised manuscript we will add a dedicated subsection that (i) reports standard depth error metrics (AbsRel, RMSE) of the monocular estimator on the camouflaged-object benchmarks, (ii) presents an ablation that injects controlled noise into the depth maps at levels matching observed error statistics, and (iii) measures the resulting change in boundary F-measure and mIoU. This will directly test whether the reported gains remain robust under realistic depth inaccuracies in low-texture regions.

Revision: yes
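The depth error metrics named in the simulated response (AbsRel, RMSE) and the proposed noise-injection ablation can be sketched as follows. This is a minimal pure-Python illustration, not the authors' evaluation code: the function names, the Gaussian noise model, and the convention that non-positive ground-truth values mark invalid pixels are all assumptions, and it presumes that reference depth is available for the benchmark images.

```python
import math
import random


def _valid_pairs(pred, gt):
    # Assumption: ground-truth values <= 0 mark pixels without valid depth.
    return [(p, g) for p, g in zip(pred, gt) if g > 0]


def abs_rel(pred, gt):
    """Absolute relative error: mean of |pred - gt| / gt over valid pixels."""
    pairs = _valid_pairs(pred, gt)
    return sum(abs(p - g) / g for p, g in pairs) / len(pairs)


def rmse(pred, gt):
    """Root-mean-square error over valid pixels."""
    pairs = _valid_pairs(pred, gt)
    return math.sqrt(sum((p - g) ** 2 for p, g in pairs) / len(pairs))


def add_gaussian_noise(depth, sigma, seed=0):
    """Return a noisy copy of a flat depth list (hypothetical noise model).

    Values are clamped to stay positive so inverse depth remains defined.
    """
    rng = random.Random(seed)
    return [max(d + rng.gauss(0.0, sigma), 1e-6) for d in depth]


# Flattened toy depth maps; the last ground-truth pixel is invalid.
gt = [1.0, 2.0, 4.0, 0.0]
pred = [1.1, 1.8, 4.0, 3.0]
print("AbsRel:", abs_rel(pred, gt))
print("RMSE:  ", rmse(pred, gt))

# Noise sweep: in an ablation, each noisy map would be fed to the model and
# the segmentation metrics recorded alongside the depth error.
for sigma in (0.05, 0.1, 0.2):
    noisy = add_gaussian_noise(gt[:3], sigma)
    print(sigma, rmse(noisy, gt[:3]))
```

Fixing the random seed keeps the sweep reproducible, so any change in segmentation quality can be attributed to the noise level rather than to sampling variation.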

Circularity Check

0 steps flagged

No circularity: empirical architecture change with no derivation chain


The paper describes a direct architectural modification to the SAM2 Hiera encoder to accept monocular depth maps alongside RGB input. No equations, uniqueness theorems, fitted parameters renamed as predictions, or self-citation load-bearing steps appear in the provided text. Performance claims rest on reported training results across datasets rather than any closed mathematical reduction to inputs. The monocular-depth integration is an explicit design choice, not a derived quantity that equals its own construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The abstract does not provide sufficient technical details to identify any free parameters, axioms, or invented entities.



Reference graph

Works this paper leans on

54 extracted references · 4 canonical work pages · 1 internal anchor

[1] Tianrun Chen, Ankang Lu, Lanyun Zhu, Chaotao Ding, Chunan Yu, Deyi Ji, Zejian Li, Lingyun Sun, Papa Mao, and Ying Zang. Sam2-adapter: Evaluating & adapting segment anything 2 in downstream tasks: Camouflage, shadow, medical image segmentation, and more. arXiv preprint arXiv:2408.04579, 2024.
[2] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In CVPR, pages 248–255, 2009.
[3] George Dimas, Panagiota Gatoula, and Dimitris K. Iakovidis. MonoSOD: Monocular salient object detection based on predicted depth. In IEEE International Conference on Robotics and Automation (ICRA), pages 4377–4383, 2021.
[4] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In ICLR, 2021.
[5] Deng-Ping Fan, Ming-Ming Cheng, Yun Liu, Tao Li, and Ali Borji. Structure-measure: A new way to evaluate foreground maps. In ICCV, pages 4548–4557, 2017.
[6] Deng-Ping Fan, Cheng Gong, Yang Cao, Bo Ren, Ming-Ming Cheng, and Ali Borji. Enhanced-alignment measure for binary foreground map evaluation. In IJCAI, pages 698–704, 2018.
[7] Deng-Ping Fan, Ge-Peng Ji, Guolei Sun, Ming-Ming Cheng, Jianbing Shen, and Ling Shao. Camouflaged object detection. In CVPR, pages 2774–2784, 2020.
[8] Deng-Ping Fan, Ge-Peng Ji, Tao Zhou, Geng Chen, Huazhu Fu, Jianbing Shen, and Ling Shao. Pranet: Parallel reverse attention network for polyp segmentation. In Medical Image Computing and Computer Assisted Intervention (MICCAI), pages 263–273, 2020.
[9] Deng-Ping Fan, Zheng Lin, Zhao Zhang, Menglong Zhu, and Ming-Ming Cheng. Rethinking rgb-d salient object detection: Models, data sets, and large-scale benchmarks. IEEE Trans. Neural Netw. Learn. Syst., 32(5):2044–2059, 2021.
[10] Deng-Ping Fan, Ge-Peng Ji, Ming-Ming Cheng, and Ling Shao. Concealed object detection. IEEE TPAMI, 44(10):6024–6042, 2022.
[11] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, pages 770–778, 2016.
[12] Xiankang He, Dongyan Guo, Hongji Li, Ruibo Li, Ying Cui, and Chi Zhang. Distill any depth: Distillation creates a stronger monocular depth estimator. arXiv preprint arXiv:2502.19204, 2025.
[13] Heiko Hirschmuller. Stereo processing by semiglobal matching and mutual information. IEEE TPAMI, 30(2):328–341,
[14] Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. Lora: Low-rank adaptation of large language models. In ICLR, 2022.
[15] Ran Ju, Ling Ge, Wenjing Geng, Tongwei Ren, and Gangshan Wu. Depth saliency based on anisotropic center-surround difference. In IEEE Int. Conf. Image Process. (ICIP), pages 1115–1119, 2014.
[16] Nick Kanopoulos, Nagesh Vasanthavada, and Robert L. Baker. Design of an image edge detection filter using the sobel operator. IEEE J. Solid-State Circuits, 23(2):358–367,
[17] Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C. Berg, Wan-Yen Lo, Piotr Dollár, and Ross Girshick. Segment anything. In ICCV, pages 3992–4003, 2023.
[18] Trung-Nghia Le, Tam V. Nguyen, Zhongliang Nie, Minh-Triet Tran, and Akihiro Sugimoto. Anabranch network for camouflaged object segmentation. Comput. Vis. Image Underst., 184:45–56, 2019.
[19] Guanbin Li and Yizhou Yu. Visual saliency based on multi-scale deep features. In CVPR, pages 5455–5463, 2015.
[20] Yin Li, Xiaodi Hou, Christof Koch, James M. Rehg, and Alan L. Yuille. The secrets of salient object segmentation. In CVPR, pages 280–287, 2014.
[21] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft coco: Common objects in context. In ECCV, pages 740–755, 2014.
[22] Songtao Liu, Di Huang, and Yunhong Wang. Receptive field block net for accurate and fast object detection. In ECCV, pages 404–419, 2018.
[23] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In International Conference on Learning Representations (ICLR), 2019.
[24] Ziyang Luo, Nian Liu, Wangbo Zhao, Xuguang Yang, Dingwen Zhang, Deng-Ping Fan, Fahad Khan, and Junwei Han. VSCode: General visual salient and camouflaged object detection with 2D prompt learning. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 17169–17180, 2024.
[25] Yunqiu Lyu, Jing Zhang, Yuchao Dai, Aixuan Li, Bowen Liu, Nick Barnes, and Deng-Ping Fan. Simultaneously localize, segment and rank the camouflaged objects. In CVPR, pages 11591–11601, 2021.
[26] Ran Margolin, Lihi Zelnik-Manor, and Ayellet Tal. How to evaluate foreground maps. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 248–255, 2014.
[27] Yuzhen Niu, Yucheng Geng, Xueqing Li, and Feng Liu. Leveraging stereopsis for saliency analysis. In CVPR, pages 454–461, 2012.
[28] Youwei Pang, Xiaoqi Zhao, Tian-Zhu Xiang, Lihe Zhang, and Huchuan Lu. ZoomNeXt: A unified collaborative pyramid network for camouflaged object detection. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 46(12):9205–9220, 2024.
[29] Houwen Peng, Bing Li, Weihua Xiong, Weiming Hu, and Rongrong Ji. Rgbd salient object detection: A benchmark and algorithms. In ECCV, pages 92–109, 2014.
[30] Xuebin Qin, Zichen Zhang, Chenyang Huang, Masood Dehghan, Osmar R. Zaiane, and Martin Jagersand. U2-Net: Going deeper with nested U-Structure for salient object detection. Pattern Recognition, 106:107404, 2020.
[31] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision. In Int. Conf. Mach. Learn. (ICML), pages 8748–8763, 2021.
[32] Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu, Ronghang Hu, Chaitanya Ryali, Tengyu Ma, Haitham Khedr, Roman Rädle, Chloe Rolland, Laura Gustafson, Eric Mintun, Junting Pan, Kalyan Vasudev Alwala, Nicolas Carion, Chao-Yuan Wu, Ross Girshick, Piotr Dollár, and Christoph Feichtenhofer. Sam 2: Segment anything in images and videos. arXiv preprint arXiv:...
[33] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer Assisted Intervention (MICCAI), pages 234–241, 2015.
[34] Chaitanya Ryali, Yuan-Ting Hu, Daniel Bolya, Chen Wei, Haoqi Fan, Po-Yao Huang, Vaibhav Aggarwal, Arkabandhu Chowdhury, Omid Poursaeed, Judy Hoffman, Jitendra Malik, Yanghao Li, and Christoph Feichtenhofer. Hiera: A hierarchical vision transformer without the bells-and-whistles. In Int. Conf. Mach. Learn. (ICML), pages 29441–29454, 2023.
[35] Daniel Scharstein and Richard Szeliski. A taxonomy and evaluation of dense two-frame stereo correspondence algorithms. IJCV, 47:7–42, 2002.
[36] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR,
[37] Przemysław Skurowski, Hassan Abdulameer, J. Błaszczyk, Tomasz Depta, Adam Kornacki, and Paweł Kozieł. Animal camouflage analysis: Chameleon database. Unpublished manuscript, 2018.
[38] Lijun Wang, Huchuan Lu, Yifan Wang, Mengyang Feng, Dong Wang, Baocai Yin, and Xiang Ruan. Learning to detect salient objects with image-level supervision. In CVPR, pages 136–145, 2017.
[39] Qingwei Wang, Jinmiao Zheng, Guangyu Qian, Jinghui Dong, Ling Shao, and Ge-Peng Ji. Depth-aided camouflaged object detection. In ACM MM, pages 8298–8307, 2023.
[40] Yi Wang, Ruili Deng, Qiong Pan, Mingchen Zhuge, Ge-Peng Ji, and Deng-Ping Fan. Pixels, regions, and objects: Multiple enhancement for salient object detection. In CVPR, pages 10031–10040, 2023.
[41] Jun Wei, Shuhui Wang, and Qingming Huang. F3Net: Fusion, feedback and focus for salient object detection. In AAAI Conference on Artificial Intelligence (AAAI), pages 12321–12328, 2020.
[42] Yu-Huan Wu, Yun Liu, Le Zhang, Ming-Ming Cheng, and Bo Hu. Edn: Salient object detection via extremely-downsampled network. IEEE TIP, 31:3542–3555, 2022.
[43] Zongwei Wu, Guillaume Allibert, Fabrice Meriaudeau, Chao Ma, and Cédric Demonceaux. HiDAnet: RGB-D salient object detection via hierarchical depth awareness. IEEE Transactions on Image Processing (TIP), 32:2160–2173, 2023.
[44] Mochu Xiang, Jing Zhang, Yunqiu Lv, Aixuan Li, Yiran Zhong, and Yuchao Dai. Exploring depth contribution for camouflaged object detection. arXiv preprint arXiv:2106.13217, 2021.
[45] Chenxi Xie, Changqun Xia, Mingcan Ma, Zhirui Zhao, Xiaowu Chen, and Jia Li. Pyramid grafting network for one-stage high resolution saliency detection. In CVPR, pages 11717–11726, 2022.
[46] Xinyu Xiong, Zihuang Wu, Shuangyi Tan, Wenxue Li, Feilong Tang, Ying Chen, Siying Li, Jie Ma, and Guanbin Li. Sam2-unet: Segment anything 2 makes strong encoder for natural and medical image segmentation. Visual Intelligence, 4(1):2, 2026.
[47] Qiong Yan, Li Xu, Jianping Shi, and Jiaya Jia. Hierarchical saliency detection. In CVPR, pages 1155–1162, 2013.
[48] Chuan Yang, Lihe Zhang, Huchuan Lu, Xiang Ruan, and Ming-Hsuan Yang. Saliency detection via graph-based manifold ranking. In CVPR, pages 3166–3173, 2013.
[49] Bo-Wen Yin, Yan-Jie Zhang, Pengyu Zhou, Jifeng Zhao, Luc Van Gool, and Qibin Zhang. Dformerv2: Geometry self-attention for rgbd semantic segmentation. In CVPR, 2025.
[50] Yi Zeng, Pingping Zhang, Jianming Zhang, Zhe Lin, and Huchuan Lu. Towards high-resolution salient object detection. In ICCV, pages 7234–7243, 2019.
[51] Jing Zhang, Deng-Ping Fan, Yuchao Dai, Xin Yu, Yiran Zhong, Nick Barnes, and Ling Shao. Rgb-d saliency detection via cascaded mutual information minimization. In ICCV, pages 4338–4347, 2021.
[52] Jin Zhang, Zhao Liu, Yanliang Ye, Huibing Bi, and Deng-Ping Fan. Fastersal: Robust and real-time single-stream architecture for rgb-d salient object detection. IEEE TMM, 27:1507–1519, 2025.
[53] Peng Zheng, Dehong Gao, Deng-Ping Fan, Li Liu, Jorma Laaksonen, Wanli Ouyang, and Nicu Sebe. Bilateral reference for high-resolution dichotomous image segmentation. CAAI Artif. Intell. Res., 3:9150038, 2024.
[54] Mingchen Zhuge, Deng-Ping Fan, Nian Liu, Dingwen Zhang, Dong Xu, and Ling Shao. Salient object detection via integrity learning. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 45(3):3738–3752,

[1] [1]

Sam2-adapter: Evaluating & adapting seg- ment anything 2 in downstream tasks: Camouflage, shadow, medical image segmentation, and more.arXiv preprint arXiv:2408.04579, 2024

Tianrun Chen, Ankang Lu, Lanyun Zhu, Chaotao Ding, Chu- nan Yu, Deyi Ji, Zejian Li, Lingyun Sun, Papa Mao, and Ying Zang. Sam2-adapter: Evaluating & adapting seg- ment anything 2 in downstream tasks: Camouflage, shadow, medical image segmentation, and more.arXiv preprint arXiv:2408.04579, 2024. 2, 5

work page arXiv 2024

[2] [2]

Imagenet: A large-scale hierarchical image database

Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. InCVPR, pages 248–255, 2009. 1

2009

[3] [3]

Iako- vidis

George Dimas, Panagiota Gatoula, and Dimitris K. Iako- vidis. MonoSOD: Monocular salient object detection based on predicted depth. InIEEE International Conference on Robotics and Automation (ICRA), pages 4377–4383, 2021. 6

2021

[4] [4]

An image is worth 16x16 words: Transformers for image recognition at scale

Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Syl- vain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. InICLR, 2021. 1

2021

[5] [5]

Structure-measure: A new way to evaluate foreground maps

Deng-Ping Fan, Ming-Ming Cheng, Yun Liu, Tao Li, and Ali Borji. Structure-measure: A new way to evaluate foreground maps. InICCV, pages 4548–4557, 2017. 4

2017

[6] [6]

Enhanced-alignment measure for binary foreground map evaluation

Deng-Ping Fan, Cheng Gong, Yang Cao, Bo Ren, Ming- Ming Cheng, and Ali Borji. Enhanced-alignment measure for binary foreground map evaluation. InIJCAI, pages 698– 704, 2018. 4

2018

[7] [7]

Camouflaged object detec- tion

Deng-Ping Fan, Ge-Peng Ji, Guolei Sun, Ming-Ming Cheng, Jianbing Shen, and Ling Shao. Camouflaged object detec- tion. InCVPR, pages 2774–2784, 2020. 4, 5, 7

2020

[8] [8]

Pranet: Parallel reverse attention network for polyp segmentation

Deng-Ping Fan, Ge-Peng Ji, Tao Zhou, Geng Chen, Huazhu Fu, Jianbing Shen, and Ling Shao. Pranet: Parallel reverse attention network for polyp segmentation. InMedical Image Computing and Computer Assisted Intervention (MICCAI), pages 263–273, 2020. 3, 4

2020

[9] [9]

Rethinking rgb-d salient object detec- tion: Models, data sets, and large-scale benchmarks.IEEE Trans

Deng-Ping Fan, Zheng Lin, Zhao Zhang, Menglong Zhu, and Ming-Ming Cheng. Rethinking rgb-d salient object detec- tion: Models, data sets, and large-scale benchmarks.IEEE Trans. Neural Netw. Learn. Syst., 32(5):2044–2059, 2021. 4, 5, 7

2044

[10] [10]

Concealed object detection.IEEE TPAMI, 44(10): 6024–6042, 2022

Deng-Ping Fan, Ge-Peng Ji, Ming-Ming Cheng, and Ling Shao. Concealed object detection.IEEE TPAMI, 44(10): 6024–6042, 2022. 5

2022

[11] [11]

Deep residual learning for image recognition

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. InCVPR, pages 770–778, 2016. 1, 2

2016

[12] [12]

Distill any depth: Distillation cre- ates a stronger monocular depth estimator.arXiv preprint arXiv:2502.19204, 2025

Xiankang He, Dongyan Guo, Hongji Li, Ruibo Li, Ying Cui, and Chi Zhang. Distill any depth: Distillation cre- ates a stronger monocular depth estimator.arXiv preprint arXiv:2502.19204, 2025. 4

work page arXiv 2025

[13] [13]

Stereo processing by semiglobal match- ing and mutual information.IEEE TPAMI, 30(2):328–341,

Heiko Hirschmuller. Stereo processing by semiglobal match- ing and mutual information.IEEE TPAMI, 30(2):328–341,

[14] [14]

Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen- Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen

Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen- Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. Lora: Low-rank adaptation of large language models. In ICLR, 2022. 3, 5

2022

[15] [15]

Depth saliency based on anisotropic center- surround difference

Ran Ju, Ling Ge, Wenjing Geng, Tongwei Ren, and Gang- shan Wu. Depth saliency based on anisotropic center- surround difference. InIEEE Int. Conf. Image Process. (ICIP), pages 1115–1119, 2014. 4, 5

2014

[16] [16]

Nick Kanopoulos, Nagesh Vasanthavada, and Robert L. Baker. Design of an image edge detection filter using the sobel operator.IEEE J. Solid-State Circuits, 23(2):358–367,

[17] [17]

Berg, Wan-Yen Lo, Piotr Doll ´ar, and Ross Girshick

Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer White- head, Alexander C. Berg, Wan-Yen Lo, Piotr Doll ´ar, and Ross Girshick. Segment anything. InICCV, pages 3992– 4003, 2023. 1, 2

2023

[18] [18]

Nguyen, Zhongliang Nie, Minh- Triet Tran, and Akihiro Sugimoto

Trung-Nghia Le, Tam V . Nguyen, Zhongliang Nie, Minh- Triet Tran, and Akihiro Sugimoto. Anabranch network for camouflaged object segmentation.Comput. Vis. Image Un- derst., 184:45–56, 2019. 4, 5, 7

2019

[19] [19]

Visual saliency based on multi- scale deep features

Guanbin Li and Yizhou Yu. Visual saliency based on multi- scale deep features. InCVPR, pages 5455–5463, 2015. 4

2015

[20] [20]

Rehg, and Alan L

Yin Li, Xiaodi Hou, Christof Koch, James M. Rehg, and Alan L. Yuille. The secrets of salient object segmentation. InCVPR, pages 280–287, 2014. 4

2014

[21] [21]

Lawrence Zitnick

Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Doll´ar, and C. Lawrence Zitnick. Microsoft coco: Common objects in context. In ECCV, pages 740–755, 2014. 1

2014

[22]

Songtao Liu, Di Huang, and Yunhong Wang. Receptive field block net for accurate and fast object detection. In ECCV, pages 404–419, 2018. 3

[23]

Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In International Conference on Learning Representations (ICLR), 2019. 5

[24]

Ziyang Luo, Nian Liu, Wangbo Zhao, Xuguang Yang, Dingwen Zhang, Deng-Ping Fan, Fahad Khan, and Junwei Han. VSCode: General visual salient and camouflaged object detection with 2D prompt learning. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 17169–17180, 2024. 6, 7

[25]

Yunqiu Lyu, Jing Zhang, Yuchao Dai, Aixuan Li, Bowen Liu, Nick Barnes, and Deng-Ping Fan. Simultaneously localize, segment and rank the camouflaged objects. In CVPR, pages 11591–11601, 2021. 4, 5, 7

[26]

Ran Margolin, Lihi Zelnik-Manor, and Ayellet Tal. How to evaluate foreground maps. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 248–255, 2014. 4

[27]

Yuzhen Niu, Yucheng Geng, Xueqing Li, and Feng Liu. Leveraging stereopsis for saliency analysis. In CVPR, pages 454–461, 2012. 4, 5, 7

[28]

Youwei Pang, Xiaoqi Zhao, Tian-Zhu Xiang, Lihe Zhang, and Huchuan Lu. ZoomNeXt: A unified collaborative pyramid network for camouflaged object detection. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 46(12):9205–9220, 2024. 7

[29]

Houwen Peng, Bing Li, Weihua Xiong, Weiming Hu, and Rongrong Ji. RGBD salient object detection: A benchmark and algorithms. In ECCV, pages 92–109, 2014. 4, 5, 7

[30]

Xuebin Qin, Zichen Zhang, Chenyang Huang, Masood Dehghan, Osmar R. Zaiane, and Martin Jagersand. U2-Net: Going deeper with nested U-Structure for salient object detection. Pattern Recognition, 106:107404, 2020. 6

[31]

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision. In Int. Conf. Mach. Learn. (ICML), pages 8748–8763, 2021. 1

[32]

Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu, Ronghang Hu, Chaitanya Ryali, Tengyu Ma, Haitham Khedr, Roman Rädle, Chloe Rolland, Laura Gustafson, Eric Mintun, Junting Pan, Kalyan Vasudev Alwala, Nicolas Carion, Chao-Yuan Wu, Ross Girshick, Piotr Dollár, and Christoph Feichtenhofer. SAM 2: Segment anything in images and videos. arXiv preprint arXiv:..., 2024.

[33]

Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-Net: Convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer Assisted Intervention (MICCAI), pages 234–241, 2015. 2, 3

[34]

Chaitanya Ryali, Yuan-Ting Hu, Daniel Bolya, Chen Wei, Haoqi Fan, Po-Yao Huang, Vaibhav Aggarwal, Arkabandhu Chowdhury, Omid Poursaeed, Judy Hoffman, Jitendra Malik, Yanghao Li, and Christoph Feichtenhofer. Hiera: A hierarchical vision transformer without the bells-and-whistles. In Int. Conf. Mach. Learn. (ICML), pages 29441–29454, 2023. 2

[35]

Daniel Scharstein and Richard Szeliski. A taxonomy and evaluation of dense two-frame stereo correspondence algorithms. IJCV, 47:7–42, 2002. 2

[36]

Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR,

[37]

Przemysław Skurowski, Hassan Abdulameer, J. Błaszczyk, Tomasz Depta, Adam Kornacki, and Paweł Kozieł. Animal camouflage analysis: Chameleon database. Unpublished manuscript, 2018. 4, 5, 7

[38]

Lijun Wang, Huchuan Lu, Yifan Wang, Mengyang Feng, Dong Wang, Baocai Yin, and Xiang Ruan. Learning to detect salient objects with image-level supervision. In CVPR, pages 136–145, 2017. 4, 6

[39]

Qingwei Wang, Jinmiao Zheng, Guangyu Qian, Jinghui Dong, Ling Shao, and Ge-Peng Ji. Depth-aided camouflaged object detection. In ACM MM, pages 8298–8307, 2023. 2

[40]

Yi Wang, Ruili Deng, Qiong Pan, Mingchen Zhuge, Ge-Peng Ji, and Deng-Ping Fan. Pixels, regions, and objects: Multiple enhancement for salient object detection. In CVPR, pages 10031–10040, 2023. 6

[41]

Jun Wei, Shuhui Wang, and Qingming Huang. F3Net: Fusion, feedback and focus for salient object detection. In AAAI Conference on Artificial Intelligence (AAAI), pages 12321–12328, 2020. 4

[42]

Yu-Huan Wu, Yun Liu, Le Zhang, Ming-Ming Cheng, and Bo Hu. EDN: Salient object detection via extremely-downsampled network. IEEE TIP, 31:3542–3555, 2022. 6

[43]

Zongwei Wu, Guillaume Allibert, Fabrice Meriaudeau, Chao Ma, and Cédric Demonceaux. HiDAnet: RGB-D salient object detection via hierarchical depth awareness. IEEE Transactions on Image Processing (TIP), 32:2160–2173, 2023. 6

[44]

Mochu Xiang, Jing Zhang, Yunqiu Lv, Aixuan Li, Yiran Zhong, and Yuchao Dai. Exploring depth contribution for camouflaged object detection. arXiv preprint arXiv:2106.13217, 2021. 2

[45]

Chenxi Xie, Changqun Xia, Mingcan Ma, Zhirui Zhao, Xiaowu Chen, and Jia Li. Pyramid grafting network for one-stage high resolution saliency detection. In CVPR, pages 11717–11726, 2022. 4, 5, 6

[46]

Xinyu Xiong, Zihuang Wu, Shuangyi Tan, Wenxue Li, Feilong Tang, Ying Chen, Siying Li, Jie Ma, and Guanbin Li. SAM2-UNet: Segment anything 2 makes strong encoder for natural and medical image segmentation. Visual Intelligence, 4(1):2, 2026. 1, 3, 4, 5, 6, 7

[47]

Qiong Yan, Li Xu, Jianping Shi, and Jiaya Jia. Hierarchical saliency detection. In CVPR, pages 1155–1162, 2013. 4

[48]

Chuan Yang, Lihe Zhang, Huchuan Lu, Xiang Ruan, and Ming-Hsuan Yang. Saliency detection via graph-based manifold ranking. In CVPR, pages 3166–3173, 2013. 4

[49]

Bo-Wen Yin, Yan-Jie Zhang, Pengyu Zhou, Jifeng Zhao, Luc Van Gool, and Qibin Zhang. DFormerv2: Geometry self-attention for RGBD semantic segmentation. In CVPR, 2025. 6

[50]

Yi Zeng, Pingping Zhang, Jianming Zhang, Zhe Lin, and Huchuan Lu. Towards high-resolution salient object detection. In ICCV, pages 7234–7243, 2019. 4, 5, 6

[51]

Jing Zhang, Deng-Ping Fan, Yuchao Dai, Xin Yu, Yiran Zhong, Nick Barnes, and Ling Shao. RGB-D saliency detection via cascaded mutual information minimization. In ICCV, pages 4338–4347, 2021. 6

[52]

Jin Zhang, Zhao Liu, Yanliang Ye, Huibing Bi, and Deng-Ping Fan. Fastersal: Robust and real-time single-stream architecture for RGB-D salient object detection. IEEE TMM, 27:1507–1519, 2025. 6

[53]

Peng Zheng, Dehong Gao, Deng-Ping Fan, Li Liu, Jorma Laaksonen, Wanli Ouyang, and Nicu Sebe. Bilateral reference for high-resolution dichotomous image segmentation. CAAI Artif. Intell. Res., 3:9150038, 2024. 2, 5, 6, 7

[54]

Mingchen Zhuge, Deng-Ping Fan, Nian Liu, Dingwen Zhang, Dong Xu, and Ling Shao. Salient object detection via integrity learning. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 45(3):3738–3752,