Global-Local Feature Decoding with Adapter-Guided SAMv2 for Salient Object Detection
Pith reviewed 2026-05-08 18:12 UTC · model grok-4.3
The pith
GLASSNet freezes SAMv2, adds a lightweight adapter cutting parameters by over 97 percent, and fuses global and local decoders to produce more accurate saliency maps than prior methods.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
GLASSNet is a Global-Local feature decoding framework that uses SAMv2 as a frozen encoder paired with a lightweight, spatially aware convolutional adapter, which reduces learnable encoder parameters by over 97 percent. It adds a dual-decoder architecture in which one decoder captures global long-range semantics with an expanded receptive field while the other captures fine local details such as edges and textures. Fusing these complementary cues yields saliency maps that combine global coherence with local precision.
What carries the argument
The adapter-guided frozen SAMv2 encoder together with the dual-decoder fusion, where the adapter enables targeted adaptation and the two decoders supply global semantics and local detail that are combined for the final output.
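The fusion step is described only at a high level, so the following is a minimal sketch of one plausible reading, not the authors' implementation. It assumes each decoder emits a per-pixel logit map, blends the two maps with a scalar weight, and applies a sigmoid to obtain saliency probabilities. The function name `fuse_logits`, the weight `alpha`, and the toy 2x2 maps are all hypothetical.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def fuse_logits(global_logits, local_logits, alpha=0.5):
    """Blend two same-sized logit maps and return saliency probabilities.

    alpha weights the global decoder; (1 - alpha) weights the local one.
    The weighted sum is an assumed fusion rule chosen for illustration only.
    """
    if len(global_logits) != len(local_logits):
        raise ValueError("maps must have the same height")
    fused = []
    for g_row, l_row in zip(global_logits, local_logits):
        if len(g_row) != len(l_row):
            raise ValueError("maps must have the same width")
        fused.append([sigmoid(alpha * g + (1.0 - alpha) * l)
                      for g, l in zip(g_row, l_row)])
    return fused


# Toy example (hypothetical values): the global map marks the top row as
# salient, while the local map is confident about a single top-left pixel.
global_map = [[2.0, 2.0], [-2.0, -2.0]]
local_map = [[4.0, 0.0], [0.0, -4.0]]
saliency = fuse_logits(global_map, local_map, alpha=0.5)
```

In this toy case the top-left pixel, supported by both decoders, receives the highest probability, which is the behavior a global-local fusion is meant to produce. Concatenation followed by a learned convolution, or attention-based fusion, would be equally consistent with the description.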
If this is right
- The method produces higher-quality saliency masks on standard SOD benchmarks than current state-of-the-art techniques.
- The same architecture also improves performance on camouflaged object detection tasks.
- Freezing the encoder and using only a small adapter keeps the total number of trainable parameters low and reduces overfitting risk.
- Global-local fusion supplies both coherent region detection and precise boundary localization in the output masks.
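As a worked illustration of the parameter claim, the sketch below computes the reduction in learnable encoder parameters when the encoder is frozen and only an adapter is trained. It assumes the reduction is measured against fully fine-tuning the encoder, which is one plausible reading of the "over 97 percent" figure. The parameter counts are hypothetical placeholders, not values from the paper.

```python
def trainable_reduction(encoder_params, adapter_params):
    """Fraction by which learnable encoder parameters shrink.

    Baseline (assumed): full fine-tuning trains all encoder_params.
    Adapter setting: the encoder is frozen and only adapter_params are trained.
    """
    if encoder_params <= 0 or adapter_params < 0:
        raise ValueError("parameter counts must be positive")
    return 1.0 - adapter_params / encoder_params


# Hypothetical counts chosen only to show the arithmetic.
reduction = trainable_reduction(encoder_params=200_000_000,
                                adapter_params=5_000_000)
print(f"reduction in learnable encoder parameters: {reduction:.1%}")
```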
Where Pith is reading between the lines
- The same frozen-plus-adapter pattern could be tested on other dense-prediction tasks such as semantic segmentation or monocular depth estimation to check whether global-local decoding transfers.
- Replacing SAMv2 with a larger or differently pre-trained vision foundation model might further improve results if the adapter remains lightweight.
- The low parameter count suggests the model could run on resource-constrained devices for real-time applications once the decoder fusion is optimized for speed.
- Adding explicit boundary supervision during adapter training could strengthen the local decoder's contribution without increasing overall compute.
Load-bearing premise
The frozen SAMv2 encoder plus the lightweight adapter and dual-decoder fusion can reliably extract and combine saliency cues across diverse scenes without full fine-tuning or extra regularization to prevent overfitting on limited SOD data.
What would settle it
A new held-out benchmark containing scenes that differ markedly from the training distribution in which GLASSNet fails to exceed the accuracy of the strongest prior SOD methods while still using far fewer trainable parameters than those methods.
Original abstract
Salient Object Detection (SOD) remains an essential yet underexplored task in the era of large-scale vision models. Although foundation models like SAM exhibit strong generalization, their potential for SOD is not fully realized, and training or fully fine-tuning them is computationally expensive and prone to overfitting under limited data. To overcome these challenges, we introduce GLASSNet, a Global-Local feature decoding framework that uses SAMv2 as a frozen encoder paired with a lightweight, spatially aware convolutional adapter, reducing learnable encoder parameters by over 97%. To enhance saliency quality, GLASSNet employs a dual-decoder architecture: one decoder captures global, long-range semantics with an expanded receptive field, while the other captures fine local details such as edges and textures. Fusing these complementary cues yields saliency maps that combine global coherence with local precision, producing accurate final masks. Extensive experiments on standard SOD and camouflaged object detection benchmarks show that GLASSNet surpasses state-of-the-art methods, demonstrating the power of frozen foundation models combined with targeted adaptation and global-local decoding.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces GLASSNet, a framework for salient object detection that pairs a frozen SAMv2 encoder with a lightweight spatially-aware convolutional adapter (reducing learnable encoder parameters by over 97%) and a dual-decoder architecture. One decoder captures global long-range semantics while the other extracts local edge and texture details; their fusion produces the final saliency maps. The central claim is that this design surpasses prior state-of-the-art methods on standard SOD and camouflaged-object-detection benchmarks.
Significance. If the reported benchmark gains hold, the work provides concrete evidence that targeted, parameter-efficient adaptation of large frozen foundation models can deliver competitive dense-prediction performance without full fine-tuning. The explicit global-local decoder split and the quantified parameter reduction are strengths that could influence subsequent adapter-based designs for other vision tasks.
minor comments (3)
- [Abstract] Abstract: the claim that GLASSNet 'surpasses state-of-the-art methods' is stated without any numerical results, metric names, or benchmark identifiers. Adding one or two key quantitative highlights would make the abstract self-contained and easier to evaluate.
- [Method] Method section: the precise mechanism for fusing the global and local decoder outputs (addition, concatenation, attention, etc.) is described only at a high level. An explicit equation or pseudocode block would remove ambiguity about how complementary cues are combined.
- [Experiments] Experiments: while benchmark tables are supplied, the manuscript should confirm that all compared methods were evaluated under identical protocols (same training/validation splits, same post-processing) and should report standard SOD metrics (MAE, max/mean F-measure, E-measure, S-measure) uniformly across tables.
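Two of the metrics named in the last comment can be sketched in a few lines. The following is a minimal illustration, assuming saliency maps are nested lists of probabilities in [0, 1] and ground truth is binary. The fixed threshold of 0.5 is a simplification: max and mean F-measure sweep the threshold instead. The value beta² = 0.3 follows common SOD practice. Function names and the toy data are hypothetical.

```python
def mae(pred, gt):
    """Mean absolute error between a saliency map and a binary mask."""
    total, count = 0.0, 0
    for p_row, g_row in zip(pred, gt):
        for p, g in zip(p_row, g_row):
            total += abs(p - g)
            count += 1
    return total / count


def f_measure(pred, gt, threshold=0.5, beta_sq=0.3):
    """F-measure at a single threshold; beta_sq=0.3 emphasizes precision."""
    tp = fp = fn = 0
    for p_row, g_row in zip(pred, gt):
        for p, g in zip(p_row, g_row):
            positive = p >= threshold
            if positive and g == 1:
                tp += 1
            elif positive and g == 0:
                fp += 1
            elif not positive and g == 1:
                fn += 1
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return (1 + beta_sq) * precision * recall / (beta_sq * precision + recall)


# Toy 2x2 example (hypothetical values).
pred = [[0.9, 0.6], [0.2, 0.1]]
gt = [[1, 0], [0, 0]]
print(mae(pred, gt), f_measure(pred, gt))
```

Reporting both numbers for every compared method, computed by the same script, is one way to satisfy the identical-protocol requirement.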
Simulated Author's Rebuttal
We thank the referee for the positive assessment of GLASSNet and the recommendation for minor revision. The recognition of our parameter-efficient adaptation strategy and the global-local decoder split is appreciated, as these elements aim to demonstrate practical benefits of frozen foundation models for dense prediction tasks.
Circularity Check
No significant circularity in derivation or validation chain
full rationale
The paper introduces GLASSNet as an architectural proposal: a frozen SAMv2 encoder augmented by a lightweight spatially-aware convolutional adapter and a dual-decoder (global semantics + local detail) fusion module. All load-bearing claims are empirical, resting on benchmark tables comparing against prior SOTA on standard SOD and camouflaged-object datasets. No equations, predictions, or uniqueness theorems are defined in terms of the target outputs; no self-citations serve as the sole justification for core premises; and no fitted parameters are relabeled as independent predictions. The validation is externally falsifiable via the reported metrics and is therefore independent of the model's own construction.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: SAMv2 encoder features are sufficiently general for SOD when paired with a lightweight adapter
Lean theorems connected to this paper
- Cost.FunctionalEquation (J(x) = ½(x + x⁻¹) − 1), washburn_uniqueness_aczel: unclear
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: L = L_wIoU(S, G) + L_wBCE(S, G)
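The quoted loss sums a weighted IoU term and a weighted BCE term between the predicted map S and the ground truth G. The sketch below shows one common formulation of such a loss, popularized by F3Net; whether GLASSNet uses exactly this weighting and smoothing is an assumption. Per-pixel weights are passed in rather than derived, the `+ 1.0` smoothing constants are borrowed from that common implementation, and all names are hypothetical.

```python
import math


def weighted_bce_iou_loss(pred, gt, weights, eps=1e-7):
    """Return L_wBCE + L_wIoU for flat, equal-length sequences.

    pred: probabilities in [0, 1]; gt: 0/1 labels; weights: positive
    per-pixel weights (for example, larger near object boundaries).
    """
    if not pred or not (len(pred) == len(gt) == len(weights)):
        raise ValueError("inputs must be non-empty and equally long")
    bce = inter = total = 0.0
    for p, g, w in zip(pred, gt, weights):
        p_c = min(max(p, eps), 1.0 - eps)  # clamp to keep log finite
        bce -= w * (g * math.log(p_c) + (1 - g) * math.log(1.0 - p_c))
        inter += w * p * g
        total += w * (p + g)
    l_wbce = bce / sum(weights)
    # Weighted IoU with +1 smoothing; (total - inter) is the weighted union.
    l_wiou = 1.0 - (inter + 1.0) / (total - inter + 1.0)
    return l_wbce + l_wiou


# Toy checks with uniform weights (hypothetical values).
good = weighted_bce_iou_loss([1.0, 0.0], [1, 0], [1.0, 1.0])
bad = weighted_bce_iou_loss([0.0, 1.0], [1, 0], [1.0, 1.0])
```

A near-perfect prediction drives both terms toward zero, while an inverted prediction is penalized by both, so the two terms reinforce rather than compete with each other.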
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.