Global-Local Feature Decoding with Adapter-Guided SAMv2 for Salient Object Detection
Pith reviewed 2026-05-08 18:12 UTC · model grok-4.3
The pith
GLASSNet freezes SAMv2, adds a lightweight adapter cutting parameters by over 97 percent, and fuses global and local decoders to produce more accurate saliency maps than prior methods.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
GLASSNet is a Global-Local feature decoding framework that uses SAMv2 as a frozen encoder paired with a lightweight, spatially aware convolutional adapter, which reduces learnable encoder parameters by over 97 percent. Its dual-decoder architecture uses one decoder to capture global long-range semantics with an expanded receptive field and the other to capture fine local details such as edges and textures. Fusing these complementary cues yields saliency maps that combine global coherence with local precision.
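To make the "over 97 percent" figure concrete, the reduction can be read as one minus the ratio of trainable adapter parameters to the parameters that full fine-tuning of the encoder would update. The sketch below only illustrates that arithmetic. The function name `trainable_reduction` and both parameter counts are hypothetical placeholders, not figures from the paper.

```python
def trainable_reduction(encoder_params: int, adapter_params: int) -> float:
    """Fraction by which trainable encoder-side parameters shrink when the
    encoder is frozen and only an adapter is trained."""
    if encoder_params <= 0:
        raise ValueError("encoder_params must be positive")
    if adapter_params < 0:
        raise ValueError("adapter_params must be non-negative")
    return 1.0 - adapter_params / encoder_params


# Hypothetical counts chosen only for illustration (not from the paper).
ENCODER_PARAMS = 200_000_000  # parameters that full fine-tuning would update
ADAPTER_PARAMS = 5_000_000    # parameters in the lightweight adapter

reduction = trainable_reduction(ENCODER_PARAMS, ADAPTER_PARAMS)
print(f"trainable-parameter reduction: {reduction:.1%}")
```

With these placeholder counts the adapter holds 2.5 percent of the encoder's parameters, so the reduction clears the 97 percent threshold the claim states.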
What carries the argument
The adapter-guided frozen SAMv2 encoder together with the dual-decoder fusion, where the adapter enables targeted adaptation and the two decoders supply global semantics and local detail that are combined for the final output.
If this is right
- The method produces higher-quality saliency masks on standard SOD benchmarks than current state-of-the-art techniques.
- The same architecture also improves performance on camouflaged object detection tasks.
- Freezing the encoder and using only a small adapter keeps the total number of trainable parameters low and reduces overfitting risk.
- Global-local fusion supplies both coherent region detection and precise boundary localization in the output masks.
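The page does not state how the two decoder outputs are fused, and the referee report raises the same point. The following is therefore only one plausible reading: a per-pixel convex combination of the global and local logits, followed by a sigmoid. The function name `fuse_saliency`, the mixing weight `alpha`, and the toy inputs are all assumptions made for illustration, not GLASSNet's actual fusion rule.

```python
import math
from typing import List

Map2D = List[List[float]]


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def fuse_saliency(global_logits: Map2D, local_logits: Map2D,
                  alpha: float = 0.5) -> Map2D:
    """Blend two same-sized logit maps per pixel, then squash to [0, 1].

    alpha weights the global decoder and (1 - alpha) the local decoder.
    This is an assumed fusion rule, not the one used in the paper.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if len(global_logits) != len(local_logits):
        raise ValueError("maps must have the same number of rows")
    fused: Map2D = []
    for g_row, l_row in zip(global_logits, local_logits):
        if len(g_row) != len(l_row):
            raise ValueError("maps must have the same number of columns")
        fused.append([sigmoid(alpha * g + (1.0 - alpha) * l)
                      for g, l in zip(g_row, l_row)])
    return fused


# Toy 2x2 example: the global map marks the left column as salient,
# while the local map responds strongly only at the top-left pixel.
global_map = [[2.0, -2.0], [2.0, -2.0]]
local_map = [[4.0, -4.0], [-1.0, 0.0]]
for row in fuse_saliency(global_map, local_map, alpha=0.5):
    print([round(v, 3) for v in row])
```

A fixed scalar `alpha` is the simplest choice. A learned or spatially varying weight (for example, from an attention map) would be a natural alternative, but nothing on this page confirms which variant the authors use.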
Where Pith is reading between the lines
- The same frozen-plus-adapter pattern could be tested on other dense-prediction tasks such as semantic segmentation or monocular depth estimation to check whether global-local decoding transfers.
- Replacing SAMv2 with a larger or differently pre-trained vision foundation model might further improve results if the adapter remains lightweight.
- The low parameter count suggests the model could run on resource-constrained devices for real-time applications once the decoder fusion is optimized for speed.
- Adding explicit boundary supervision during adapter training could strengthen the local decoder's contribution without increasing overall compute.
Load-bearing premise
The frozen SAMv2 encoder plus the lightweight adapter and dual-decoder fusion can reliably extract and combine saliency cues across diverse scenes without full fine-tuning or extra regularization to prevent overfitting on limited SOD data.
What would settle it
A new held-out benchmark containing scenes that differ markedly from the training distribution in which GLASSNet fails to exceed the accuracy of the strongest prior SOD methods while still using far fewer trainable parameters than those methods.
Original abstract
Salient Object Detection (SOD) remains an essential yet underexplored task in the era of large-scale vision models. Although foundation models like SAM exhibit strong generalization, their potential for SOD is not fully realized, and training or fully fine-tuning them is computationally expensive and prone to overfitting under limited data. To overcome these challenges, we introduce GLASSNet, a Global-Local feature decoding framework that uses SAMv2 as a frozen encoder paired with a lightweight, spatially aware convolutional adapter, reducing learnable encoder parameters by over 97%. To enhance saliency quality, GLASSNet employs a dual-decoder architecture: one decoder captures global, long-range semantics with an expanded receptive field, while the other captures fine local details such as edges and textures. Fusing these complementary cues yields saliency maps that combine global coherence with local precision, producing accurate final masks. Extensive experiments on standard SOD and camouflaged object detection benchmarks show that GLASSNet surpasses state-of-the-art methods, demonstrating the power of frozen foundation models combined with targeted adaptation and global-local decoding.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces GLASSNet, a framework for salient object detection that pairs a frozen SAMv2 encoder with a lightweight spatially-aware convolutional adapter (reducing learnable encoder parameters by over 97%) and a dual-decoder architecture. One decoder captures global long-range semantics while the other extracts local edge and texture details; their fusion produces the final saliency maps. The central claim is that this design surpasses prior state-of-the-art methods on standard SOD and camouflaged-object-detection benchmarks.
Significance. If the reported benchmark gains hold, the work provides concrete evidence that targeted, parameter-efficient adaptation of large frozen foundation models can deliver competitive dense-prediction performance without full fine-tuning. The explicit global-local decoder split and the quantified parameter reduction are strengths that could influence subsequent adapter-based designs for other vision tasks.
minor comments (3)
- [Abstract] The claim that GLASSNet 'surpasses state-of-the-art methods' is stated without any numerical results, metric names, or benchmark identifiers. Adding one or two key quantitative highlights would make the abstract self-contained and easier to evaluate.
- [Method] The precise mechanism for fusing the global and local decoder outputs (addition, concatenation, attention, etc.) is described only at a high level. An explicit equation or pseudocode block would remove ambiguity about how complementary cues are combined.
- [Experiments] While benchmark tables are supplied, the manuscript should confirm that all compared methods were evaluated under identical protocols (same training/validation splits, same post-processing) and should report standard SOD metrics (MAE, max/mean F-measure, E-measure, S-measure) uniformly across tables.
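Two of the metrics named in the last comment have simple, widely used definitions. MAE is the mean absolute difference between the predicted saliency map and the binary ground truth. The F-measure is a weighted harmonic mean of precision and recall, conventionally computed with β² = 0.3 in the SOD literature. The sketch below computes both on flattened maps at a single threshold. The function names, the threshold of 0.5, and the toy data are illustrative, and the paper's exact evaluation protocol is not specified on this page.

```python
from typing import Sequence


def mae(pred: Sequence[float], gt: Sequence[int]) -> float:
    """Mean absolute error between predictions in [0, 1] and a binary mask."""
    if len(pred) != len(gt) or len(pred) == 0:
        raise ValueError("pred and gt must be non-empty and the same length")
    return sum(abs(p - g) for p, g in zip(pred, gt)) / len(pred)


def f_measure(pred: Sequence[float], gt: Sequence[int],
              threshold: float = 0.5, beta_sq: float = 0.3) -> float:
    """F-measure of the thresholded prediction against a binary mask."""
    if len(pred) != len(gt) or len(pred) == 0:
        raise ValueError("pred and gt must be non-empty and the same length")
    binary = [1 if p >= threshold else 0 for p in pred]
    true_pos = sum(1 for b, g in zip(binary, gt) if b == 1 and g == 1)
    if true_pos == 0:
        return 0.0  # precision and recall are both zero (or undefined)
    precision = true_pos / sum(binary)
    recall = true_pos / sum(1 for g in gt if g == 1)
    return (1.0 + beta_sq) * precision * recall / (beta_sq * precision + recall)


# Toy flattened 2x2 map: one salient pixel found, one missed, one false alarm.
prediction = [0.9, 0.8, 0.2, 0.1]
ground_truth = [1, 0, 1, 0]
print("MAE:", mae(prediction, ground_truth))
print("F-measure:", f_measure(prediction, ground_truth))
```

The max and mean F-measure variants sweep `threshold` over a range of values and take the maximum or the average, which is one reason identical protocols across compared methods matter.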
Simulated Author's Rebuttal
We thank the referee for the positive assessment of GLASSNet and the recommendation for minor revision. The recognition of our parameter-efficient adaptation strategy and the global-local decoder split is appreciated, as these elements aim to demonstrate practical benefits of frozen foundation models for dense prediction tasks.
Circularity Check
No significant circularity in derivation or validation chain
full rationale
The paper introduces GLASSNet as an architectural proposal: a frozen SAMv2 encoder augmented by a lightweight spatially-aware convolutional adapter and a dual-decoder (global semantics + local detail) fusion module. All load-bearing claims are empirical, resting on benchmark tables comparing against prior SOTA on standard SOD and camouflaged-object datasets. No equations, predictions, or uniqueness theorems are defined in terms of the target outputs; no self-citations serve as the sole justification for core premises; and no fitted parameters are relabeled as independent predictions. The validation is externally falsifiable via the reported metrics and is therefore independent of the model's own construction.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: SAMv2 encoder features are sufficiently general for SOD when paired with a lightweight adapter.
Lean theorems connected to this paper
- Cost.FunctionalEquation (J(x) = ½(x + x⁻¹) − 1) · washburn_uniqueness_aczel
  Tag: unclear (relation between the paper passage and the cited Recognition theorem).
  Paper passage: L = L_wIoU(S, G) + L_wBCE(S, G)
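The quoted passage gives the training loss as the sum of a weighted IoU term and a weighted BCE term between the predicted map S and the ground truth G. The page does not say how the per-pixel weights are derived, so the sketch below takes them as an explicit argument (uniform by default) and follows a common formulation of the two terms. The function names, the smoothing constant of 1 in the IoU term, and the weighting scheme are assumptions rather than the paper's definition.

```python
import math
from typing import Optional, Sequence


def weighted_bce(s: Sequence[float], g: Sequence[float],
                 w: Sequence[float], eps: float = 1e-7) -> float:
    """Weight-normalised binary cross-entropy over flattened maps."""
    total = 0.0
    for si, gi, wi in zip(s, g, w):
        si = min(max(si, eps), 1.0 - eps)  # clamp to keep log() finite
        total += wi * -(gi * math.log(si) + (1.0 - gi) * math.log(1.0 - si))
    return total / sum(w)


def weighted_iou(s: Sequence[float], g: Sequence[float],
                 w: Sequence[float]) -> float:
    """One minus the weighted soft IoU, smoothed by adding 1 to both parts."""
    inter = sum(wi * si * gi for si, gi, wi in zip(s, g, w))
    union = sum(wi * (si + gi) for si, gi, wi in zip(s, g, w)) - inter
    return 1.0 - (inter + 1.0) / (union + 1.0)


def saliency_loss(s: Sequence[float], g: Sequence[float],
                  w: Optional[Sequence[float]] = None) -> float:
    """L = L_wIoU(S, G) + L_wBCE(S, G) on flattened probability maps."""
    if w is None:
        w = [1.0] * len(s)  # uniform weights: an assumption, see lead-in
    if not (len(s) == len(g) == len(w)) or len(s) == 0:
        raise ValueError("s, g and w must be non-empty and the same length")
    if sum(w) <= 0:
        raise ValueError("weights must sum to a positive value")
    return weighted_iou(s, g, w) + weighted_bce(s, g, w)


# Toy flattened 2x2 example: a confident, mostly correct prediction.
ground_truth = [1.0, 0.0, 1.0, 0.0]
prediction = [0.9, 0.1, 0.9, 0.1]
print("loss:", saliency_loss(prediction, ground_truth))
```

Weighted variants of this loss in the segmentation literature often assign larger weights near object boundaries. Passing such a map as `w` is how that choice would plug into this sketch.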
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.