arxiv: 2605.02616 · v1 · submitted 2026-05-04 · 💻 cs.CV

Global-Local Feature Decoding with Adapter-Guided SAMv2 for Salient Object Detection

Morteza Moradi, Mohammad Moradi, Simone Palazzo, Ali Borji, Concetto Spampinato

Pith reviewed 2026-05-08 18:12 UTC · model grok-4.3

keywords salient object detection · SAMv2 · foundation models · adapter tuning · global-local decoding · camouflaged object detection · frozen encoders · dual decoder fusion

The pith

GLASSNet freezes SAMv2, adds a lightweight adapter that cuts learnable encoder parameters by over 97 percent, and fuses global and local decoders to produce more accurate saliency maps than prior methods.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents GLASSNet as a way to apply foundation models to salient object detection without the usual costs of full training or fine-tuning on small datasets. It keeps the SAMv2 encoder frozen, inserts a small spatially aware convolutional adapter for task-specific tuning, and routes features through two separate decoders—one that integrates broad context over a wide field and one that preserves sharp local cues like edges—before combining their outputs into the final mask. Experiments across standard SOD and camouflaged detection benchmarks show this setup beats existing approaches. A reader would care because the design keeps compute low while still harvesting the generalization power of large pre-trained models, offering a practical route for dense prediction tasks where labeled data is scarce.

Core claim

GLASSNet is a Global-Local feature decoding framework that uses SAMv2 as a frozen encoder paired with a lightweight, spatially aware convolutional adapter (reducing learnable encoder parameters by over 97 percent) and a dual-decoder architecture in which one decoder captures global long-range semantics with an expanded receptive field while the other captures fine local details such as edges and textures; fusing these complementary cues yields saliency maps that combine global coherence with local precision.

What carries the argument

The adapter-guided frozen SAMv2 encoder together with the dual-decoder fusion, where the adapter enables targeted adaptation and the two decoders supply global semantics and local detail that are combined for the final output.
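
The fusion step can be pictured with a minimal sketch. The text here does not say which fusion operator the paper uses, so the weighted sum of logits below, the `alpha` mixing weight, and the function names are all assumptions made for illustration, not the authors' method.

```python
import math


def sigmoid(x: float) -> float:
    """Map a logit to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def fuse_logits(global_logits, local_logits, alpha=0.5):
    """Blend two same-sized 2D logit maps and squash them into a saliency map.

    alpha is a hypothetical mixing weight; the paper's actual fusion
    operator is not specified in the summary above.
    """
    if len(global_logits) != len(local_logits):
        raise ValueError("maps must have the same height")
    fused = []
    for g_row, l_row in zip(global_logits, local_logits):
        if len(g_row) != len(l_row):
            raise ValueError("maps must have the same width")
        fused.append(
            [sigmoid(alpha * g + (1.0 - alpha) * l) for g, l in zip(g_row, l_row)]
        )
    return fused


# Toy 2x2 example: the global branch marks the whole top row as salient,
# while the local branch disagrees on the top-right pixel.
global_logits = [[2.0, 2.0], [-2.0, -2.0]]
local_logits = [[4.0, -4.0], [-4.0, -4.0]]
saliency = fuse_logits(global_logits, local_logits)
```

With equal weights, the top-right pixel falls below 0.5 because the local branch's strong negative evidence outweighs the global branch's weaker positive evidence. A learned or attention-based fusion would behave differently.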

If this is right

The method produces higher-quality saliency masks on standard SOD benchmarks than current state-of-the-art techniques.
The same architecture also improves performance on camouflaged object detection tasks.
Freezing the encoder and using only a small adapter keeps the total number of trainable parameters low and reduces overfitting risk.
Global-local fusion supplies both coherent region detection and precise boundary localization in the output masks.
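
To make the "over 97 percent" figure concrete, the sketch below shows the arithmetic behind such a reduction. The parameter counts are hypothetical placeholders chosen for illustration only; they are not taken from the paper.

```python
def trainable_reduction(full_encoder_params: int, adapter_params: int) -> float:
    """Percentage drop in learnable encoder parameters when the encoder is
    frozen and only the adapter is trained."""
    if full_encoder_params <= 0:
        raise ValueError("full_encoder_params must be positive")
    if not 0 <= adapter_params <= full_encoder_params:
        raise ValueError("adapter_params must lie in [0, full_encoder_params]")
    return 100.0 * (1.0 - adapter_params / full_encoder_params)


# Hypothetical counts (NOT from the paper): a 200M-parameter encoder
# adapted through a 5M-parameter adapter.
full_encoder_params = 200_000_000
adapter_params = 5_000_000
reduction = trainable_reduction(full_encoder_params, adapter_params)
print(f"{reduction:.1f}% fewer learnable encoder parameters")
```

Under these placeholder numbers, the adapter would hold 2.5 percent of the encoder's parameter count, which is the kind of ratio the claim implies.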

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same frozen-plus-adapter pattern could be tested on other dense-prediction tasks such as semantic segmentation or monocular depth estimation to check whether global-local decoding transfers.
Replacing SAMv2 with a larger or differently pre-trained vision foundation model might further improve results if the adapter remains lightweight.
The low parameter count suggests the model could run on resource-constrained devices for real-time applications once the decoder fusion is optimized for speed.
Adding explicit boundary supervision during adapter training could strengthen the local decoder's contribution without increasing overall compute.

Load-bearing premise

The frozen SAMv2 encoder plus the lightweight adapter and dual-decoder fusion can reliably extract and combine saliency cues across diverse scenes without full fine-tuning or extra regularization to prevent overfitting on limited SOD data.

What would settle it

A new held-out benchmark containing scenes that differ markedly from the training distribution in which GLASSNet fails to exceed the accuracy of the strongest prior SOD methods while still using far fewer trainable parameters than those methods.

Figures

Figures reproduced from arXiv: 2605.02616 by Ali Borji, Concetto Spampinato, Mohammad Moradi, Morteza Moradi, Simone Palazzo.

**Figure 1.** Intermediate outputs of GLASSNet. Columns (a)–(f) show the input im…

**Figure 2.** Schematic overview of GLASSNet for saliency detection. A frozen SAMv2…

**Figure 3.** A lightweight adapter is embedded into the SAMv2 image encoder. It uses…

**Figure 4.** The AMFM captures global context using Criss-Cross Attention and fuses…

**Figure 5.** Qualitative comparisons show our model effectively reconstructs local and…

Original abstract

Salient Object Detection (SOD) remains an essential yet underexplored task in the era of large-scale vision models. Although foundation models like SAM exhibit strong generalization, their potential for SOD is not fully realized, and training or fully fine-tuning them is computationally expensive and prone to overfitting under limited data. To overcome these challenges, we introduce GLASSNet, a Global-Local feature decoding framework that uses SAMv2 as a frozen encoder paired with a lightweight, spatially aware convolutional adapter, reducing learnable encoder parameters by over 97%. To enhance saliency quality, GLASSNet employs a dual-decoder architecture: one decoder captures global, long-range semantics with an expanded receptive field, while the other captures fine local details such as edges and textures. Fusing these complementary cues yields saliency maps that combine global coherence with local precision, producing accurate final masks. Extensive experiments on standard SOD and camouflaged object detection benchmarks show that GLASSNet surpasses state-of-the-art methods, demonstrating the power of frozen foundation models combined with targeted adaptation and global-local decoding.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

GLASSNet freezes SAMv2 and adds a lightweight adapter and a global-local dual decoder, cutting learnable encoder parameters by over 97% while claiming better SOD results, but the abstract leaves the actual numbers and ablations for the full paper to verify.

GLASSNet freezes SAMv2 as the encoder, inserts a small spatially aware convolutional adapter, and runs two separate decoders, one focused on long-range semantics and the other on local edges and textures, before fusing the outputs into the final saliency map. The design cuts the learnable encoder parameters by over 97%, which directly addresses the cost and overfitting risks that come with full fine-tuning on limited SOD data. This pairing of adapter tuning with an explicit global-local split is a reasonable extension of existing adapter work and multi-branch decoder patterns rather than a complete shift.

The paper does a solid job laying out the architecture, parameter counts, and the motivation for avoiding full retraining, and the stress-test note indicates the manuscript includes benchmark tables that test the construction. If those tables show consistent gains across standard SOD and camouflaged-object datasets with proper baselines and ablations, the empirical side holds up. The central assumption, that the adapter can inject saliency-specific cues into the frozen features and that the fusion step reliably combines the two streams, looks plausible from the description and contains no internal contradictions. The soft spot is that the abstract itself supplies no metrics, so the strength of the SOTA claim rests entirely on the results section. Minor concerns are whether the fusion method is justified beyond simple concatenation and whether any failure modes or dataset-specific drops appear in the analysis.

This is the sort of targeted, resource-aware adaptation paper that suits computer vision researchers working on efficient use of foundation models for dense prediction. A reader looking for concrete recipes on adapter-guided SAM variants would find the details useful. It shows honest engagement with the practical constraints of the task and builds on cited prior work without circular claims. I would send it for peer review so referees can check the experimental details and ablations directly.

Referee Report

0 major / 3 minor

Summary. The manuscript introduces GLASSNet, a framework for salient object detection that pairs a frozen SAMv2 encoder with a lightweight spatially-aware convolutional adapter (reducing learnable encoder parameters by over 97%) and a dual-decoder architecture. One decoder captures global long-range semantics while the other extracts local edge and texture details; their fusion produces the final saliency maps. The central claim is that this design surpasses prior state-of-the-art methods on standard SOD and camouflaged-object-detection benchmarks.

Significance. If the reported benchmark gains hold, the work provides concrete evidence that targeted, parameter-efficient adaptation of large frozen foundation models can deliver competitive dense-prediction performance without full fine-tuning. The explicit global-local decoder split and the quantified parameter reduction are strengths that could influence subsequent adapter-based designs for other vision tasks.

minor comments (3)

[Abstract] The claim that GLASSNet 'surpasses state-of-the-art methods' is stated without any numerical results, metric names, or benchmark identifiers. Adding one or two key quantitative highlights would make the abstract self-contained and easier to evaluate.
[Method] The precise mechanism for fusing the global and local decoder outputs (addition, concatenation, attention, etc.) is described only at a high level. An explicit equation or pseudocode block would remove ambiguity about how complementary cues are combined.
[Experiments] While benchmark tables are supplied, the manuscript should confirm that all compared methods were evaluated under identical protocols (same training/validation splits, same post-processing) and should report standard SOD metrics (MAE, max/mean F-measure, E-measure, S-measure) uniformly across tables.
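
For readers unfamiliar with the metrics named in the last comment, the following sketch computes two of them, MAE and the F-measure, on a toy example. It assumes saliency values in [0, 1], a single fixed threshold of 0.5 (max and mean F-measure sweep many thresholds instead), and the β² = 0.3 weighting that is conventional in SOD evaluation. The function names are illustrative and do not come from the paper's evaluation code.

```python
def mae(saliency, ground_truth):
    """Mean absolute error between a saliency map and a binary mask."""
    total, count = 0.0, 0
    for s_row, g_row in zip(saliency, ground_truth):
        for s, g in zip(s_row, g_row):
            total += abs(s - g)
            count += 1
    return total / count


def f_measure(saliency, ground_truth, threshold=0.5, beta_sq=0.3):
    """F-measure at one binarisation threshold (beta^2 = 0.3 by convention)."""
    tp = fp = fn = 0
    for s_row, g_row in zip(saliency, ground_truth):
        for s, g in zip(s_row, g_row):
            predicted = s >= threshold
            positive = g >= 0.5
            if predicted and positive:
                tp += 1
            elif predicted:
                fp += 1
            elif positive:
                fn += 1
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return (1.0 + beta_sq) * precision * recall / (beta_sq * precision + recall)


# Toy 2x2 prediction with one false positive at the bottom-left pixel.
saliency = [[0.9, 0.8], [0.6, 0.1]]
ground_truth = [[1, 1], [0, 0]]
error = mae(saliency, ground_truth)
score = f_measure(saliency, ground_truth)
```

Because β² is below 1, the F-measure weights precision more heavily than recall, so the single false positive costs more than a single missed pixel would.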

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive assessment of GLASSNet and the recommendation for minor revision. The recognition of our parameter-efficient adaptation strategy and the global-local decoder split is appreciated, as these elements aim to demonstrate practical benefits of frozen foundation models for dense prediction tasks.

Circularity Check

0 steps flagged

No significant circularity in derivation or validation chain

The paper introduces GLASSNet as an architectural proposal: a frozen SAMv2 encoder augmented by a lightweight spatially-aware convolutional adapter and a dual-decoder (global semantics + local detail) fusion module. All load-bearing claims are empirical, resting on benchmark tables comparing against prior SOTA on standard SOD and camouflaged-object datasets. No equations, predictions, or uniqueness theorems are defined in terms of the target outputs; no self-citations serve as the sole justification for core premises; and no fitted parameters are relabeled as independent predictions. The validation is externally falsifiable via the reported metrics and is therefore independent of the model's own construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The approach rests on the domain assumption that SAMv2 features remain useful for saliency when only a small adapter is trained, plus the usual deep-learning premise that benchmark performance indicates generalization. No explicit free parameters or invented entities are named in the abstract.

axioms (1)

domain assumption: SAMv2 encoder features are sufficiently general for SOD when paired with a lightweight adapter.
The design freezes the encoder and relies on the adapter to bridge the domain gap.

pith-pipeline@v0.9.0 · 5499 in / 1143 out tokens · 63981 ms · 2026-05-08T18:12:28.772286+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Cost.FunctionalEquation (J(x) = ½(x + x⁻¹) − 1) · washburn_uniqueness_aczel · relation: unclear

Paper passage:

L = L_wIoU(S, G) + L_wBCE(S, G)
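
The quoted loss sums a weighted IoU term and a weighted BCE term between the predicted map S and the ground truth G. A minimal sketch of that structure follows. The page does not say how the paper builds its per-pixel weights, so the weight map is simply passed in. The smoothing constant and the normalisation by total weight are likewise assumptions, borrowed from a form commonly used in SOD work.

```python
import math


def weighted_bce(pred, target, weight, eps=1e-7):
    """Weighted binary cross-entropy, normalised by the total weight."""
    num = den = 0.0
    for p, g, w in zip(pred, target, weight):
        p = min(max(p, eps), 1.0 - eps)  # clamp so log() stays finite
        num += w * -(g * math.log(p) + (1.0 - g) * math.log(1.0 - p))
        den += w
    return num / den


def weighted_iou(pred, target, weight, smooth=1.0):
    """One minus a weighted soft IoU; smooth is an assumed stabiliser."""
    inter = sum(w * p * g for p, g, w in zip(pred, target, weight))
    total = sum(w * (p + g) for p, g, w in zip(pred, target, weight))
    return 1.0 - (inter + smooth) / (total - inter + smooth)


def combined_loss(pred, target, weight):
    """L = L_wIoU(S, G) + L_wBCE(S, G) over flattened maps."""
    return weighted_iou(pred, target, weight) + weighted_bce(pred, target, weight)


# Flattened 4-pixel example; the third pixel gets a hypothetical
# double weight, as a boundary pixel might.
target = [1.0, 0.0, 1.0, 0.0]
weight = [1.0, 1.0, 2.0, 1.0]
good = combined_loss([1.0, 0.0, 1.0, 0.0], target, weight)
bad = combined_loss([0.0, 1.0, 0.0, 1.0], target, weight)
```

A perfect prediction drives both terms to approximately zero, while the inverted prediction is penalised by both the region-level IoU term and the pixel-level BCE term.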

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
