pith. the verified trust layer for science.

arxiv: 2605.02616 · v1 · submitted 2026-05-04 · 💻 cs.CV

Global-Local Feature Decoding with Adapter-Guided SAMv2 for Salient Object Detection

Pith reviewed 2026-05-08 18:12 UTC · model grok-4.3

classification 💻 cs.CV
keywords salient object detection · SAMv2 · foundation models · adapter tuning · global-local decoding · camouflaged object detection · frozen encoders · dual decoder fusion

The pith

GLASSNet freezes the SAMv2 encoder, adds a lightweight adapter that cuts learnable encoder parameters by over 97 percent, and fuses the outputs of a global and a local decoder to produce saliency maps that the paper reports as more accurate than those of prior methods.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents GLASSNet as a way to apply foundation models to salient object detection without the usual costs of full training or fine-tuning on small datasets. It keeps the SAMv2 encoder frozen, inserts a small spatially aware convolutional adapter for task-specific tuning, and routes features through two separate decoders (one that integrates broad context over a wide field and one that preserves sharp local cues like edges) before combining their outputs into the final mask. Experiments across standard SOD and camouflaged object detection benchmarks show this setup beats existing approaches. A reader would care because the design keeps compute low while still harvesting the generalization power of large pre-trained models, offering a practical route for dense prediction tasks where labeled data is scarce.

Core claim

GLASSNet is a Global-Local feature decoding framework that uses SAMv2 as a frozen encoder paired with a lightweight, spatially aware convolutional adapter (reducing learnable encoder parameters by over 97 percent) and a dual-decoder architecture in which one decoder captures global long-range semantics with an expanded receptive field while the other captures fine local details such as edges and textures; fusing these complementary cues yields saliency maps that combine global coherence with local precision.
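To make the "over 97 percent" figure concrete, the sketch below shows how such a reduction could be computed from parameter counts. It is only an illustration: the adapter layout (1x1 down-projection, 3x3 depthwise convolution, 1x1 up-projection), the encoder size, the stage widths, and the bottleneck width are all assumptions chosen for the example, not values taken from the paper.

```python
def conv_params(c_in, c_out, kernel, groups=1, bias=True):
    """Learnable parameters in a 2D convolution layer."""
    weights = (c_in // groups) * c_out * kernel * kernel
    return weights + (c_out if bias else 0)


def adapter_params(channels, bottleneck):
    """Hypothetical adapter: 1x1 down-projection, 3x3 depthwise conv, 1x1 up-projection."""
    down = conv_params(channels, bottleneck, kernel=1)
    spatial = conv_params(bottleneck, bottleneck, kernel=3, groups=bottleneck)
    up = conv_params(bottleneck, channels, kernel=1)
    return down + spatial + up


def learnable_reduction(frozen_encoder_params, trainable_params):
    """Fraction of encoder parameters that no longer receive gradient updates."""
    return 1.0 - trainable_params / frozen_encoder_params


# Illustrative numbers only; none of these come from the paper.
ENCODER_PARAMS = 200_000_000            # assumed size of the frozen encoder
STAGE_CHANNELS = [144, 288, 576, 1152]  # assumed per-stage feature widths
BOTTLENECK = 32                         # assumed adapter bottleneck width

trainable = sum(adapter_params(c, BOTTLENECK) for c in STAGE_CHANNELS)
reduction = learnable_reduction(ENCODER_PARAMS, trainable)
print(f"trainable adapter parameters: {trainable:,}")
print(f"reduction in learnable encoder parameters: {reduction:.2%}")
```

The point of the arithmetic is that a bottlenecked convolutional adapter grows linearly with the stage width, so even one adapter per stage stays small next to a frozen backbone with hundreds of millions of weights.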

What carries the argument

The adapter-guided frozen SAMv2 encoder together with the dual-decoder fusion, where the adapter enables targeted adaptation and the two decoders supply global semantics and local detail that are combined for the final output.
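The summary does not say how the two decoder outputs are combined, so the following is a minimal sketch of one plausible option rather than the authors' fusion module. It assumes each decoder emits a single-channel logit map and fuses them with a per-pixel weighted sum followed by a sigmoid, which is what concatenation plus a 1x1 convolution with one output channel computes. The function name `fuse_maps`, the weights, and the toy maps are hypothetical.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def fuse_maps(global_logits, local_logits, w_global=1.0, w_local=1.0, bias=0.0):
    """Per-pixel weighted sum of two logit maps followed by a sigmoid.

    With the default weights this reduces to plain addition of the logits.
    The weights stand in for parameters that would normally be learned.
    """
    if len(global_logits) != len(local_logits):
        raise ValueError("maps must have the same height")
    fused = []
    for g_row, l_row in zip(global_logits, local_logits):
        if len(g_row) != len(l_row):
            raise ValueError("maps must have the same width")
        fused.append([sigmoid(w_global * g + w_local * l + bias)
                      for g, l in zip(g_row, l_row)])
    return fused


# Toy 2x3 logit maps (made up): the global map is confident about the region,
# the local map responds mostly in the column next to the object boundary.
global_logits = [[2.0, 2.0, -2.0],
                 [2.0, 2.0, -2.0]]
local_logits = [[0.0, 3.0, -3.0],
                [0.0, 3.0, -3.0]]

saliency = fuse_maps(global_logits, local_logits, w_global=0.6, w_local=0.4)
mask = [[1 if p >= 0.5 else 0 for p in row] for row in saliency]
print(mask)
```

In this toy case the global map alone already decides region membership, while the local map sharpens the probabilities in the column next to the boundary; an attention-based fusion would replace the fixed weights with values predicted per pixel.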

If this is right

  • The method produces higher-quality saliency masks on standard SOD benchmarks than current state-of-the-art techniques.
  • The same architecture also improves performance on camouflaged object detection tasks.
  • Freezing the encoder and using only a small adapter keeps the total number of trainable parameters low and reduces overfitting risk.
  • Global-local fusion supplies both coherent region detection and precise boundary localization in the output masks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same frozen-plus-adapter pattern could be tested on other dense-prediction tasks such as semantic segmentation or monocular depth estimation to check whether global-local decoding transfers.
  • Replacing SAMv2 with a larger or differently pre-trained vision foundation model might further improve results if the adapter remains lightweight.
  • The low parameter count suggests the model could run on resource-constrained devices for real-time applications once the decoder fusion is optimized for speed.
  • Adding explicit boundary supervision during adapter training could strengthen the local decoder's contribution without increasing overall compute.

Load-bearing premise

The frozen SAMv2 encoder plus the lightweight adapter and dual-decoder fusion can reliably extract and combine saliency cues across diverse scenes without full fine-tuning or extra regularization to prevent overfitting on limited SOD data.

What would settle it

A new held-out benchmark containing scenes that differ markedly from the training distribution in which GLASSNet fails to exceed the accuracy of the strongest prior SOD methods while still using far fewer trainable parameters than those methods.

Figures

Figures reproduced from arXiv: 2605.02616 by Ali Borji, Concetto Spampinato, Mohammad Moradi, Morteza Moradi, Simone Palazzo.

Figure 1: Intermediate outputs of GLASSNet. Columns (a)–(f) show the input im…
Figure 2: Schematic overview of GLASSNet for saliency detection. A frozen SAMv2…
Figure 3: A lightweight adapter is embedded into the SAMv2 image encoder. It uses…
Figure 4: The AMFM captures global context using Criss-Cross Attention and fuses…
Figure 5: Qualitative comparisons show our model effectively reconstructs local and…
Original abstract

Salient Object Detection (SOD) remains an essential yet underexplored task in the era of large-scale vision models. Although foundation models like SAM exhibit strong generalization, their potential for SOD is not fully realized, and training or fully fine-tuning them is computationally expensive and prone to overfitting under limited data. To overcome these challenges, we introduce GLASSNet, a Global-Local feature decoding framework that uses SAMv2 as a frozen encoder paired with a lightweight, spatially aware convolutional adapter, reducing learnable encoder parameters by over 97%. To enhance saliency quality, GLASSNet employs a dual-decoder architecture: one decoder captures global, long-range semantics with an expanded receptive field, while the other captures fine local details such as edges and textures. Fusing these complementary cues yields saliency maps that combine global coherence with local precision, producing accurate final masks. Extensive experiments on standard SOD and camouflaged object detection benchmarks show that GLASSNet surpasses state-of-the-art methods, demonstrating the power of frozen foundation models combined with targeted adaptation and global-local decoding.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 3 minor

Summary. The manuscript introduces GLASSNet, a framework for salient object detection that pairs a frozen SAMv2 encoder with a lightweight spatially-aware convolutional adapter (reducing learnable encoder parameters by over 97%) and a dual-decoder architecture. One decoder captures global long-range semantics while the other extracts local edge and texture details; their fusion produces the final saliency maps. The central claim is that this design surpasses prior state-of-the-art methods on standard SOD and camouflaged-object-detection benchmarks.

Significance. If the reported benchmark gains hold, the work provides concrete evidence that targeted, parameter-efficient adaptation of large frozen foundation models can deliver competitive dense-prediction performance without full fine-tuning. The explicit global-local decoder split and the quantified parameter reduction are strengths that could influence subsequent adapter-based designs for other vision tasks.

minor comments (3)
  1. [Abstract] Abstract: the claim that GLASSNet 'surpasses state-of-the-art methods' is stated without any numerical results, metric names, or benchmark identifiers. Adding one or two key quantitative highlights would make the abstract self-contained and easier to evaluate.
  2. [Method] Method section: the precise mechanism for fusing the global and local decoder outputs (addition, concatenation, attention, etc.) is described only at a high level. An explicit equation or pseudocode block would remove ambiguity about how complementary cues are combined.
  3. [Experiments] Experiments: while benchmark tables are supplied, the manuscript should confirm that all compared methods were evaluated under identical protocols (same training/validation splits, same post-processing) and should report standard SOD metrics (MAE, max/mean F-measure, E-measure, S-measure) uniformly across tables.
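For readers unfamiliar with the metrics named in the third comment, two of them can be sketched in a few lines. The code below uses the usual definitions: MAE as the mean per-pixel absolute difference between the predicted map and the binary ground truth, and the F-measure with the weighting beta squared set to 0.3 that is conventional in SOD evaluation. The sweep of 255 thresholds and the toy arrays are illustrative assumptions, not the paper's evaluation protocol.

```python
def mean_absolute_error(pred, gt):
    """MAE between a saliency map in [0, 1] and a binary ground-truth mask."""
    total = sum(abs(p - g)
                for p_row, g_row in zip(pred, gt)
                for p, g in zip(p_row, g_row))
    count = sum(len(row) for row in gt)
    return total / count


def f_measure(pred, gt, threshold=0.5, beta_sq=0.3):
    """F-measure of the map binarized at `threshold`."""
    tp = fp = fn = 0
    for p_row, g_row in zip(pred, gt):
        for p, g in zip(p_row, g_row):
            predicted = p >= threshold
            if predicted and g == 1:
                tp += 1
            elif predicted:
                fp += 1
            elif g == 1:
                fn += 1
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return (1 + beta_sq) * precision * recall / (beta_sq * precision + recall)


def max_f_measure(pred, gt, steps=255):
    """Best F-measure over a sweep of evenly spaced thresholds."""
    return max(f_measure(pred, gt, threshold=t / steps)
               for t in range(1, steps + 1))


# Toy prediction and ground truth, made up for illustration.
pred = [[0.9, 0.8, 0.2],
        [0.7, 0.4, 0.1]]
gt = [[1, 1, 0],
      [1, 1, 0]]

print("MAE:", mean_absolute_error(pred, gt))
print("F-measure at 0.5:", f_measure(pred, gt))
print("max F-measure:", max_f_measure(pred, gt))
```

The gap between the fixed-threshold and maximum F-measure in this toy case shows why the referee asks for uniform reporting: the same prediction can look noticeably better or worse depending on which variant a table uses.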

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive assessment of GLASSNet and the recommendation for minor revision. The recognition of our parameter-efficient adaptation strategy and the global-local decoder split is appreciated, as these elements aim to demonstrate practical benefits of frozen foundation models for dense prediction tasks.

Circularity Check

0 steps flagged

No significant circularity in derivation or validation chain

full rationale

The paper introduces GLASSNet as an architectural proposal: a frozen SAMv2 encoder augmented by a lightweight spatially-aware convolutional adapter and a dual-decoder (global semantics + local detail) fusion module. All load-bearing claims are empirical, resting on benchmark tables comparing against prior SOTA on standard SOD and camouflaged-object datasets. No equations, predictions, or uniqueness theorems are defined in terms of the target outputs; no self-citations serve as the sole justification for core premises; and no fitted parameters are relabeled as independent predictions. The validation is externally falsifiable via the reported metrics and is therefore independent of the model's own construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The approach rests on the domain assumption that SAMv2 features remain useful for saliency when only a small adapter is trained, plus the usual deep-learning premise that benchmark performance indicates generalization. No explicit free parameters or invented entities are named in the abstract.

axioms (1)
  • domain assumption: SAMv2 encoder features are sufficiently general for SOD when paired with a lightweight adapter
    The design freezes the encoder and relies on the adapter to bridge the domain gap.

pith-pipeline@v0.9.0 · 5499 in / 1143 out tokens · 63981 ms · 2026-05-08T18:12:28.772286+00:00 · methodology


