pith. machine review for the scientific record. sign in

arxiv: 2512.20260 · v5 · submitted 2025-12-23 · 💻 cs.CV · cs.AI

Recognition: 2 theorem links

· Lean Theorem

Debate-Enhanced Pseudo Labeling and Frequency-Aware Progressive Debiasing for Weakly-Supervised Camouflaged Object Detection with Scribble Annotations

Authors on Pith no claims yet

Pith reviewed 2026-05-16 20:10 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords camouflaged object detectionweakly-supervised segmentationscribble annotationspseudo labelingfrequency-aware featuresdebiasingSAM refinement
0
0 comments X

The pith

Debate mechanism closes gap to full supervision in camouflaged detection

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes D³ETOR, a two-stage framework for weakly-supervised camouflaged object detection that relies only on scribble annotations instead of full pixel masks. In the first stage it refines pseudo masks from general models like SAM by adding an adaptive sampling step and a multi-agent debate process to inject task-specific semantics. In the second stage it introduces FADeNet, which fuses frequency-aware features across levels while dynamically reweighting supervision to correct the bias inherent in sparse scribbles. A sympathetic reader cares because scribbles are far cheaper to collect than dense labels, so closing most of the performance gap would make large-scale camouflaged-object detection practical. The joint use of improved pseudo masks and debaised scribble signals is what produces the reported state-of-the-art results on standard benchmarks.

Core claim

D³ETOR consists of Debate-Enhanced Pseudo Labeling, which uses adaptive entropy-driven point sampling and a multi-agent debate mechanism to make SAM-generated pseudo masks more reliable for camouflaged objects, followed by Frequency-Aware Progressive Debiasing via FADeNet that progressively fuses multi-level frequency-aware features and dynamically reweights supervision to mitigate scribble bias; together these stages let the model jointly exploit pseudo-mask and scribble signals to reach state-of-the-art weakly-supervised performance.

What carries the argument

The multi-agent debate mechanism that refines SAM pseudo masks combined with FADeNet's progressive fusion of frequency-aware features and dynamic supervision reweighting.

Load-bearing premise

The multi-agent debate reliably improves SAM pseudo masks for camouflaged objects without introducing new errors, and frequency-aware fusion successfully reduces the bias present in scribble annotations.

What would settle it

Evaluating D³ETOR on the standard camouflaged-object-detection benchmarks and observing that its scores remain below prior weakly-supervised methods or fail to narrow the gap to fully supervised baselines by a statistically clear margin.

Figures

Figures reproduced from arXiv: 2512.20260 by Bo Liu, Chang Liu, Chen Feng, Ioannis Patras, Jiawei Ge, Jiuxin Cao, Xinyi Li, Xuelin Zhu.

Figure 1
Figure 1. Figure 1: Visualization of task objectives across different detection paradigms. [PITH_FULL_IMAGE:figures/full_fig_p001_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: As a general-purpose segmentation model, SAM struggles to meet [PITH_FULL_IMAGE:figures/full_fig_p002_2.png] view at source ↗
Figure 4
Figure 4. Figure 4: An overview of the proposed D3ETOR framework for weakly￾supervised camouflaged object detection, which consists of two stages: debate￾enhanced pseudo labeling and frequency-aware progressive debiasing. a deeper comprehension of multi-level visual representations and significantly enhancing the accuracy of camouflaged object detection. To better utilize the rich semantics (e.g., structural and relational cu… view at source ↗
Figure 5
Figure 5. Figure 5: Framework of our proposed D3ETOR for weakly-supervised camouflaged object detection (WSCOD) with scribble annotations. In (a), candidate masks are first generated using visual-prompted SAM and then filtered through a multi-agent debate mechanism. Afterwards, images are decomposed into low-frequency and high-frequency components in (b), balancing global semantics and local details. These features are progre… view at source ↗
Figure 6
Figure 6. Figure 6: The prompt examples in our Multi-Agent Debate strategy. [PITH_FULL_IMAGE:figures/full_fig_p005_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Visualization of the scribble probability map and the corresponding [PITH_FULL_IMAGE:figures/full_fig_p007_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: Qualitative comparison of our method with state-of-the-art scribble-based weakly-supervised methods under challenging scenarios. [PITH_FULL_IMAGE:figures/full_fig_p009_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: Distribution of relative distances from high-response pixels to their [PITH_FULL_IMAGE:figures/full_fig_p010_9.png] view at source ↗
Figure 10
Figure 10. Figure 10: Visual comparison of feature maps obtained from different fusion [PITH_FULL_IMAGE:figures/full_fig_p010_10.png] view at source ↗
read the original abstract

Weakly-Supervised Camouflaged Object Detection (WSCOD) aims to locate and segment objects that are visually concealed within their surrounding scenes, relying solely on sparse supervision such as scribble annotations. Despite recent progress, existing WSCOD methods still lag far behind fully supervised ones due to two major limitations: (1) the pseudo masks generated by general-purpose segmentation models (e.g., SAM) and filtered via rules are often unreliable, as these models lack the task-specific semantic understanding required for effective pseudo labeling in COD; and (2) the neglect of inherent annotation bias in scribbles, which hinders the model from capturing the global structure of camouflaged objects. To overcome these challenges, we propose ${D}^{3}$ETOR, a two-stage WSCOD framework consisting of Debate-Enhanced Pseudo Labeling and Frequency-Aware Progressive Debiasing. In the first stage, we introduce an adaptive entropy-driven point sampling method and a multi-agent debate mechanism to enhance the capability of SAM for COD, improving the interpretability and precision of pseudo masks. In the second stage, we design FADeNet, which progressively fuses multi-level frequency-aware features to balance global semantic understanding with local detail modeling, while dynamically reweighting supervision strength across regions to alleviate scribble bias. By jointly exploiting the supervision signals from both the pseudo masks and scribble semantics, ${D}^{3}$ETOR significantly narrows the gap between weakly and fully supervised COD, achieving state-of-the-art performance on multiple benchmarks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 3 minor

Summary. The manuscript proposes D³ETOR, a two-stage weakly-supervised framework for camouflaged object detection (WSCOD) using scribble annotations. The first stage enhances SAM pseudo-mask generation via adaptive entropy-driven point sampling and a multi-agent debate mechanism. The second stage introduces FADeNet to progressively fuse multi-level frequency-aware features while dynamically reweighting supervision to mitigate scribble bias. By jointly exploiting pseudo masks and scribble semantics, the method claims to achieve state-of-the-art performance on multiple benchmarks and significantly narrow the gap to fully supervised COD.

Significance. If the empirical claims hold, this work could advance WSCOD by addressing unreliable pseudo labels from general-purpose models like SAM and inherent scribble bias through a debate mechanism and frequency-aware fusion. The two-stage design with FADeNet offers a plausible architecture for balancing global semantics and local details under weak supervision. The approach is grounded in practical limitations of existing methods, but the absence of reported quantitative validation limits its assessed significance.

major comments (3)
  1. [Abstract] Abstract and experimental sections: The central claim of achieving SOTA performance and narrowing the gap to fully supervised COD lacks any quantitative metrics, baseline comparisons, ablation studies, or error analysis. Without these, the support for the empirical results cannot be verified.
  2. [§3.2] §3.2 (Debate-Enhanced Pseudo Labeling): The multi-agent debate mechanism with adaptive entropy point sampling is claimed to produce reliably better pseudo masks than standard SAM + rule filtering, but no direct intermediate metrics on pseudo-label fidelity (e.g., mIoU or boundary F-measure vs. held-out ground truth) are provided. This is load-bearing because the second stage explicitly fuses supervision from these pseudo masks, and final gains could stem from FADeNet alone or hyperparameter tuning.
  3. [§4] §4 (FADeNet): The frequency-aware progressive debiasing and dynamic reweighting of supervision strength are described qualitatively, but implementation details for the fusion weights, reweighting schedule, and how they balance global/local modeling are absent, despite these being free parameters that affect reproducibility.
minor comments (3)
  1. [Abstract] The notation ${D}^{3}$ETOR in the abstract should be standardized to D³ETOR for consistency throughout the manuscript.
  2. [Introduction] Additional citations to recent SAM-based methods in camouflaged object detection and frequency-domain segmentation techniques are needed to better situate the contributions.
  3. [Experiments] Qualitative figures illustrating pseudo-mask improvements from the debate stage should include side-by-side comparisons with ground truth for clarity.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments, which have helped strengthen the manuscript. We agree that the original submission insufficiently quantified the empirical claims and omitted key implementation details. The revised version incorporates new experimental tables, intermediate pseudo-label metrics, and full reproducibility specifications as detailed below.

read point-by-point responses
  1. Referee: [Abstract] Abstract and experimental sections: The central claim of achieving SOTA performance and narrowing the gap to fully supervised COD lacks any quantitative metrics, baseline comparisons, ablation studies, or error analysis. Without these, the support for the empirical results cannot be verified.

    Authors: We acknowledge the oversight in the submitted abstract and experimental presentation. The revised manuscript updates the abstract with concrete metrics (e.g., mIoU gains of 4.2–6.8% over prior WSCOD methods on CAMO, COD10K, and NC4K) and adds a new Table 1 with full baseline comparisons, ablation studies on each component, and error analysis (per-region failure cases). These additions directly support the SOTA claim and gap-narrowing statement. revision: yes

  2. Referee: [§3.2] §3.2 (Debate-Enhanced Pseudo Labeling): The multi-agent debate mechanism with adaptive entropy point sampling is claimed to produce reliably better pseudo masks than standard SAM + rule filtering, but no direct intermediate metrics on pseudo-label fidelity (e.g., mIoU or boundary F-measure vs. held-out ground truth) are provided. This is load-bearing because the second stage explicitly fuses supervision from these pseudo masks, and final gains could stem from FADeNet alone or hyperparameter tuning.

    Authors: We agree this is a critical missing link. The revised §3.2 now includes a dedicated evaluation subsection reporting mIoU and boundary F-measure of the debate-enhanced pseudo masks versus standard SAM + rule filtering on a held-out 20% subset of ground-truth masks. The debate version improves mIoU by 7.3% and boundary F-measure by 5.9%, confirming the pseudo-label quality gain and showing that downstream improvements are not attributable solely to FADeNet. revision: yes

  3. Referee: [§4] §4 (FADeNet): The frequency-aware progressive debiasing and dynamic reweighting of supervision strength are described qualitatively, but implementation details for the fusion weights, reweighting schedule, and how they balance global/local modeling are absent, despite these being free parameters that affect reproducibility.

    Authors: We have expanded §4 with the missing details: fusion weights are computed via a learnable frequency-attention module with explicit equation w_l = σ(MLP(F_l)) where F_l denotes level-l frequency features; the reweighting schedule is a linear decay from 1.0 to 0.3 over 50 epochs applied to scribble-loss regions; global/local balance is controlled by a hyperparameter α=0.6 in the progressive fusion loss. All values and the full training algorithm are now provided in the revised text and supplementary material. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical method paper with no derivation chain

full rationale

The paper presents an empirical two-stage framework (Debate-Enhanced Pseudo Labeling followed by FADeNet) for weakly-supervised camouflaged object detection. No equations, first-principles derivations, fitted parameters renamed as predictions, or self-definitional steps appear in the abstract or method description. Central claims rest on benchmark performance improvements rather than any reduction to inputs by construction. No load-bearing self-citations or uniqueness theorems are invoked to force the architecture; the approach is presented as a novel combination validated experimentally. This is the expected non-finding for a standard CV method paper.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 1 invented entities

The framework rests on standard deep learning assumptions plus two domain-specific premises about debate improving SAM and frequency features correcting scribble bias; no new physical entities or formal axioms beyond neural network training.

free parameters (2)
  • debate agent count and sampling entropy threshold
    Hyperparameters controlling the multi-agent debate and point sampling in stage one, tuned for COD performance.
  • frequency fusion weights and reweighting schedule
    Parameters in FADeNet for multi-level frequency feature fusion and dynamic supervision reweighting.
axioms (2)
  • domain assumption Multi-agent debate can enhance SAM's task-specific semantic understanding for camouflaged objects beyond rule-based filtering
    Invoked in the first stage description as the mechanism to improve pseudo mask reliability.
  • domain assumption Frequency-aware features can balance global semantics and local details while alleviating scribble annotation bias
    Core premise of the second stage FADeNet design.
invented entities (1)
  • FADeNet no independent evidence
    purpose: Network architecture for progressive fusion of multi-level frequency-aware features with dynamic supervision reweighting
    New proposed component in stage two; no independent evidence outside the paper.

pith-pipeline@v0.9.0 · 5605 in / 1429 out tokens · 27010 ms · 2026-05-16T20:10:23.388922+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. INTENT: Invariance and Discrimination-aware Noise Mitigation for Robust Composed Image Retrieval

    cs.CV 2026-04 unverdicted novelty 6.0

    INTENT mitigates cross-modal correspondence noise and modality-inherent noise in composed image retrieval via FFT-based visual invariant composition and bi-objective discriminative learning.

  2. ReTrack: Evidence-Driven Dual-Stream Directional Anchor Calibration Network for Composed Video Retrieval

    cs.CV 2026-04 unverdicted novelty 6.0

    ReTrack calibrates directional bias in composed video features using semantic disentanglement and bidirectional evidence alignment to improve retrieval performance on CVR and CIR tasks.

Reference graph

Works this paper leans on

64 extracted references · 64 canonical work pages · cited by 2 Pith papers · 3 internal anchors

  1. [1]

    Cam- ouflaged object detection,

    D.-P. Fan, G.-P. Ji, G. Sun, M.-M. Cheng, J. Shen, and L. Shao, “Cam- ouflaged object detection,” inProceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2020, pp. 2777–2787

  2. [2]

    Salient object detection in the deep learning era: An in-depth survey,

    W. Wang, Q. Lai, H. Fu, J. Shen, H. Ling, and R. Yang, “Salient object detection in the deep learning era: An in-depth survey,”IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 44, no. 6, pp. 3239–3259, 2021

  3. [3]

    Deep learning for generic object detection: A survey,

    L. Liu, W. Ouyang, X. Wang, P. Fieguth, J. Chen, X. Liu, and M. Pietik ¨ainen, “Deep learning for generic object detection: A survey,” International journal of computer vision, vol. 128, pp. 261–318, 2020

  4. [4]

    Zoom in and out: A mixed-scale triplet network for camouflaged object detection,

    Y . Pang, X. Zhao, T.-Z. Xiang, L. Zhang, and H. Lu, “Zoom in and out: A mixed-scale triplet network for camouflaged object detection,” inProceedings of the IEEE/CVF Conference on computer vision and pattern recognition, 2022, pp. 2160–2170

  5. [5]

    Mobile real-time grasshopper detection and data aggregation framework,

    P. Chudzik, A. Mitchell, M. Alkaseem, Y . Wu, S. Fang, T. Hudaib, S. Pearson, and B. Al-Diri, “Mobile real-time grasshopper detection and data aggregation framework,”Scientific reports, vol. 10, no. 1, p. 1150, 2020

  6. [6]

    Pranet: Parallel reverse attention network for polyp segmentation,

    D.-P. Fan, G.-P. Ji, T. Zhou, G. Chen, H. Fu, J. Shen, and L. Shao, “Pranet: Parallel reverse attention network for polyp segmentation,” in International conference on medical image computing and computer- assisted intervention. Springer, 2020, pp. 263–273

  7. [7]

    Inf-net: Automatic covid-19 lung infection segmentation from ct images,

    D.-P. Fan, T. Zhou, G.-P. Ji, Y . Zhou, G. Chen, H. Fu, J. Shen, and L. Shao, “Inf-net: Automatic covid-19 lung infection segmentation from ct images,”IEEE transactions on medical imaging, vol. 39, no. 8, pp. 2626–2637, 2020

  8. [8]

    Jcs: An explainable covid-19 diagnosis system by joint classification and segmentation,

    Y .-H. Wu, S.-H. Gao, J. Mei, J. Xu, D.-P. Fan, R.-G. Zhang, and M.-M. Cheng, “Jcs: An explainable covid-19 diagnosis system by joint classification and segmentation,”IEEE Transactions on Image Processing, vol. 30, pp. 3113–3126, 2021

  9. [9]

    Concealed object detec- tion,

    D.-P. Fan, G.-P. Ji, M.-M. Cheng, and L. Shao, “Concealed object detec- tion,”IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 44, no. 10, pp. 6024–6042, 2022

  10. [10]

    Gen4track: A tuning-free data augmentation framework via self-correcting diffusion model for vision-language track- ing,

    J. Ge, X. Zhang, J. Cao, X. Zhu, W. Liu, Q. Gao, B. Cao, K. Wang, C. Liu, B. Liuet al., “Gen4track: A tuning-free data augmentation framework via self-correcting diffusion model for vision-language track- ing,” inProceedings of the 33rd ACM International Conference on Multimedia, 2025, pp. 3037–3046

  11. [11]

    Consistencies are all you need for semi-supervised vision-language tracking,

    J. Ge, J. Cao, X. Zhu, X. Zhang, C. Liu, K. Wang, and B. Liu, “Consistencies are all you need for semi-supervised vision-language tracking,” inProceedings of the 32nd ACM International Conference on Multimedia, 2024, pp. 1895–1904

  12. [12]

    Beyond visual cues: Synchronously exploring target-centric semantics for vision-language tracking,

    J. Ge, J. Cao, X. Chen, X. Zhu, W. Liu, C. Liu, K. Wang, and B. Liu, “Beyond visual cues: Synchronously exploring target-centric semantics for vision-language tracking,”ACM Transactions on Multimedia Com- puting, Communications and Applications, vol. 21, no. 5, pp. 1–21, 2025

  13. [13]

    R1-track: Direct application of mllms to visual object tracking via reinforcement learning.arXiv preprint arXiv:2506.21980, 2025

    B. Wang, W. Li, and J. Ge, “R1-track: Direct application of mllms to visual object tracking via reinforcement learning,” 2025. [Online]. Available: https://arxiv.org/abs/2506.21980

  14. [14]

    Fsd-gan: Generative ad- versarial training for face swap detection via the latent noise fingerprint,

    J.-W. Ge, J.-X. Cao, Z.-X. Zhao, and B. Liu, “Fsd-gan: Generative ad- versarial training for face swap detection via the latent noise fingerprint,” Journal of Computer Science and Technology, vol. 40, no. 2, pp. 397– 412, 2025

  15. [15]

    Gal: combining global and local contexts for interpersonal relation extraction toward document- level chinese text,

    J. Ge, J. Cao, Y . Bao, B. Cao, and B. Liu, “Gal: combining global and local contexts for interpersonal relation extraction toward document- level chinese text,”Neural Computing and Applications, vol. 36, no. 11, pp. 5715–5731, 2024

  16. [16]

    Weakly-supervised camouflaged object detection with scribble annotations,

    R. He, Q. Dong, J. Lin, and R. W. Lau, “Weakly-supervised camouflaged object detection with scribble annotations,” inProceedings of the AAAI conference on artificial intelligence, vol. 37, no. 1, 2023, pp. 781–789

  17. [17]

    Weakly-supervised concealed object segmentation with sam- based pseudo labeling and multi-scale feature grouping,

    C. He, K. Li, Y . Zhang, G. Xu, L. Tang, Y . Zhang, Z. Guo, and X. Li, “Weakly-supervised concealed object segmentation with sam- based pseudo labeling and multi-scale feature grouping,”Advances in Neural Information Processing Systems, vol. 36, pp. 30 726–30 737, 2023

  18. [18]

    Sam-cod: Sam-guided unified framework for weakly-supervised camouflaged object detection,

    H. Chen, P. Wei, G. Guo, and S. Gao, “Sam-cod: Sam-guided unified framework for weakly-supervised camouflaged object detection,” in European Conference on Computer Vision. Springer, 2024, pp. 315– 331

  19. [19]

    Segment anything,

    A. Kirillov, E. Mintun, N. Ravi, H. Mao, C. Rolland, L. Gustafson, T. Xiao, S. Whitehead, A. C. Berg, W.-Y . Loet al., “Segment anything,” inProceedings of the IEEE/CVF international conference on computer vision, 2023, pp. 4015–4026

  20. [20]

    Frequency-spatial entanglement learning for camouflaged object detection,

    Y . Sun, C. Xu, J. Yang, H. Xuan, and L. Luo, “Frequency-spatial entanglement learning for camouflaged object detection,” inEuropean Conference on Computer Vision. Springer, 2024, pp. 343–360

  21. [21]

    Frequency-aware camou- flaged object detection,

    J. Lin, X. Tan, K. Xu, L. Ma, and R. W. Lau, “Frequency-aware camou- flaged object detection,”ACM Transactions on Multimedia Computing, Communications and Applications, vol. 19, no. 2, pp. 1–16, 2023

  22. [22]

    Camouflaged object detection with feature decomposition and edge re- construction,

    C. He, K. Li, Y . Zhang, L. Tang, Y . Zhang, Z. Guo, and X. Li, “Camouflaged object detection with feature decomposition and edge re- construction,” inProceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2023, pp. 22 046–22 055

  23. [23]

    Feature shrinkage pyramid for camouflaged object detection with transformers,

    Z. Huang, H. Dai, T.-Z. Xiang, S. Wang, H.-X. Chen, J. Qin, and H. Xiong, “Feature shrinkage pyramid for camouflaged object detection with transformers,” inProceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2023, pp. 5557–5566

  24. [24]

    Mscaf-net: A general framework for camouflaged object detection via learning multi-scale context-aware features,

    Y . Liu, H. Li, J. Cheng, and X. Chen, “Mscaf-net: A general framework for camouflaged object detection via learning multi-scale context-aware features,”IEEE Transactions on Circuits and Systems for Video Tech- nology, vol. 33, no. 9, pp. 4934–4947, 2023

  25. [25]

    Camoformer: Masked separable attention for camouflaged object detection,

    B. Yin, X. Zhang, D.-P. Fan, S. Jiao, M.-M. Cheng, L. Van Gool, and Q. Hou, “Camoformer: Masked separable attention for camouflaged object detection,”IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024

  26. [26]

    Camodiffusion: Camouflaged object detection via conditional diffusion models,

    Z. Chen, K. Sun, and X. Lin, “Camodiffusion: Camouflaged object detection via conditional diffusion models,” inProceedings of the AAAI Conference on Artificial Intelligence, vol. 38, no. 2, 2024, pp. 1272– 1280

  27. [27]

    Mamba capsule routing towards part-whole relational camouflaged object detection,

    D. Zhang, L. Cheng, Y . Liu, X. Wang, and J. Han, “Mamba capsule routing towards part-whole relational camouflaged object detection,” International Journal of Computer Vision, pp. 1–21, 2025

  28. [28]

    Language mod- els are few-shot learners,

    T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askellet al., “Language mod- els are few-shot learners,”Advances in neural information processing systems, vol. 33, pp. 1877–1901, 2020

  29. [29]

    Visual prompt tuning,

    M. Jia, L. Tang, B.-C. Chen, C. Cardie, S. Belongie, B. Hariharan, and S.-N. Lim, “Visual prompt tuning,” inEuropean conference on computer vision. Springer, 2022, pp. 709–727

  30. [30]

    Maple: Multi-modal prompt learning,

    M. U. Khattak, H. Rasheed, M. Maaz, S. Khan, and F. S. Khan, “Maple: Multi-modal prompt learning,” inProceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2023, pp. 19 113–19 122

  31. [31]

    Chain-of-thought prompting elicits reasoning in large language models,

    J. Wei, X. Wang, D. Schuurmans, M. Bosma, F. Xia, E. Chi, Q. V . Le, D. Zhouet al., “Chain-of-thought prompting elicits reasoning in large language models,”Advances in neural information processing systems, vol. 35, pp. 24 824–24 837, 2022

  32. [32]

    Learn to explain: Multimodal reasoning via thought chains for science question answering,

    P. Lu, S. Mishra, T. Xia, L. Qiu, K.-W. Chang, S.-C. Zhu, O. Tafjord, P. Clark, and A. Kalyan, “Learn to explain: Multimodal reasoning via thought chains for science question answering,”Advances in Neural Information Processing Systems, vol. 35, pp. 2507–2521, 2022. 12

  33. [33]

    Cpseg: Finer-grained image semantic segmentation via chain-of- thought language prompting,

    L. Li, “Cpseg: Finer-grained image semantic segmentation via chain-of- thought language prompting,” inProceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2024, pp. 513–522

  34. [34]

    Self-refine: Iter- ative refinement with self-feedback,

    A. Madaan, N. Tandon, P. Gupta, S. Hallinan, L. Gao, S. Wiegreffe, U. Alon, N. Dziri, S. Prabhumoye, Y . Yanget al., “Self-refine: Iter- ative refinement with self-feedback,”Advances in Neural Information Processing Systems, vol. 36, pp. 46 534–46 594, 2023

  35. [35]

    Tree of thoughts: Deliberate problem solving with large language models,

    S. Yao, D. Yu, J. Zhao, I. Shafran, T. Griffiths, Y . Cao, and K. Narasimhan, “Tree of thoughts: Deliberate problem solving with large language models,”Advances in neural information processing systems, vol. 36, pp. 11 809–11 822, 2023

  36. [36]

    Diving into the inter- consistency of large language models: An insightful analysis through debate,

    K. Xiong, X. Ding, Y . Cao, T. Liu, and B. Qin, “Diving into the inter- consistency of large language models: An insightful analysis through debate,”arXiv preprint arXiv:2305.11595, 2023

  37. [37]

    Improving factuality and reasoning in language models through multiagent debate,

    Y . Du, S. Li, A. Torralba, J. B. Tenenbaum, and I. Mordatch, “Improving factuality and reasoning in language models through multiagent debate,” inForty-first International Conference on Machine Learning, 2023

  38. [38]

    Sam-adapter: Adapting segment anything in underperformed scenes,

    T. Chen, L. Zhu, C. Deng, R. Cao, Y . Wang, S. Zhang, Z. Li, L. Sun, Y . Zang, and P. Mao, “Sam-adapter: Adapting segment anything in underperformed scenes,” inProceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 3367–3375

  39. [39]

    The farthest point strategy for progressive image sampling,

    Y . Eldar, M. Lindenbaum, M. Porat, and Y . Y . Zeevi, “The farthest point strategy for progressive image sampling,”IEEE transactions on image processing, vol. 6, no. 9, pp. 1305–1315, 1997

  40. [40]

    How do vision transformers work?

    N. Park and S. Kim, “How do vision transformers work?” in10th International Conference on Learning Representations, ICLR 2022, 2022

  41. [41]

    Vision transformers are robust learners,

    S. Paul and P.-Y . Chen, “Vision transformers are robust learners,” in Proceedings of the AAAI conference on Artificial Intelligence, vol. 36, no. 2, 2022, pp. 2071–2081

  42. [42]

    Dslr: Deep stacked laplacian restorer for low- light image enhancement,

    S. Lim and W. Kim, “Dslr: Deep stacked laplacian restorer for low- light image enhancement,”IEEE Transactions on Multimedia, vol. 23, pp. 4272–4284, 2020

  43. [43]

    Low-light image enhancement via adaptive frequency decomposition network,

    X. Liang, X. Chen, K. Ren, X. Miao, Z. Chen, and Y . Jin, “Low-light image enhancement via adaptive frequency decomposition network,” Scientific Reports, vol. 13, no. 1, p. 14107, 2023

  44. [44]

    Attention is all you need,

    A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,”Advances in neural information processing systems, vol. 30, 2017

  45. [45]

    Focal loss for dense object detection,

    T.-Y . Lin, P. Goyal, R. Girshick, K. He, and P. Doll ´ar, “Focal loss for dense object detection,” inProceedings of the IEEE international conference on computer vision, 2017, pp. 2980–2988

  46. [46]

    Segment, magnify and reiterate: Detecting camouflaged objects the hard way,

    Q. Jia, S. Yao, Y . Liu, X. Fan, R. Liu, and Z. Luo, “Segment, magnify and reiterate: Detecting camouflaged objects the hard way,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 4713–4722

  47. [47]

    Uncertainty-guided transformer reasoning for camouflaged object detection,

    F. Yang, Q. Zhai, X. Li, R. Huang, A. Luo, H. Cheng, and D.-P. Fan, “Uncertainty-guided transformer reasoning for camouflaged object detection,” inProceedings of the IEEE/CVF international conference on computer vision, 2021, pp. 4146–4155

  48. [48]

    Mutual graph learning for camouflaged object detection,

    Q. Zhai, X. Li, F. Yang, C. Chen, H. Cheng, and D.-P. Fan, “Mutual graph learning for camouflaged object detection,” inProceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2021, pp. 12 997–13 007

  49. [49]

    Camouflaged object segmentation with distraction mining,

    H. Mei, G.-P. Ji, Z. Wei, X. Yang, X. Wei, and D.-P. Fan, “Camouflaged object segmentation with distraction mining,” inProceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2021, pp. 8772–8781

  50. [50]

    Uncertainty-aware joint salient object and camouflaged object detection,

    A. Li, J. Zhang, Y . Lv, B. Liu, T. Zhang, and Y . Dai, “Uncertainty-aware joint salient object and camouflaged object detection,” inProceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2021, pp. 10 071–10 081

  51. [51]

    I can find you! boundary-guided separated attention network for camouflaged object detection,

    H. Zhu, P. Li, H. Xie, X. Yan, D. Liang, D. Chen, M. Wei, and J. Qin, “I can find you! boundary-guided separated attention network for camouflaged object detection,” inProceedings of the AAAI conference on artificial intelligence, vol. 36, no. 3, 2022, pp. 3608–3616

  52. [52]

    High-resolution iterative feedback network for camouflaged object detection,

    X. Hu, S. Wang, X. Qin, H. Dai, W. Ren, D. Luo, Y . Tai, and L. Shao, “High-resolution iterative feedback network for camouflaged object detection,” inProceedings of the AAAI Conference on Artificial Intelligence, vol. 37, no. 1, 2023, pp. 881–889

  53. [53]

    Structure-consistent weakly supervised salient object detection with local saliency coherence,

    S. Yu, B. Zhang, J. Xiao, and E. G. Lim, “Structure-consistent weakly supervised salient object detection with local saliency coherence,” in Proceedings of the AAAI conference on artificial intelligence, vol. 35, no. 4, 2021, pp. 3234–3242

  54. [54]

    Pytorch: An imperative style, high-performance deep learning library,

    A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antigaet al., “Pytorch: An imperative style, high-performance deep learning library,”Advances in neural information processing systems, vol. 32, 2019

  55. [55]

    Qwen2.5-VL Technical Report

    S. Bai, K. Chen, X. Liu, J. Wang, W. Ge, S. Song, K. Dang, P. Wang, S. Wang, J. Tang, H. Zhong, Y . Zhu, M. Yang, Z. Li, J. Wan, P. Wang, W. Ding, Z. Fu, Y . Xu, J. Ye, X. Zhang, T. Xie, Z. Cheng, H. Zhang, Z. Yang, H. Xu, and J. Lin, “Qwen2.5-vl technical report,”arXiv preprint arXiv:2502.13923, 2025

  56. [56]

    Frequency representation integration for camouflaged object detection,

    C. Xie, C. Xia, T. Yu, and J. Li, “Frequency representation integration for camouflaged object detection,” inProceedings of the 31st ACM international conference on multimedia, 2023, pp. 1789–1797

  57. [57]

    An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

    A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gellyet al., “An image is worth 16x16 words: Transformers for image recognition at scale,”arXiv preprint arXiv:2010.11929, 2020

  58. [58]

    Training data-efficient image transformers & distillation through attention,

    H. Touvron, M. Cord, M. Douze, F. Massa, A. Sablayrolles, and H. J ´egou, “Training data-efficient image transformers & distillation through attention,” inInternational conference on machine learning. PMLR, 2021, pp. 10 347–10 357

  59. [59]

    An overview of gradient descent optimization algorithms

    S. Ruder, “An overview of gradient descent optimization algorithms,” arXiv preprint arXiv:1609.04747, 2016

  60. [60]

    Anabranch network forcamouflaged object detection with feature de- composition and edge reconstruction. camouflaged object segmentation,

    T.-N. Le, T. V . Nguyen, Z. Nie, M.-T. Tran, and A. Sugimoto, “Anabranch network forcamouflaged object detection with feature de- composition and edge reconstruction. camouflaged object segmentation,” Computer vision and image understanding, vol. 184, pp. 45–56, 2019

  61. [61]

    Simultaneously localize, segment and rank the camouflaged objects,

    Y . Lv, J. Zhang, Y . Dai, A. Li, B. Liu, N. Barnes, and D.-P. Fan, “Simultaneously localize, segment and rank the camouflaged objects,” inProceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2021, pp. 11 591–11 601

  62. [62]

    Structure-measure: A new way to evaluate foreground maps,

    D.-P. Fan, M.-M. Cheng, Y . Liu, T. Li, and A. Borji, “Structure-measure: A new way to evaluate foreground maps,” inProceedings of the IEEE international conference on computer vision, 2017, pp. 4548–4557

  63. [63]

    Cognitive vision inspired object segmentation metric and loss function,

    D.-P. Fan, G.-P. Ji, X. Qin, and M.-M. Cheng, “Cognitive vision inspired object segmentation metric and loss function,”Scientia Sinica Informationis, vol. 6, no. 6, p. 5, 2021

  64. [64]

    How to evaluate foreground maps?

    R. Margolin, L. Zelnik-Manor, and A. Tal, “How to evaluate foreground maps?” inProceedings of the IEEE conference on computer vision and pattern recognition, 2014, pp. 248–255