Aqua Boundary-Saliency Attention Module for Lightweight Underwater Salient Instance Segmentation Detection Transformer

Julian Supardi; M. Fazri Nizar; Muhammad Naufal Rachmatullah

arxiv: 2606.08002 · v1 · pith:WT5GUVNHnew · submitted 2026-06-06 · 💻 cs.CV

Aqua Boundary-Saliency Attention Module for Lightweight Underwater Salient Instance Segmentation Detection Transformer

M. Fazri Nizar , Julian Supardi , Muhammad Naufal Rachmatullah This is my paper

Pith reviewed 2026-06-27 20:12 UTC · model grok-4.3

classification 💻 cs.CV

keywords underwater instance segmentationsalient object detectiondetection transformerattention modulelightweight modelDINOv2 featuresreal-time inferencemarine perception

0 comments

The pith

Embedding underwater boundary and saliency cues into DINOv2 features lets a compact DETR match state-of-the-art segmentation accuracy at real-time speeds.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces LUSIS-DETR, a lightweight detection-transformer for underwater salient instance segmentation that avoids large foundation models and extra modalities. Its Aqua Boundary-Saliency Attention Module incorporates boundary, contrast, attenuation, chroma, dark-channel, and center-prior cues into DINOv2 multi-scale features through bounded residual modulation, with auxiliary mask supervision and small-object copy-paste used only at training time. Evaluations across four underwater datasets demonstrate competitively leading results under both category-aware and salient-instance protocols. The resulting model achieves 4.31-6.34 ms latency in TensorRT FP16 on an NVIDIA T4 GPU. A sympathetic reader would care because the design targets practical deployment for marine resource exploration, ecological monitoring, and robotic perception where efficiency matters.

Core claim

The central claim is that the AquaBSAM module successfully embeds the listed underwater cues into DINOv2-initialized multi-scale features via bounded residual modulation, enabling the compact LUSIS-DETR framework to deliver competitively leading performance on underwater instance segmentation without reliance on large foundation models, prompt generation, or auxiliary modalities, while supporting real-time inference.

What carries the argument

Aqua Boundary-Saliency Attention Module (AquaBSAM), which integrates boundary, contrast, attenuation, chroma, dark-channel, and center-prior cues into DINOv2 multi-scale features through bounded residual modulation.

If this is right

The method reduces dependence on large foundation models or extra modalities for underwater tasks.
Auxiliary training techniques improve results without affecting inference latency.
The framework supports real-time operation suitable for underwater robotic perception.
Results hold across both category-aware and salient-instance evaluation protocols on four datasets.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The cue-embedding strategy could be tested on other degraded visual domains such as fog or low-light scenes.
Bounded residual modulation might serve as a general pattern for injecting domain knowledge into vision transformers.
Replacing DINOv2 with alternative backbones could further tune the accuracy-speed trade-off.

Load-bearing premise

The listed underwater cues can be embedded into the DINOv2 features via bounded residual modulation in a way that produces the reported performance gains.

What would settle it

An ablation that disables AquaBSAM entirely and then re-measures mask quality and instance discrimination on UIIS, UIIS10K, USIS10K, and USIS16K; absence of a clear performance drop would falsify the module's contribution.

Figures

Figures reproduced from arXiv: 2606.08002 by Julian Supardi, M. Fazri Nizar, Muhammad Naufal Rachmatullah.

**Figure 1.** Figure 1: LUSIS-DETR framework. The same sample flows from RGB input to instance masks. AquaBSAM computes underwater cues and applies bounded [PITH_FULL_IMAGE:figures/full_fig_p003_1.png] view at source ↗

**Figure 2.** Figure 2: Qualitative examples. Each dataset contributes two high-agreement test images; rows show input, LUSIS-DETR prediction, and ground truth. Yellow [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗

read the original abstract

Underwater instance segmentation integrates pixel-level mask prediction and instance-level discrimination for marine resource exploration, ecological monitoring, and underwater robotic perception. Recent prompt-based and auxiliary-modality methods improve mask quality, but their reliance on large foundation models, prompt generation, or extra modality estimation complicates efficient deployment. This work introduces Lightweight Underwater Salient Instance Segmentation Detection Transformer (LUSIS-DETR), a compact detection-transformer framework built around the Aqua Boundary-Saliency Attention Module (AquaBSAM). AquaBSAM embeds underwater boundary, contrast, attenuation, chroma, dark-channel, and center-prior cues into DINOv2-initialized multi-scale features through bounded residual modulation, while auxiliary mask supervision and small-object copy-paste are training-only. Extensive evaluation on four recent underwater instance segmentation datasets, UIIS, UIIS10K, USIS10K, and USIS16K, shows competitively leading performance against previous state-of-the-art works across category-aware and salient-instance protocols. TensorRT half-precision (FP16) benchmarking on an NVIDIA T4 graphics processing unit (GPU) achieves 4.31-6.34 milliseconds (ms) latency, supporting real-time inference under an accessible reproduction setting.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper adds a domain-specific AquaBSAM module for underwater cues in a DETR setup, but DINOv2 initialization undercuts the no-large-models claim and no results or ablations are shown to support the performance assertions.

read the letter

The main things to know are that this work defines a new AquaBSAM module to fold boundary, contrast, attenuation, chroma, dark-channel, and center-prior cues into DINOv2 multi-scale features via bounded residual modulation, and it reports 4.31-6.34 ms TensorRT FP16 latency on T4. That is the concrete contribution.

The module description is clear on how the cues are injected and the training-only tricks (auxiliary masks, small-object copy-paste) are sensible for underwater data. The focus on real-time marine robotics is consistent.

The soft spots are the framing and the evidence. The abstract criticizes prior work for depending on large foundation models, yet the method starts from DINOv2 features; without an ablation that isolates the module's contribution from the backbone, the efficiency and novelty arguments do not hold up. More basically, the abstract supplies no numbers, baselines, or statistical comparisons for the four datasets, so the "competitively leading" claim cannot be evaluated.

This is for people already working on underwater instance segmentation or domain-adapted transformers who want a list of physical priors to try. A reader could extract the cue set and modulation idea, but the missing experiments make it hard to judge whether the approach is worth adopting.

I would not send it for peer review until the full paper shows the ablations and tables that separate AquaBSAM from DINOv2 and actually document the claimed gains.

Referee Report

2 major / 0 minor

Summary. The manuscript introduces LUSIS-DETR, a compact detection-transformer for underwater salient instance segmentation, centered on the AquaBSAM module. AquaBSAM embeds boundary, contrast, attenuation, chroma, dark-channel, and center-prior cues into DINOv2-initialized multi-scale features via bounded residual modulation; auxiliary mask supervision and small-object copy-paste are used only at training time. The central claim is that this yields competitively leading results on UIIS, UIIS10K, USIS10K, and USIS16K under both category-aware and salient-instance protocols, together with TensorRT FP16 latency of 4.31-6.34 ms on an NVIDIA T4 GPU, while avoiding large foundation models or extra modalities.

Significance. If the reported performance gains are shown to arise from the AquaBSAM cue modulation rather than the DINOv2 backbone, and if the quantitative results are supplied, the work would offer a practical route to real-time underwater instance segmentation on modest hardware, addressing deployment difficulties of prompt-based or auxiliary-modality approaches.

major comments (2)

[Abstract] Abstract: the claim of 'competitively leading performance' on four datasets is asserted without any quantitative metrics, baseline tables, ablation results, or statistical comparisons, leaving the central empirical claim unsupported by visible evidence.
[Abstract] Abstract: the positioning of LUSIS-DETR as avoiding large foundation models is contradicted by the explicit use of DINOv2-initialized features; without an ablation isolating the contribution of AquaBSAM from the DINOv2 backbone, the novelty and efficiency arguments rest on an untested separation.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address the two major comments on the abstract below and have revised the manuscript to incorporate quantitative support and clarify the role of DINOv2.

read point-by-point responses

Referee: [Abstract] Abstract: the claim of 'competitively leading performance' on four datasets is asserted without any quantitative metrics, baseline tables, ablation results, or statistical comparisons, leaving the central empirical claim unsupported by visible evidence.

Authors: We agree that the abstract would be strengthened by including key quantitative results. In the revised version, we have added specific metrics (e.g., mAP under category-aware and salient-instance protocols on UIIS, UIIS10K, USIS10K, and USIS16K, plus FP16 latency on T4) directly into the abstract, with explicit pointers to the corresponding tables and ablations in Sections 4 and 5. revision: yes
Referee: [Abstract] Abstract: the positioning of LUSIS-DETR as avoiding large foundation models is contradicted by the explicit use of DINOv2-initialized features; without an ablation isolating the contribution of AquaBSAM from the DINOv2 backbone, the novelty and efficiency arguments rest on an untested separation.

Authors: We accept that the original wording could be misinterpreted. DINOv2 serves only as a fixed feature initializer for the multi-scale backbone (analogous to ImageNet-pretrained ResNet in prior DETR variants), while inference remains lightweight and free of prompt-based foundation models or extra modalities. To directly address the isolation concern, we have added an ablation in the revised manuscript that compares the full AquaBSAM-equipped model against a DINOv2-only baseline and a DINOv2+standard attention variant, confirming the performance contribution of the cue-modulation mechanism. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical benchmarking with no derivation chain

full rationale

The paper introduces LUSIS-DETR and AquaBSAM as an architectural module that embeds listed underwater cues into DINOv2 features via bounded residual modulation. All central claims (leading performance on UIIS/UIIS10K/USIS10K/USIS16K, 4.31-6.34 ms latency) rest on empirical evaluation against prior SOTA under category-aware and salient-instance protocols. No equations, first-principles derivations, fitted parameters renamed as predictions, or self-citation load-bearing uniqueness theorems appear. The DINOv2 initialization is an explicit design choice, not a hidden tautology that forces the reported gains. The work is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 2 invented entities

Abstract-only review limits visibility into parameters and assumptions; the design implicitly assumes that the listed underwater cues can be usefully combined via residual modulation and that DINOv2 features form an adequate base for this modulation.

axioms (1)

domain assumption DINOv2 multi-scale features provide a suitable initialization that can be improved by bounded residual modulation with underwater cues
Invoked in the construction of AquaBSAM

invented entities (2)

AquaBSAM no independent evidence
purpose: Embed underwater boundary, contrast, attenuation, chroma, dark-channel, and center-prior cues into features
New module introduced by the paper
LUSIS-DETR no independent evidence
purpose: Compact detection-transformer framework for underwater salient instance segmentation
New framework built around AquaBSAM

pith-pipeline@v0.9.1-grok · 5754 in / 1351 out tokens · 25239 ms · 2026-06-27T20:12:19.817446+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

32 extracted references · 31 canonical work pages · 1 internal anchor

[1]

In: IEEE/CVF International Conference on Computer Vision

S. Lian, H. Li, R. Cong, S. Li, W. Zhang, and S. Kwong, “Watermask: Instance segmentation for underwater imagery,” inProc. IEEE/CVF International Conference on Computer Vision (ICCV), 2023, pp. 1305– 1315, doi: 10.1109/ICCV51070.2023.00126

work page doi:10.1109/iccv51070.2023.00126 2023
[2]

Diving into underwater: Segment anything model guided underwater salient instance segmentation and a large-scale dataset,

S. Lian, Z. Zhang, H. Li, W. Li, L. T. Yang, S. Kwong, and R. Cong, “Diving into underwater: Segment anything model guided underwater salient instance segmentation and a large-scale dataset,” inProc. Inter- national Conference on Machine Learning (ICML), 2024, pp. 29 545– 29 559, doi: 10.48550/arXiv.2406.06039

work page doi:10.48550/arxiv.2406.06039 2024
[3]

Advancing marine research: Uwsam framework and uiis10k dataset for precise underwater instance segmentation,

H. Li, S. Lian, Z. Li, R. Cong, C. Li, L. T. Yang, W. Zhang, and S. Kwong, “Advancing marine research: Uwsam framework and uiis10k dataset for precise underwater instance segmentation,”arXiv preprint arXiv:2505.15581, 2025, doi: 10.48550/arXiv.2505.15581

work page doi:10.48550/arxiv.2505.15581 2025
[4]

Usis16k: High-quality dataset for underwater salient instance segmentation,

L. Hong, X. Wang, Y . Li, and X. Wang, “Usis16k: High-quality dataset for underwater salient instance segmentation,”arXiv preprint arXiv:2506.19472, 2025, doi: 10.48550/arXiv.2506.19472

work page doi:10.48550/arxiv.2506.19472 2025
[5]

Empowering dino represen- tations for underwater instance segmentation via aligner and prompter,

Z. Chen, C. Zhang, H. Fang, and R. Cong, “Empowering dino represen- tations for underwater instance segmentation via aligner and prompter,” Proceedings of the AAAI Conference on Artificial Intelligence, vol. 40, no. 5, pp. 3201–3209, 2026, doi: 10.1609/aaai.v40i5.37314

work page doi:10.1609/aaai.v40i5.37314 2026
[6]

Uis-mamba: Exploring mamba for underwater instance segmentation via dynamic tree scan and hidden state weaken,

R. Cong, Z. Yu, H. Fang, H. Sun, and S. Kwong, “Uis-mamba: Exploring mamba for underwater instance segmentation via dynamic tree scan and hidden state weaken,” inProc. 33rd ACM Interna- tional Conference on Multimedia (ACM MM), 2025, pp. 343–352, doi: 10.1145/3746027.3755131

work page doi:10.1145/3746027.3755131 2025
[7]

Depth-guided cross- modal fusion network for underwater salient instance segmentation,

S. Zheng, X. Zhou, L. Bao, X. Hu, and J. Zhang, “Depth-guided cross- modal fusion network for underwater salient instance segmentation,” Symmetry, vol. 18, no. 5, p. 799, 2026, doi: 10.3390/sym18050799

work page doi:10.3390/sym18050799 2026
[8]

Depth Anything V2 , url =

L. Yang, B. Kang, Z. Huang, Z. Zhao, X. Xu, J. Feng, and H. Zhao, “Depth anything v2,” inAdvances in Neural Information Processing Systems 37, 2024, pp. 21 875–21 911, doi: 10.52202/079017-0688

work page doi:10.52202/079017-0688 2024
[9]

Rf- detr: Neural architecture search for real-time detection transformers,

I. Robinson, P. Robicheaux, M. Popov, D. Ramanan, and N. Peri, “Rf- detr: Neural architecture search for real-time detection transformers,” inProc. International Conference on Learning Representations (ICLR), 2026, doi: 10.48550/arXiv.2511.09554

work page doi:10.48550/arxiv.2511.09554 2026
[10]

Microsoft coco: Common objects in context,

T.-Y . Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Doll ´ar, and C. L. Zitnick, “Microsoft coco: Common objects in context,” inComputer Vision – ECCV 2014, 2014, pp. 740–755, doi: 10.1007/978-3-319-10602-1 48

work page doi:10.1007/978-3-319-10602-1 2014
[11]

U-net: Convolutional networks for biomedical image segmentation

O. Ronneberger, P. Fischer, and T. Brox, “U-net: Convolutional networks for biomedical image segmentation,” inMedical Image Computing and Computer-Assisted Intervention – MICCAI 2015, 2015, pp. 234–241, doi: 10.1007/978-3-319-24574-4 28

work page doi:10.1007/978-3-319-24574-4 2015
[12]

Feature pyramid networks for object detection,

T.-Y . Lin, P. Doll´ar, R. Girshick, K. He, B. Hariharan, and S. Belongie, “Feature pyramid networks for object detection,” inProc. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 936–944, doi: 10.1109/CVPR.2017.106

work page doi:10.1109/cvpr.2017.106 2017
[13]

Zavrtanik, M

Z. Liu, Y . Lin, Y . Cao, H. Hu, Y . Wei, Z. Zhang, S. Lin, and B. Guo, “Swin transformer: Hierarchical vision transformer using shifted win- dows,” inProc. IEEE/CVF International Conference on Computer Vision (ICCV), 2021, pp. 9992–10 002, doi: 10.1109/ICCV48922.2021.00986

work page doi:10.1109/iccv48922.2021.00986 2021
[14]

Conditional convolutions for instance segmentation,

Z. Tian, C. Shen, and H. Chen, “Conditional convolutions for instance segmentation,” inComputer Vision – ECCV 2020, 2020, pp. 282–298, doi: 10.1007/978-3-030-58452-8 17

work page doi:10.1007/978-3-030-58452-8 2020
[15]

Zavrtanik, M

Y . Fang, S. Yang, X. Wang, Y . Li, C. Fang, Y . Shan, B. Feng, and W. Liu, “Instances as queries,” inProc. IEEE/CVF International Conference on Computer Vision (ICCV), 2021, pp. 6890–6899, doi: 10.1109/ICCV48922.2021.00683

work page doi:10.1109/iccv48922.2021.00683 2021
[17]

Cascade r-cnn: Delving into high qual- ity object detection,

Z. Cai and N. Vasconcelos, “Cascade r-cnn: Delving into high qual- ity object detection,” inProc. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 6154–6162, doi: 10.1109/CVPR.2018.00644

work page doi:10.1109/cvpr.2018.00644 2018
[18]

Boundary-preserving mask r-cnn,

T. Cheng, X. Wang, L. Huang, and W. Liu, “Boundary-preserving mask r-cnn,” inComputer Vision – ECCV 2020, 2020, pp. 660–676, doi: 10.1007/978-3-030-58568-6 39

work page doi:10.1007/978-3-030-58568-6 2020
[19]

Understanding the User Experience of Customer Service Chatbots: What Can We Learn from Customer Satisfaction Surveys?Chatbot Re- search and Design (CONVERSATIONS 2020)

N. Carion, F. Massa, G. Synnaeve, N. Usunier, A. Kirillov, and S. Zagoruyko, “End-to-end object detection with transformers,” in Computer Vision – ECCV 2020, 2020, pp. 213–229, doi: 10.1007/978- 3-030-58452-8 13

work page doi:10.1007/978- 2020
[20]

URL https://proceedings.mlr

Y . Zhao, W. Lv, S. Xu, J. Wei, G. Wang, Q. Dang, Y . Liu, and J. Chen, “Detrs beat yolos on real-time object detection,” inProc. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024, pp. 16 965–16 974, doi: 10.1109/CVPR52733.2024.01605

work page doi:10.1109/cvpr52733.2024.01605 2024
[21]

In: IEEE/CVF Conference on Computer Vision and Pattern Recognition

B. Cheng, I. Misra, A. G. Schwing, A. Kirillov, and R. Gird- har, “Masked-attention mask transformer for universal image seg- mentation,” inProc. IEEE/CVF Conference on Computer Vi- sion and Pattern Recognition (CVPR), 2022, pp. 1280–1289, doi: 10.1109/CVPR52688.2022.00135

work page doi:10.1109/cvpr52688.2022.00135 2022
[22]

In: IEEE/CVF International Conference on Computer Vision

A. Kirillov, E. Mintun, N. Ravi, H. Mao, C. Rolland, L. Gustafson, T. Xiao, S. Whitehead, A. C. Berg, W.-Y . Lo, P. Doll ´ar, and R. Girshick, “Segment anything,” inProc. IEEE/CVF International Conference on Computer Vision (ICCV), 2023, pp. 3992–4003, doi: 10.1109/ICCV51070.2023.00371

work page doi:10.1109/iccv51070.2023.00371 2023
[23]

URL https://proceedings.mlr

Y . Xiong, B. Varadarajan, L. Wu, X. Xiang, F. Xiao, C. Zhu, X. Dai, D. Wang, F. Sun, F. Iandola, R. Krishnamoorthi, and V . Chandra, “Efficientsam: Leveraged masked image pretraining for efficient seg- ment anything,” inProc. IEEE/CVF Conference on Computer Vi- sion and Pattern Recognition (CVPR), 2024, pp. 16 111–16 121, doi: 10.1109/CVPR52733.2024.01525

work page doi:10.1109/cvpr52733.2024.01525 2024
[24]

DINOv2: Learning Robust Visual Features without Supervision

M. Oquab, T. Darcet, T. Moutakanni, H. V . V o, M. Szafraniec, V . Khali- dov, P. Fernandez, D. Haziza, F. Massa, A. El-Nouby, M. Assran, N. Ballas, W. Galuba, R. Howes, P.-Y . Huang, S.-W. Li, I. Misra, M. Rabbat, V . Sharma, G. Synnaeve, H. Xu, H. J ´egou, J. Mairal, P. Labatut, A. Joulin, and P. Bojanowski, “Dinov2: Learning robust visual features with...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2304.07193 2023
[25]

In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)

D. Bolya, C. Zhou, F. Xiao, and Y . J. Lee, “Yolact: Real- time instance segmentation,” inProc. IEEE/CVF International Con- ference on Computer Vision (ICCV), 2019, pp. 9156–9165, doi: 10.1109/ICCV .2019.00925

work page doi:10.1109/iccv 2019
[26]

Solo: Segmenting objects by locations,

X. Wang, T. Kong, C. Shen, Y . Jiang, and L. Li, “Solo: Segmenting objects by locations,” inComputer Vision – ECCV 2020, 2020, pp. 649– 665, doi: 10.1007/978-3-030-58523-5 38

work page doi:10.1007/978-3-030-58523-5 2020
[27]

In: 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp

A. Kirillov, Y . Wu, K. He, and R. Girshick, “Pointrend: Image seg- mentation as rendering,” inProc. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020, pp. 9796–9805, doi: 10.1109/CVPR42600.2020.00982

work page doi:10.1109/cvpr42600.2020.00982 2020
[28]

Mask scoring r-cnn,

Z. Huang, L. Huang, Y . Gong, C. Huang, and X. Wang, “Mask scoring r-cnn,” inProc. IEEE/CVF Conference on Computer Vi- sion and Pattern Recognition (CVPR), 2019, pp. 6402–6411, doi: 10.1109/CVPR.2019.00657

work page doi:10.1109/cvpr.2019.00657 2019
[29]

In: IEEE/CVF Conference on Computer Vision and Pattern Recognition

T. Cheng, X. Wang, S. Chen, W. Zhang, Q. Zhang, C. Huang, Z. Zhang, and W. Liu, “Sparse instance activation for real-time in- stance segmentation,” inProc. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022, pp. 4423–4432, doi: 10.1109/CVPR52688.2022.00439

work page doi:10.1109/cvpr52688.2022.00439 2022
[30]

In: IEEE/CVF Conference on Computer Vision and Pattern Recognition

Z. Liu, H. Mao, C.-Y . Wu, C. Feichtenhofer, T. Darrell, and S. Xie, “A convnet for the 2020s,” inProc. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022, pp. 11 966–11 976, doi: 10.1109/CVPR52688.2022.01167

work page doi:10.1109/cvpr52688.2022.01167 2022
[31]

The lovasz-softmax loss: A tractable surrogate for the optimization of the intersection-over- union measure in neural networks,

M. Berman, A. R. Triki, and M. B. Blaschko, “The lovasz-softmax loss: A tractable surrogate for the optimization of the intersection-over- union measure in neural networks,” inProc. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 4413– 4421, doi: 10.1109/CVPR.2018.00464

work page doi:10.1109/cvpr.2018.00464 2018
[32]

Bipa: Bilevel prompt adaptation for underwater instance segmentation,

L. Ma, H. Zheng, Y . Mao, J. Liu, C. Xu, X. Xue, Y . Wang, X. He, and W. Wang, “Bipa: Bilevel prompt adaptation for underwater instance segmentation,” inProc. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2026, pp. 10 731–10 740

2026
[33]

Usis- pgm: Photometric gaussian mixtures for underwater salient in- stance segmentation,

L. Hong, X. Yao, M. Bozkurt, X. Wang, and F. Zhang, “Usis- pgm: Photometric gaussian mixtures for underwater salient in- stance segmentation,”arXiv preprint arXiv:2603.13961, 2026, doi: 10.48550/arXiv.2603.13961

work page doi:10.48550/arxiv.2603.13961 2026

[1] [1]

In: IEEE/CVF International Conference on Computer Vision

S. Lian, H. Li, R. Cong, S. Li, W. Zhang, and S. Kwong, “Watermask: Instance segmentation for underwater imagery,” inProc. IEEE/CVF International Conference on Computer Vision (ICCV), 2023, pp. 1305– 1315, doi: 10.1109/ICCV51070.2023.00126

work page doi:10.1109/iccv51070.2023.00126 2023

[2] [2]

Diving into underwater: Segment anything model guided underwater salient instance segmentation and a large-scale dataset,

S. Lian, Z. Zhang, H. Li, W. Li, L. T. Yang, S. Kwong, and R. Cong, “Diving into underwater: Segment anything model guided underwater salient instance segmentation and a large-scale dataset,” inProc. Inter- national Conference on Machine Learning (ICML), 2024, pp. 29 545– 29 559, doi: 10.48550/arXiv.2406.06039

work page doi:10.48550/arxiv.2406.06039 2024

[3] [3]

Advancing marine research: Uwsam framework and uiis10k dataset for precise underwater instance segmentation,

H. Li, S. Lian, Z. Li, R. Cong, C. Li, L. T. Yang, W. Zhang, and S. Kwong, “Advancing marine research: Uwsam framework and uiis10k dataset for precise underwater instance segmentation,”arXiv preprint arXiv:2505.15581, 2025, doi: 10.48550/arXiv.2505.15581

work page doi:10.48550/arxiv.2505.15581 2025

[4] [4]

Usis16k: High-quality dataset for underwater salient instance segmentation,

L. Hong, X. Wang, Y . Li, and X. Wang, “Usis16k: High-quality dataset for underwater salient instance segmentation,”arXiv preprint arXiv:2506.19472, 2025, doi: 10.48550/arXiv.2506.19472

work page doi:10.48550/arxiv.2506.19472 2025

[5] [5]

Empowering dino represen- tations for underwater instance segmentation via aligner and prompter,

Z. Chen, C. Zhang, H. Fang, and R. Cong, “Empowering dino represen- tations for underwater instance segmentation via aligner and prompter,” Proceedings of the AAAI Conference on Artificial Intelligence, vol. 40, no. 5, pp. 3201–3209, 2026, doi: 10.1609/aaai.v40i5.37314

work page doi:10.1609/aaai.v40i5.37314 2026

[6] [6]

Uis-mamba: Exploring mamba for underwater instance segmentation via dynamic tree scan and hidden state weaken,

R. Cong, Z. Yu, H. Fang, H. Sun, and S. Kwong, “Uis-mamba: Exploring mamba for underwater instance segmentation via dynamic tree scan and hidden state weaken,” inProc. 33rd ACM Interna- tional Conference on Multimedia (ACM MM), 2025, pp. 343–352, doi: 10.1145/3746027.3755131

work page doi:10.1145/3746027.3755131 2025

[7] [7]

Depth-guided cross- modal fusion network for underwater salient instance segmentation,

S. Zheng, X. Zhou, L. Bao, X. Hu, and J. Zhang, “Depth-guided cross- modal fusion network for underwater salient instance segmentation,” Symmetry, vol. 18, no. 5, p. 799, 2026, doi: 10.3390/sym18050799

work page doi:10.3390/sym18050799 2026

[8] [8]

Depth Anything V2 , url =

L. Yang, B. Kang, Z. Huang, Z. Zhao, X. Xu, J. Feng, and H. Zhao, “Depth anything v2,” inAdvances in Neural Information Processing Systems 37, 2024, pp. 21 875–21 911, doi: 10.52202/079017-0688

work page doi:10.52202/079017-0688 2024

[9] [9]

Rf- detr: Neural architecture search for real-time detection transformers,

I. Robinson, P. Robicheaux, M. Popov, D. Ramanan, and N. Peri, “Rf- detr: Neural architecture search for real-time detection transformers,” inProc. International Conference on Learning Representations (ICLR), 2026, doi: 10.48550/arXiv.2511.09554

work page doi:10.48550/arxiv.2511.09554 2026

[10] [10]

Microsoft coco: Common objects in context,

T.-Y . Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Doll ´ar, and C. L. Zitnick, “Microsoft coco: Common objects in context,” inComputer Vision – ECCV 2014, 2014, pp. 740–755, doi: 10.1007/978-3-319-10602-1 48

work page doi:10.1007/978-3-319-10602-1 2014

[11] [11]

U-net: Convolutional networks for biomedical image segmentation

O. Ronneberger, P. Fischer, and T. Brox, “U-net: Convolutional networks for biomedical image segmentation,” inMedical Image Computing and Computer-Assisted Intervention – MICCAI 2015, 2015, pp. 234–241, doi: 10.1007/978-3-319-24574-4 28

work page doi:10.1007/978-3-319-24574-4 2015

[12] [12]

Feature pyramid networks for object detection,

T.-Y . Lin, P. Doll´ar, R. Girshick, K. He, B. Hariharan, and S. Belongie, “Feature pyramid networks for object detection,” inProc. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 936–944, doi: 10.1109/CVPR.2017.106

work page doi:10.1109/cvpr.2017.106 2017

[13] [13]

Zavrtanik, M

Z. Liu, Y . Lin, Y . Cao, H. Hu, Y . Wei, Z. Zhang, S. Lin, and B. Guo, “Swin transformer: Hierarchical vision transformer using shifted win- dows,” inProc. IEEE/CVF International Conference on Computer Vision (ICCV), 2021, pp. 9992–10 002, doi: 10.1109/ICCV48922.2021.00986

work page doi:10.1109/iccv48922.2021.00986 2021

[14] [14]

Conditional convolutions for instance segmentation,

Z. Tian, C. Shen, and H. Chen, “Conditional convolutions for instance segmentation,” inComputer Vision – ECCV 2020, 2020, pp. 282–298, doi: 10.1007/978-3-030-58452-8 17

work page doi:10.1007/978-3-030-58452-8 2020

[15] [15]

Zavrtanik, M

Y . Fang, S. Yang, X. Wang, Y . Li, C. Fang, Y . Shan, B. Feng, and W. Liu, “Instances as queries,” inProc. IEEE/CVF International Conference on Computer Vision (ICCV), 2021, pp. 6890–6899, doi: 10.1109/ICCV48922.2021.00683

work page doi:10.1109/iccv48922.2021.00683 2021

[16] [17]

Cascade r-cnn: Delving into high qual- ity object detection,

Z. Cai and N. Vasconcelos, “Cascade r-cnn: Delving into high qual- ity object detection,” inProc. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 6154–6162, doi: 10.1109/CVPR.2018.00644

work page doi:10.1109/cvpr.2018.00644 2018

[17] [18]

Boundary-preserving mask r-cnn,

T. Cheng, X. Wang, L. Huang, and W. Liu, “Boundary-preserving mask r-cnn,” inComputer Vision – ECCV 2020, 2020, pp. 660–676, doi: 10.1007/978-3-030-58568-6 39

work page doi:10.1007/978-3-030-58568-6 2020

[18] [19]

Understanding the User Experience of Customer Service Chatbots: What Can We Learn from Customer Satisfaction Surveys?Chatbot Re- search and Design (CONVERSATIONS 2020)

N. Carion, F. Massa, G. Synnaeve, N. Usunier, A. Kirillov, and S. Zagoruyko, “End-to-end object detection with transformers,” in Computer Vision – ECCV 2020, 2020, pp. 213–229, doi: 10.1007/978- 3-030-58452-8 13

work page doi:10.1007/978- 2020

[19] [20]

URL https://proceedings.mlr

Y . Zhao, W. Lv, S. Xu, J. Wei, G. Wang, Q. Dang, Y . Liu, and J. Chen, “Detrs beat yolos on real-time object detection,” inProc. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024, pp. 16 965–16 974, doi: 10.1109/CVPR52733.2024.01605

work page doi:10.1109/cvpr52733.2024.01605 2024

[20] [21]

In: IEEE/CVF Conference on Computer Vision and Pattern Recognition

B. Cheng, I. Misra, A. G. Schwing, A. Kirillov, and R. Gird- har, “Masked-attention mask transformer for universal image seg- mentation,” inProc. IEEE/CVF Conference on Computer Vi- sion and Pattern Recognition (CVPR), 2022, pp. 1280–1289, doi: 10.1109/CVPR52688.2022.00135

work page doi:10.1109/cvpr52688.2022.00135 2022

[21] [22]

In: IEEE/CVF International Conference on Computer Vision

A. Kirillov, E. Mintun, N. Ravi, H. Mao, C. Rolland, L. Gustafson, T. Xiao, S. Whitehead, A. C. Berg, W.-Y . Lo, P. Doll ´ar, and R. Girshick, “Segment anything,” inProc. IEEE/CVF International Conference on Computer Vision (ICCV), 2023, pp. 3992–4003, doi: 10.1109/ICCV51070.2023.00371

work page doi:10.1109/iccv51070.2023.00371 2023

[22] [23]

URL https://proceedings.mlr

Y . Xiong, B. Varadarajan, L. Wu, X. Xiang, F. Xiao, C. Zhu, X. Dai, D. Wang, F. Sun, F. Iandola, R. Krishnamoorthi, and V . Chandra, “Efficientsam: Leveraged masked image pretraining for efficient seg- ment anything,” inProc. IEEE/CVF Conference on Computer Vi- sion and Pattern Recognition (CVPR), 2024, pp. 16 111–16 121, doi: 10.1109/CVPR52733.2024.01525

work page doi:10.1109/cvpr52733.2024.01525 2024

[23] [24]

DINOv2: Learning Robust Visual Features without Supervision

M. Oquab, T. Darcet, T. Moutakanni, H. V . V o, M. Szafraniec, V . Khali- dov, P. Fernandez, D. Haziza, F. Massa, A. El-Nouby, M. Assran, N. Ballas, W. Galuba, R. Howes, P.-Y . Huang, S.-W. Li, I. Misra, M. Rabbat, V . Sharma, G. Synnaeve, H. Xu, H. J ´egou, J. Mairal, P. Labatut, A. Joulin, and P. Bojanowski, “Dinov2: Learning robust visual features with...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2304.07193 2023

[24] [25]

In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)

D. Bolya, C. Zhou, F. Xiao, and Y . J. Lee, “Yolact: Real- time instance segmentation,” inProc. IEEE/CVF International Con- ference on Computer Vision (ICCV), 2019, pp. 9156–9165, doi: 10.1109/ICCV .2019.00925

work page doi:10.1109/iccv 2019

[25] [26]

Solo: Segmenting objects by locations,

X. Wang, T. Kong, C. Shen, Y . Jiang, and L. Li, “Solo: Segmenting objects by locations,” inComputer Vision – ECCV 2020, 2020, pp. 649– 665, doi: 10.1007/978-3-030-58523-5 38

work page doi:10.1007/978-3-030-58523-5 2020

[26] [27]

In: 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp

A. Kirillov, Y . Wu, K. He, and R. Girshick, “Pointrend: Image seg- mentation as rendering,” inProc. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020, pp. 9796–9805, doi: 10.1109/CVPR42600.2020.00982

work page doi:10.1109/cvpr42600.2020.00982 2020

[27] [28]

Mask scoring r-cnn,

Z. Huang, L. Huang, Y . Gong, C. Huang, and X. Wang, “Mask scoring r-cnn,” inProc. IEEE/CVF Conference on Computer Vi- sion and Pattern Recognition (CVPR), 2019, pp. 6402–6411, doi: 10.1109/CVPR.2019.00657

work page doi:10.1109/cvpr.2019.00657 2019

[28] [29]

In: IEEE/CVF Conference on Computer Vision and Pattern Recognition

T. Cheng, X. Wang, S. Chen, W. Zhang, Q. Zhang, C. Huang, Z. Zhang, and W. Liu, “Sparse instance activation for real-time in- stance segmentation,” inProc. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022, pp. 4423–4432, doi: 10.1109/CVPR52688.2022.00439

work page doi:10.1109/cvpr52688.2022.00439 2022

[29] [30]

In: IEEE/CVF Conference on Computer Vision and Pattern Recognition

Z. Liu, H. Mao, C.-Y . Wu, C. Feichtenhofer, T. Darrell, and S. Xie, “A convnet for the 2020s,” inProc. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022, pp. 11 966–11 976, doi: 10.1109/CVPR52688.2022.01167

work page doi:10.1109/cvpr52688.2022.01167 2022

[30] [31]

The lovasz-softmax loss: A tractable surrogate for the optimization of the intersection-over- union measure in neural networks,

M. Berman, A. R. Triki, and M. B. Blaschko, “The lovasz-softmax loss: A tractable surrogate for the optimization of the intersection-over- union measure in neural networks,” inProc. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 4413– 4421, doi: 10.1109/CVPR.2018.00464

work page doi:10.1109/cvpr.2018.00464 2018

[31] [32]

Bipa: Bilevel prompt adaptation for underwater instance segmentation,

L. Ma, H. Zheng, Y . Mao, J. Liu, C. Xu, X. Xue, Y . Wang, X. He, and W. Wang, “Bipa: Bilevel prompt adaptation for underwater instance segmentation,” inProc. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2026, pp. 10 731–10 740

2026

[32] [33]

Usis- pgm: Photometric gaussian mixtures for underwater salient in- stance segmentation,

L. Hong, X. Yao, M. Bozkurt, X. Wang, and F. Zhang, “Usis- pgm: Photometric gaussian mixtures for underwater salient in- stance segmentation,”arXiv preprint arXiv:2603.13961, 2026, doi: 10.48550/arXiv.2603.13961

work page doi:10.48550/arxiv.2603.13961 2026