arxiv: 2506.01015 · v2 · pith:DJEFPVP2new · submitted 2025-06-01 · 💻 cs.CV

AuralSAM2: Enabling SAM2 Hear Through Pyramid Audio-Visual Feature Prompting

Pith reviewed 2026-05-19 11:11 UTC · model grok-4.3

classification 💻 cs.CV
keywords SAM2 · audio-visual segmentation · promptable segmentation · cross-modal fusion · feature pyramid · contrastive loss · video object segmentation

The pith

AuralSAM2 adds audio to SAM2 by propagating fused audio-visual prompts through the model's feature pyramid and by training with an audio-guided contrastive loss.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces AuralSAM2 to let SAM2 use audio as an additional prompt type for video segmentation. Its AuralFuser module combines audio and visual features into sparse and dense prompts, which then travel through SAM2's existing feature pyramid layers. An audio-guided contrastive loss keeps the visual features attentive to the audio signal. If this works, users could guide video segmentation with sound cues in interactive settings, without first converting audio into boxes and without slowing the model down much.

Core claim

AuralSAM2 integrates audio into SAM2 while largely preserving its promptable segmentation capability. Its core module, AuralFuser, fuses audio and visual features to generate sparse and dense prompts. Guided by audio and built upon SAM2's feature pyramid, these prompts propagate auditory cues across visual layers, reinforcing cross-modal influence. To further align modalities, an audio-guided contrastive loss emphasises auditory relevance in dominant visual features.
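The audio-guided contrastive loss mentioned above can be pictured with a generic InfoNCE-style objective. This is only a minimal sketch under stated assumptions, not the paper's formulation: it assumes a single audio embedding acts as the anchor, visual features from sounding regions are positives, and visual features from silent regions are negatives. The function names, the cosine similarity, and the `temperature` default are all illustrative choices.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-12)


def audio_guided_info_nce(audio, positives, negatives, temperature=0.1):
    """InfoNCE-style loss with the audio embedding as the anchor.

    positives: visual features assumed to come from sounding regions.
    negatives: visual features assumed to come from silent regions.
    (Hypothetical setup for illustration; not the paper's exact loss.)
    """
    pos_terms = [math.exp(cosine(audio, p) / temperature) for p in positives]
    neg_sum = sum(math.exp(cosine(audio, n) / temperature) for n in negatives)
    # Average, over positives, of -log(pos / (pos + sum of negatives)).
    losses = [-math.log(e / (e + neg_sum)) for e in pos_terms]
    return sum(losses) / len(losses)


if __name__ == "__main__":
    audio = [1.0, 0.0]
    aligned = audio_guided_info_nce(audio, [[0.9, 0.1]], [[0.0, 1.0]])
    misaligned = audio_guided_info_nce(audio, [[0.0, 1.0]], [[0.9, 0.1]])
    print(aligned < misaligned)
```

The loss is small when visual features of sounding regions point in the same direction as the audio embedding and large when silent regions do, which is one way to read "emphasising auditory relevance" in visual features.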

What carries the argument

AuralFuser, which fuses audio and visual features on top of SAM2's feature pyramid to produce audio-guided sparse and dense prompts that propagate cross-modal cues through the network layers.
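To make the idea of per-level audio-visual prompts concrete, here is a toy sketch. It is not the AuralFuser architecture: it assumes the audio embedding and the visual tokens already share one dimension, and it uses plain scaled dot-product attention at each pyramid level. The names `fuse_level` and `fuse_pyramid`, and the reading of "sparse" as an attention-pooled vector and "dense" as a per-token relevance map, are assumptions made for illustration.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def fuse_level(audio, tokens):
    """Fuse one pyramid level: the audio vector attends over visual tokens.

    Returns (sparse_prompt, dense_prompt), where sparse_prompt is an
    attention-pooled visual vector and dense_prompt is the per-token
    audio relevance map (hypothetical definitions for this sketch).
    """
    scale = math.sqrt(len(audio))
    weights = softmax([dot(audio, t) / scale for t in tokens])
    sparse = [sum(w * t[d] for w, t in zip(weights, tokens))
              for d in range(len(audio))]
    return sparse, weights


def fuse_pyramid(audio, pyramid):
    """Apply the same audio query at every level of a feature pyramid."""
    return [fuse_level(audio, tokens) for tokens in pyramid]


if __name__ == "__main__":
    audio = [1.0, 0.0]
    pyramid = [
        [[1.0, 0.0], [0.0, 1.0]],                          # coarse level
        [[0.0, 1.0], [0.5, 0.5], [2.0, 0.0], [0.0, 0.0]],  # finer level
    ]
    for sparse, dense in fuse_pyramid(audio, pyramid):
        print(len(sparse), len(dense))
```

Reusing the audio query at every level is the point of the sketch: the auditory cue is re-injected at each scale rather than once at the input, which is the intuition behind avoiding prompt dilution.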

If this is right

  • Notable accuracy gains on public audio-visual segmentation benchmarks.
  • Only minimal impact on the interactive efficiency of promptable segmentation.
  • Reduced audio prompt dilution compared to earlier adapter-based fusion methods.
  • Largely preserved ability to use visual prompts alone, with little or no performance loss.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The approach could be tested on longer video sequences where audio cues might help maintain object identity across cuts or occlusions.
  • Similar pyramid fusion might improve other promptable models when adding sound or other non-visual signals.
  • It opens a path to segmentation systems that switch between visual, audio, or combined prompts depending on which modality is clearest in a given frame.

Load-bearing premise

The pyramid propagation of auditory cues and the audio-guided contrastive loss will reinforce cross-modal influence without causing audio prompt dilution or harming SAM2's original generalization on visual prompts.

What would settle it

Running the method on a standard audio-visual segmentation benchmark and finding either no measurable accuracy gain over baseline SAM2 or a clear drop in frames-per-second during interactive prompting.
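The two quantities in this test, segmentation accuracy and interactive speed, can be measured with very little machinery. The following is a minimal sketch, assuming binary masks flattened to lists of 0/1 and a hypothetical `segment_fn` callable standing in for whichever model is being timed; it does not reproduce any benchmark's official evaluation script.

```python
import time


def jaccard(pred, target):
    """Jaccard index (IoU) between two flattened binary masks."""
    inter = sum(1 for p, t in zip(pred, target) if p and t)
    union = sum(1 for p, t in zip(pred, target) if p or t)
    # Two empty masks are treated as a perfect match.
    return 1.0 if union == 0 else inter / union


def mean_jaccard(preds, targets):
    """Average IoU over a sequence of frames."""
    return sum(jaccard(p, t) for p, t in zip(preds, targets)) / len(preds)


def frames_per_second(segment_fn, frames):
    """Wall-clock throughput of segment_fn over the given frames."""
    start = time.perf_counter()
    for frame in frames:
        segment_fn(frame)
    elapsed = time.perf_counter() - start
    return len(frames) / elapsed if elapsed > 0 else float("inf")


if __name__ == "__main__":
    preds = [[1, 1, 0, 0], [0, 0, 0, 0]]
    targets = [[1, 0, 0, 0], [0, 0, 0, 0]]
    print(mean_jaccard(preds, targets))
    print(frames_per_second(lambda f: sum(f), preds) > 0)
```

Running both functions once for baseline SAM2 and once for the audio-enabled model, on the same frames and hardware, gives the pair of numbers the test asks for.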

Figures

Figures reproduced from arXiv: 2506.01015 by Can Peng, Chong Wang, Gustavo Carneiro, Jingkun Chen, Junde Wu, Junlin Han, Yuanhong Chen, Yu Tian, Yuyuan Liu.

Figure 1. Prompt Engineering for Integrating Audio Signals in AVSBench (V1m) [67]. SAM2 (AVS) includes re-implemented adapter-based methods GAVS [58] and SAMA-AVS [33], along with AL-REF [18], which process audio signals to segment sounding objects. To simulate human-in-the-loop scenarios, SAM2 (Ensemble) combines the SAM2 (AVS) results with SAM2 outputs guided by visual prompts generated from ground truth. them, …

Figure 2. Illustration of our approach in a language-aided AVS dataset [59]. Audio WAV and text sentences are processed via VGGish [5] and RoBERTa [36], respectively, and then combined. Visual features are extracted from SAM2 [51] in a pyramid structure and processed through PatchEmbedding in Eq. (1) with varying patch sizes (equivalent to the Lateral Layer when k=3), then merged using Eq. (4). The visual and audio-…

Figure 3. Ablation Studies on missing modalities in Ref-AVS (Seen subset) [59] using Hiera l backbone, evaluating the importance of audio, language and visual modalities.

Figure 4. Qualitative visualisations on the Ref-AVS […

Figure 5. ‘The object making a sound by being played by the woman.’ from Ref-AVS (seen) […

Figure 6. ‘The object producing sound under the manipulation of the individual on the left.’ from Ref-AVS (seen) […

Figure 7. ‘The object making the longest sound duration.’ from Ref-AVS (unseen) […

Figure 8. ‘The object that keeps making sound at all times.’ (from Ref-AVS (unseen) […

Figure 9. Case (a) from AVSBench (V1s) [67]. Panel columns: Frame, Label, GAVS [58], SAMA [33], Ours (b+), Ours (l).

Figure 10. Case (b) from AVSBench (V1s) [67].

Figure 11. Case (a) from AVSBench (V1m) [67]. Panel columns: Frame, Label, GAVS [58], SAMA [33], Ours (b+), Ours (l).

Figure 12. Case (b) from AVSBench (V1m) [67].

Figure 13. Case (a) from AVSBench (V2) [68]. Panel columns: Frame, Label, GAVS [58], SAMA [33], Ours (b+), Ours (l).

Figure 14. Case (b) from AVSBench (V2) [68].

Original abstract

Segment Anything Model 2 (SAM2) exhibits strong generalisation for promptable segmentation in video clips; however, its integration with the audio modality remains underexplored. Existing approaches either convert audio into visual prompts (e.g., boxes) via foundation models, or inject adapters into the image encoder for audio-visual fusion. Yet both directions fall short in human-in-the-loop scenarios due to limited prompt accuracy and increased inference overhead. In particular, these adapter-based methods often suffer from audio prompt dilution, where the signal gradually weakens as it propagates through the network. In this work, we propose AuralSAM2, which integrates audio into SAM2 while largely preserving its promptable segmentation capability. Its core module, AuralFuser, fuses audio and visual features to generate sparse and dense prompts. Guided by audio and built upon SAM2's feature pyramid, these prompts propagate auditory cues across visual layers, reinforcing cross-modal influence. To further align modalities, we introduce an audio-guided contrastive loss that emphasises auditory relevance in dominant visual features. Our method achieves notable accuracy gains on public benchmarks with only minimal impact on the interactive efficiency of promptable segmentation. Our code is available at https://github.com/yyliu01/AuralSAM2.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The manuscript introduces AuralSAM2, an extension of SAM2 for audio-visual promptable segmentation in video. It proposes the AuralFuser module that fuses audio and visual features to generate sparse and dense prompts propagating auditory cues through SAM2's feature pyramid, together with an audio-guided contrastive loss to align modalities. The central claims are notable accuracy gains on public benchmarks, minimal impact on interactive efficiency, and largely preserved generalization on visual prompts.

Significance. If the accuracy gains hold without degrading visual-prompt performance, the work would meaningfully advance audio integration into promptable video segmentation models by mitigating prompt dilution while retaining SAM2's interactive strengths. The public code release is a positive factor for reproducibility.

major comments (1)
  1. [Experiments] Experiments section: No ablation or side-by-side evaluation is reported comparing AuralSAM2 to unmodified SAM2 on standard visual-only prompt tasks (box/point prompts on SA-V or DAVIS). This directly undermines the claim that the pyramid propagation via AuralFuser and the contrastive loss 'largely preserve' SAM2's original generalization, as even small alterations to visual feature pathways could cause negative transfer.
minor comments (1)
  1. [Abstract] Abstract: The phrase 'notable accuracy gains' is not accompanied by specific metrics, datasets, or baseline comparisons, making the headline claim harder to evaluate at a glance.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive comments and the recommendation for major revision. We address the major comment point by point below.

Point-by-point responses
  1. Referee: [Experiments] Experiments section: No ablation or side-by-side evaluation is reported comparing AuralSAM2 to unmodified SAM2 on standard visual-only prompt tasks (box/point prompts on SA-V or DAVIS). This directly undermines the claim that the pyramid propagation via AuralFuser and the contrastive loss 'largely preserve' SAM2's original generalization, as even small alterations to visual feature pathways could cause negative transfer.

    Authors: We acknowledge this point. To strengthen the evidence that our modifications largely preserve SAM2's generalization on visual prompts, we will include in the revised manuscript additional experiments that directly compare AuralSAM2 to the unmodified SAM2 using box and point prompts on the SA-V and DAVIS datasets. These evaluations will be performed in a visual-only setting to demonstrate the absence of negative transfer.

    Revision: yes
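A visual-only comparison of this kind reduces to a paired test on per-video scores. The sketch below is one possible way to flag negative transfer, assuming each video yields one score per model (for example a mean IoU). The `tolerance` value and the two-standard-error margin are illustrative choices, not taken from the paper or the referee report.

```python
import math
from statistics import mean, stdev


def paired_delta(baseline_scores, modified_scores):
    """Mean per-video score change (modified minus baseline) and its standard error."""
    deltas = [m - b for b, m in zip(baseline_scores, modified_scores)]
    sem = stdev(deltas) / math.sqrt(len(deltas)) if len(deltas) > 1 else 0.0
    return mean(deltas), sem


def shows_negative_transfer(baseline_scores, modified_scores, tolerance=0.005):
    """Flag a drop only if it exceeds the tolerance by about two standard errors."""
    delta, sem = paired_delta(baseline_scores, modified_scores)
    return delta + 2 * sem < -tolerance


if __name__ == "__main__":
    sam2 = [0.80, 0.82, 0.78, 0.81]          # hypothetical baseline scores
    same = [0.80, 0.821, 0.779, 0.81]        # hypothetical: essentially unchanged
    worse = [0.70, 0.71, 0.69, 0.70]         # hypothetical: clearly degraded
    print(shows_negative_transfer(sam2, same))
    print(shows_negative_transfer(sam2, worse))
```

Pairing scores by video removes between-video difficulty from the comparison, so even a small systematic drop in the modified model becomes visible.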

Circularity Check

0 steps flagged

No circularity: novel AuralFuser and contrastive loss form independent derivation chain

full rationale

The paper introduces AuralFuser as a new module that fuses audio-visual features to generate sparse/dense prompts on SAM2's pyramid, plus an audio-guided contrastive loss for modality alignment. These are presented as architectural additions rather than quantities derived from or fitted to prior outputs by the same authors. Claims of accuracy gains with preserved promptable segmentation rest on benchmark evaluations and the explicit design choices for cross-modal propagation, without self-definitional reductions, fitted-input predictions, or load-bearing self-citations. The derivation is self-contained against external SAM2 baselines and public datasets.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

The claim depends on the effectiveness of cross-modal fusion in the feature pyramid and the assumption that contrastive alignment improves auditory relevance without side effects; no major free parameters or invented physical entities are detailed beyond the new module itself.

free parameters (1)
  • contrastive loss weighting factor
    Hyperparameter balancing the audio-guided contrastive loss against the main segmentation objective.
axioms (1)
  • domain assumption: SAM2 feature pyramid layers can effectively propagate auditory cues to reinforce cross-modal influence
    Invoked in the design of AuralFuser and prompt propagation across visual layers.
invented entities (1)
  • AuralFuser (no independent evidence)
    purpose: Module to fuse audio and visual features into sparse and dense prompts
    New component introduced to address audio prompt dilution.
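The free parameter listed in this ledger can be read as the weight in a combined training objective. A minimal sketch, assuming the two terms are simply added; the symbols $\lambda$, $\mathcal{L}_{\text{seg}}$, and $\mathcal{L}_{\text{con}}$ are illustrative names, not notation taken from the paper:

```latex
\mathcal{L}_{\text{total}} \;=\; \mathcal{L}_{\text{seg}} \;+\; \lambda\,\mathcal{L}_{\text{con}},
\qquad \lambda \ge 0
```

Under this reading, $\lambda = 0$ recovers training on the segmentation objective alone, and larger values push the visual features harder toward agreement with the audio signal.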

pith-pipeline@v0.9.0 · 5778 in / 1242 out tokens · 40926 ms · 2026-05-19T11:11:23.071926+00:00 · methodology

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. From Failure to Feedback: Group Revision Unlocks Hard Cases in Object-Level Grounding

    cs.CV 2026-05 unverdicted novelty 6.0

    A group-revision paradigm for GRPO-based RL fine-tuning of VLMs converts failure responses into improvement signals that refine rewards and advantages, yielding gains on referring segmentation, REC, and counting benchmarks.

Reference graph

Works this paper leans on

72 extracted references · 72 canonical work pages · cited by 1 Pith paper · 8 internal anchors

  1. [1]

    GPT-4 Technical Report

    Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. GPT-4 technical report. arXiv preprint arXiv:2303.08774,

  2. [2]

    Multimodal machine learning: A survey and taxonomy

    Tadas Baltrušaitis, Chaitanya Ahuja, and Louis-Philippe Morency. Multimodal machine learning: A survey and taxonomy. IEEE transactions on pattern analysis and machine intelligence, 41(2):423–443, 2018. 1

  3. [3]

    A survey on deep multimodal learning for computer vision: advances, trends, applications, and datasets

    Khaled Bayoudh, Raja Knani, Fayçal Hamdaoui, and Abdellatif Mtibaa. A survey on deep multimodal learning for computer vision: advances, trends, applications, and datasets. The Visual Computer, 38(8):2939–2970, 2022. 2

  4. [4]

    Emerging properties in self-supervised vision transformers

    Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerging properties in self-supervised vision transformers. In Proceedings of the IEEE/CVF international conference on computer vision, pages 9650–9660, 2021. 2

  5. [5]

    Vggsound: A large-scale audio-visual dataset

    Honglie Chen, Weidi Xie, Andrea Vedaldi, and Andrew Zisserman. Vggsound: A large-scale audio-visual dataset. In ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 721–725. IEEE, 2020. 4

  6. [6]

    Localizing visual sounds the hard way

    Honglie Chen, Weidi Xie, Triantafyllos Afouras, Arsha Nagrani, Andrea Vedaldi, and Andrew Zisserman. Localizing visual sounds the hard way. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16867–16876, 2021. 3

  7. [7]

    Zero-shot audio source separation through query-based learning from weakly-labeled data

    Ke Chen, Xingjian Du, Bilei Zhu, Zejun Ma, Taylor Berg-Kirkpatrick, and Shlomo Dubnov. Zero-shot audio source separation through query-based learning from weakly-labeled data. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 4441–4449, 2022. 3

  8. [8]

    Unraveling instance associations: A closer look for audio-visual segmentation

    Yuanhong Chen, Yuyuan Liu, Hu Wang, Fengbei Liu, Chong Wang, Helen Frazer, and Gustavo Carneiro. Unraveling instance associations: A closer look for audio-visual segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 26497–26507,

  9. [9]

    Ccstereo: Audio-visual contextual and contrastive learning for binaural audio generation

    Yuanhong Chen, Kazuki Shimada, Christian Simon, Yukara Ikemiya, Takashi Shibuya, and Yuki Mitsufuji. Ccstereo: Audio-visual contextual and contrastive learning for binaural audio generation. arXiv preprint arXiv:2501.02786, 2025. 3

  10. [10]

    Cpm: Class-conditional prompting machine for audio-visual segmentation

    Yuanhong Chen, Chong Wang, Yuyuan Liu, Hu Wang, and Gustavo Carneiro. Cpm: Class-conditional prompting machine for audio-visual segmentation. In European Conference on Computer Vision, pages 438–456. Springer, 2025. 2, 3, 4, 5, 6, 12

  11. [11]

    2.5 d visual sound

    Ruohan Gao and Kristen Grauman. 2.5 d visual sound. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 324–333, 2019. 3

  12. [12]

    Co-separating sounds of visual objects

    Ruohan Gao and Kristen Grauman. Co-separating sounds of visual objects. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3879–3888,

  13. [13]

    Avsegformer: Audio-visual segmentation with transformer

    Shengyi Gao, Zhe Chen, Guo Chen, Wenhai Wang, and Tong Lu. Avsegformer: Audio-visual segmentation with transformer. arXiv preprint arXiv:2307.01146, 2023. 6

  14. [14]

    Improving audio-visual segmentation with bidirectional generation

    Dawei Hao, Yuxin Mao, Bowen He, Xiaodong Han, Yuchao Dai, and Yiran Zhong. Improving audio-visual segmentation with bidirectional generation. arXiv preprint arXiv:2308.08288, 2023. 3, 6

  15. [15]

    Masked autoencoders are scalable vision learners

    Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross Girshick. Masked autoencoders are scalable vision learners. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 16000–16009, 2022. 2

  16. [16]

    Leveraging hallucinations to reduce manual prompt dependency in promptable segmentation

    Jian Hu, Jiayi Lin, Junchi Yan, and Shaogang Gong. Leveraging hallucinations to reduce manual prompt dependency in promptable segmentation. arXiv preprint arXiv:2408.15205,

  17. [17]

    Discovering sounding objects by audio queries for audio visual segmentation

    Shaofei Huang, Han Li, Yuqing Wang, Hongji Zhu, Jiao Dai, Jizhong Han, Wenge Rong, and Si Liu. Discovering sounding objects by audio queries for audio visual segmentation. arXiv preprint arXiv:2309.09501, 2023. 3

  18. [18]

    Unleashing the temporal-spatial reasoning capacity of gpt for training-free audio and language referenced video object segmentation

    Shaofei Huang, Rui Ling, Hongyu Li, Tianrui Hui, Zongheng Tang, Xiaoming Wei, Jizhong Han, and Si Liu. Unleashing the temporal-spatial reasoning capacity of gpt for training-free audio and language referenced video object segmentation. arXiv preprint arXiv:2408.15876, 2024. 1, 2, 3, 6, 7

  19. [19]

    Scaling up visual and vision-language representation learning with noisy text supervision

    Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc Le, Yun-Hsuan Sung, Zhen Li, and Tom Duerig. Scaling up visual and vision-language representation learning with noisy text supervision. In International conference on machine learning, pages 4904–4916. PMLR,

  20. [20]

    Visual prompt tuning

    Menglin Jia, Luming Tang, Bor-Chun Chen, Claire Cardie, Serge Belongie, Bharath Hariharan, and Ser-Nam Lim. Visual prompt tuning. In European Conference on Computer Vision, pages 709–727. Springer, 2022. 2, 3

  21. [21]

    Supervised contrastive learning

    Prannay Khosla, Piotr Teterwak, Chen Wang, Aaron Sarna, Yonglong Tian, Phillip Isola, Aaron Maschinot, Ce Liu, and Dilip Krishnan. Supervised contrastive learning. Advances in neural information processing systems, 33:18661–18673,

  22. [22]

    Segment anything

    Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C Berg, Wan-Yen Lo, et al. Segment anything. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4015–4026, 2023. 1, 2, 6

  23. [23]

    Selm: Selective mechanism based audio-visual segmentation

    Jiaxu Li, Songsong Yu, Yifan Wang, Lijun Wang, and Huchuan Lu. Selm: Selective mechanism based audio-visual segmentation. In Proceedings of the 32nd ACM International Conference on Multimedia, pages 3926–3935, 2024. 3, 6

  24. [24]

    Catr: Combinatorial-dependence audio-queried transformer for audio-visual video segmentation

    Kexin Li, Zongxin Yang, Lei Chen, Yi Yang, and Jun Xun. Catr: Combinatorial-dependence audio-queried transformer for audio-visual video segmentation. arXiv preprint arXiv:2309.09709, 2023. 3

  25. [25]

    Robust referring video object segmentation with cyclic structural consensus

    Xiang Li, Jinglu Wang, Xiaohao Xu, Xiao Li, Bhiksha Raj, and Yan Lu. Robust referring video object segmentation with cyclic structural consensus. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 22236–22245, 2023. 6

  26. [26]

    Cascade-clip: Cascaded vision-language embeddings alignment for zero-shot semantic segmentation

    Yunheng Li, ZhongYu Li, Quansheng Zeng, Qibin Hou, and Ming-Ming Cheng. Cascade-clip: Cascaded vision-language embeddings alignment for zero-shot semantic segmentation. arXiv preprint arXiv:2406.00670, 2024. 2

  27. [27]

    Feature pyramid networks for object detection

    Tsung-Yi Lin, Piotr Dollár, Ross Girshick, Kaiming He, Bharath Hariharan, and Serge Belongie. Feature pyramid networks for object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2117–2125, 2017. 2, 4, 5

  28. [28]

    Vision transformers are parameter-efficient audio-visual learners

    Yan-Bo Lin, Yi-Lin Sung, Jie Lei, Mohit Bansal, and Gedas Bertasius. Vision transformers are parameter-efficient audio-visual learners. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2299–2309, 2023. 3

  29. [29]

    Bavs: Bootstrapping audio-visual segmentation by integrating foundation knowledge

    Chen Liu, Peike Li, Hu Zhang, Lincheng Li, Zi Huang, Dadong Wang, and Xin Yu. Bavs: Bootstrapping audio-visual segmentation by integrating foundation knowledge. arXiv preprint arXiv:2308.10175, 2023. 6

  30. [30]

    Improved baselines with visual instruction tuning

    Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 26296–26306, 2024. 1

  31. [31]

    Visual instruction tuning

    Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. Advances in neural information processing systems, 36, 2024. 1, 2

  32. [32]

    A Survey on Hallucination in Large Vision-Language Models

    Hanchao Liu, Wenyuan Xue, Yifei Chen, Dapeng Chen, Xiutian Zhao, Ke Wang, Liping Hou, Rongjun Li, and Wei Peng. A survey on hallucination in large vision-language models. arXiv preprint arXiv:2402.00253, 2024. 2

  33. [33]

    Annotation-free audio-visual segmentation

    Jinxiang Liu, Yu Wang, Chen Ju, Chaofan Ma, Ya Zhang, and Weidi Xie. Annotation-free audio-visual segmentation. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 5604–5614, 2024. 1, 2, 3, 6, 7, 8, 12, 13, 14, 15, 16, 17, 18

  34. [34]

    Grounding dino: Marrying dino with grounded pre-training for open-set object detection

    Shilong Liu, Zhaoyang Zeng, Tianhe Ren, Feng Li, Hao Zhang, Jie Yang, Qing Jiang, Chunyuan Li, Jianwei Yang, Hang Su, et al. Grounding dino: Marrying dino with grounded pre-training for open-set object detection. In European Conference on Computer Vision, pages 38–55. Springer, 2024. 2, 3

  35. [35]

    Separate anything you describe

    Xubo Liu, Qiuqiang Kong, Yan Zhao, Haohe Liu, Yi Yuan, Yuzhuo Liu, Rui Xia, Yuxuan Wang, Mark D Plumbley, and Wenwu Wang. Separate anything you describe. IEEE/ACM Transactions on Audio, Speech, and Language Processing,

  36. [36]

    RoBERTa: A Robustly Optimized BERT Pretraining Approach

    Yinhan Liu. Roberta: A robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692, 364, 2019. 4

  37. [37]

    Contrastive multimodal fusion with tupleinfonce

    Yunze Liu, Qingnan Fan, Shanghang Zhang, Hao Dong, Thomas Funkhouser, and Li Yi. Contrastive multimodal fusion with tupleinfonce. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 754–763,

  38. [38]

    Residual pattern learning for pixel-wise out-of-distribution detection in semantic segmentation

    Yuyuan Liu, Choubo Ding, Yu Tian, Guansong Pang, Vasileios Belagiannis, Ian Reid, and Gustavo Carneiro. Residual pattern learning for pixel-wise out-of-distribution detection in semantic segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 1151–1161, 2023. 2, 6, 12

  39. [39]

    Ittakestwo: Leveraging peer representations for semi-supervised lidar semantic segmentation

    Yuyuan Liu, Yuanhong Chen, Hu Wang, Vasileios Belagiannis, Ian Reid, and Gustavo Carneiro. Ittakestwo: Leveraging peer representations for semi-supervised lidar semantic segmentation. In European Conference on Computer Vision, pages 81–99. Springer, 2024. 2, 12

  40. [40]

    Decoupled Weight Decay Regularization

    I Loshchilov. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017. 6, 12

  41. [41]

    DeepSeek-VL: Towards Real-World Vision-Language Understanding

    Haoyu Lu, Wen Liu, Bo Zhang, Bingxuan Wang, Kai Dong, Bo Liu, Jingxiang Sun, Tongzheng Ren, Zhuoshu Li, Hao Yang, et al. Deepseek-vl: towards real-world vision-language understanding. arXiv preprint arXiv:2403.05525,

  42. [42]

    Stepping stones: A progressive training strategy for audio-visual semantic segmentation

    Juncheng Ma, Peiwen Sun, Yaoting Wang, and Di Hu. Stepping stones: A progressive training strategy for audio-visual semantic segmentation. IEEE European Conference on Computer Vision (ECCV), 2024. 3, 6, 9, 12

  43. [43]

    Multimodal variational auto-encoder based audio-visual segmentation

    Yuxin Mao, Jing Zhang, Mochu Xiang, Yiran Zhong, and Yuchao Dai. Multimodal variational auto-encoder based audio-visual segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 954–965, 2023. 3

  44. [44]

    A closer look at weakly-supervised audio-visual source localization

    Shentong Mo and Pedro Morgado. A closer look at weakly-supervised audio-visual source localization. arXiv preprint arXiv:2209.09634, 2022. 3

  45. [45]

    Localizing visual sounds the easy way

    Shentong Mo and Pedro Morgado. Localizing visual sounds the easy way. In Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXVII, pages 218–234. Springer, 2022. 3

  46. [46]

    Weakly-supervised audio-visual segmentation

    Shentong Mo and Bhiksha Raj. Weakly-supervised audio-visual segmentation. Advances in Neural Information Processing Systems, 36:17208–17221, 2023. 3

  47. [47]

    Av-sam: Segment anything model meets audio-visual localization and segmentation

    Shentong Mo and Yapeng Tian. Av-sam: Segment anything model meets audio-visual localization and segmentation. arXiv preprint arXiv:2305.01836, 2023. 2, 3

  48. [48]

    Representation Learning with Contrastive Predictive Coding

    Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748, 2018. 2, 5, 12

  49. [49]

    DINOv2: Learning Robust Visual Features without Supervision

    Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. Dinov2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193, 2023. 1, 2

  50. [50]

    Learning transferable visual models from natural language supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021. 1, 2

  51. [51]

    SAM 2: Segment Anything in Images and Videos

    Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu, Ronghang Hu, Chaitanya Ryali, Tengyu Ma, Haitham Khedr, Roman Rädle, Chloe Rolland, Laura Gustafson, et al. Sam 2: Segment anything in images and videos. arXiv preprint arXiv:2408.00714, 2024. 1, 2, 4, 5, 6, 8, 12, 13

  52. [52]

    Hiera: A hierarchical vision transformer without the bells-and-whistles

    Chaitanya Ryali, Yuan-Ting Hu, Daniel Bolya, Chen Wei, Haoqi Fan, Po-Yao Huang, Vaibhav Aggarwal, Arkabandhu Chowdhury, Omid Poursaeed, Judy Hoffman, et al. Hiera: A hierarchical vision transformer without the bells-and-whistles. In International Conference on Machine Learning, pages 29441–29454. PMLR, 2023. 3, 4, 7

  53. [53]

    Extending segment anything model into auditory and temporal dimensions for audio-visual segmentation

    Juhyeong Seon, Woobin Im, Sebin Lee, Jumin Lee, and Sung-Eui Yoon. Extending segment anything model into auditory and temporal dimensions for audio-visual segmentation. arXiv preprint arXiv:2406.06163, 2024. 2, 3

  54. [54]

    Long-tail learning with foundation model: Heavy fine-tuning hurts

    Jiang-Xin Shi, Tong Wei, Zhi Zhou, Jie-Jing Shao, Xin-Yan Han, and Yu-Feng Li. Long-tail learning with foundation model: Heavy fine-tuning hurts. arXiv preprint arXiv:2309.10019, 2023. 2

  55. [55]

    Bioclip: A vision foundation model for the tree of life

    Samuel Stevens, Jiaman Wu, Matthew J Thompson, Elizabeth G Campolongo, Chan Hee Song, David Edward Carlyn, Li Dong, Wasila M Dahdul, Charles Stewart, Tanya Berger-Wolf, et al. Bioclip: A vision foundation model for the tree of life. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19412–19424,

  56. [56]

    Exploring cross-image pixel contrast for semantic segmentation

    Wenguan Wang, Tianfei Zhou, Fisher Yu, Jifeng Dai, Ender Konukoglu, and Luc Van Gool. Exploring cross-image pixel contrast for semantic segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 7303–7313, 2021. 2, 3, 6, 12

  57. [57]

    Pvt v2: Improved baselines with pyramid vision transformer

    Wenhai Wang, Enze Xie, Xiang Li, Deng-Ping Fan, Kaitao Song, Ding Liang, Tong Lu, Ping Luo, and Ling Shao. Pvt v2: Improved baselines with pyramid vision transformer. Computational Visual Media, 8(3):415–424, 2022. 12

  58. [58]

    Prompting segmentation with sound is generalizable audio-visual source localizer

    Yaoting Wang, Weisong Liu, Guangyao Li, Jian Ding, Di Hu, and Xi Li. Prompting segmentation with sound is generalizable audio-visual source localizer. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 5669–5677, 2024. 1, 2, 3, 6, 7, 8, 12, 13, 14, 15, 16, 17, 18

  59. [59]

    Ref-avs: Refer and segment objects in audio-visual scenes

    Yaoting Wang, Peiwen Sun, Dongzhan Zhou, Guangyao Li, Honggang Zhang, and Di Hu. Ref-avs: Refer and segment objects in audio-visual scenes. In European Conference on Computer Vision, pages 196–213. Springer, 2025. 1, 3, 4, 5, 6, 7, 8, 12, 13, 14, 15

  60. [60]

    Language as queries for referring video object segmentation

    Jiannan Wu, Yi Jiang, Peize Sun, Zehuan Yuan, and Ping Luo. Language as queries for referring video object segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4974–4984, 2022. 6

  61. [61]

    Multimodal learning with transformers: A survey

    Peng Xu, Xiatian Zhu, and David A Clifton. Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(10):12113–12132, 2023. 1

  62. [62]

    Visually informed binaural audio generation without binaural audios

    Xudong Xu, Hang Zhou, Ziwei Liu, Bo Dai, Xiaogang Wang, and Dahua Lin. Visually informed binaural audio generation without binaural audios. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15485–15494, 2021. 3

  63. [63]

    Cooperation does matter: Exploring multi-order bilateral relations for audio-visual segmentation, 2023

    Qi Yang, Xing Nie, Tong Li, Pengfei Gao, Ying Guo, Cheng Zhen, Pengfei Yan, and Shiming Xiang. Cooperation does matter: Exploring multi-order bilateral relations for audio-visual segmentation, 2023. 2, 6

  64. [64]

    How can contrastive pre-training benefit audio-visual segmentation? a study from supervised and zero-shot perspectives

    Jiarui Yu, Haoran Li, Yanbin Hao, Jinmeng Wu, Tong Xu, Shuo Wang, and Xiangnan He. How can contrastive pre-training benefit audio-visual segmentation? a study from supervised and zero-shot perspectives. In BMVC, pages 367–374, 2023. 2, 3, 6

  65. [65]

    Multimodal contrastive training for visual representation learning

    Xin Yuan, Zhe Lin, Jason Kuen, Jianming Zhang, Yilin Wang, Michael Maire, Ajinkya Kale, and Baldo Faieta. Multimodal contrastive training for visual representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6995–7004,

  66. [66]

    Pyramid scene parsing network

    Hengshuang Zhao, Jianping Shi, Xiaojuan Qi, Xiaogang Wang, and Jiaya Jia. Pyramid scene parsing network. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2881–2890, 2017. 5

  67. [67]

    Audio–visual segmentation

    Jinxing Zhou, Jianyuan Wang, Jiayi Zhang, Weixuan Sun, Jing Zhang, Stan Birchfield, Dan Guo, Lingpeng Kong, Meng Wang, and Yiran Zhong. Audio–visual segmentation. In Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXVII, pages 386–403. Springer, 2022. 1, 2, 3, 5, 6, 7, 8, 12, 13, 16, 17

  68. [68]

    Audio-visual segmentation with semantics

    Jinxing Zhou, Xuyang Shen, Jianyuan Wang, Jiayi Zhang, Weixuan Sun, Jing Zhang, Stan Birchfield, Dan Guo, Lingpeng Kong, Meng Wang, et al. Audio-visual segmentation with semantics. arXiv preprint arXiv:2301.13190, 2023. 1, 3, 6, 7, 12, 13, 18

  69. [69]

    Deep audio-visual learning: A survey

    Hao Zhu, Man-Di Luo, Rui Wang, Ai-Hua Zheng, and Ran He. Deep audio-visual learning: A survey. International Journal of Automation and Computing, 18(3):351–376, 2021. 3

    AuralSAM2: Enabling SAM2 Hear Through Pyramid Audio-Visual Feature Prompting (Supplementary Material)

  70. [70]

    More Implementation Details

    6.1. Hyper-parameter Configuration

    Our method builds on SAM2 [51], using the Hiera base+ and Hiera large backbones in the PyTorch framework; both backbones remain frozen during training. We employ a batch size of one, where each batch consists of 5 frames for the V1s and V1m subsets in AVSBench [67], and 10 frames fo...

  71. [71]

    In the first four rows, we report the visual prompt results for SAM2, including four uniformly generated points and boxes derived from the ground truth mask

    Prompting Engineering

    We provide additional details on prompt engineering based on the Hiera base+ backbone in AVSBench (V1m) [67], as shown in Tab. 7. In the first four rows, we report the visual prompt results for SAM2, including four uniformly generated points and boxes derived from the ground truth mask. Since pixel-level labeled masks are challengin...
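As a rough illustration of the visual-prompt baselines mentioned above, the sketch below derives a box prompt and four point prompts from a binary ground-truth mask. It is not the authors' code and does not call the SAM2 API. The helper names `mask_to_box` and `uniform_points` are hypothetical, and reading "uniformly generated" as "evenly spaced over the foreground pixels in row-major order" is an assumption, since the source text is truncated.

```python
from typing import List, Tuple

Mask = List[List[int]]  # rows of 0/1 values, indexed as mask[y][x]


def foreground(mask: Mask) -> List[Tuple[int, int]]:
    """List foreground pixels as (x, y) pairs in row-major order."""
    coords = [(x, y) for y, row in enumerate(mask) for x, v in enumerate(row) if v]
    if not coords:
        raise ValueError("mask has no foreground pixels")
    return coords


def mask_to_box(mask: Mask) -> Tuple[int, int, int, int]:
    """Tightest (x_min, y_min, x_max, y_max) box around the foreground."""
    coords = foreground(mask)
    xs = [x for x, _ in coords]
    ys = [y for _, y in coords]
    return min(xs), min(ys), max(xs), max(ys)


def uniform_points(mask: Mask, k: int = 4) -> List[Tuple[int, int]]:
    """Pick up to k foreground pixels at evenly spaced positions (assumed scheme)."""
    coords = foreground(mask)
    k = min(k, len(coords))
    n = len(coords)
    # Take the centre element of each of k equal-length chunks of the list.
    return [coords[(2 * i + 1) * n // (2 * k)] for i in range(k)]


toy_mask = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0],
]
box_prompt = mask_to_box(toy_mask)
point_prompts = uniform_points(toy_mask, k=4)
```

Both prompt types depend on a pixel-level mask being available, which is the cost that the passage goes on to discuss.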

  72. [72]

    Visualisations

    In this section, we present qualitative visualization results comparing our method with other adapter-based approaches, GAVS [58] and SAMA-AVS [33]. Specifically, Figures 5 and 6 illustrate the outputs in multimodal scenarios involving audio, language, and visual modalities within the Ref-AVS (seen) [59] subset, while Figures 7 and ...