AuralSAM2: Enabling SAM2 Hear Through Pyramid Audio-Visual Feature Prompting

Can Peng; Chong Wang; Gustavo Carneiro; Jingkun Chen; Junde Wu; Junlin Han; Yuanhong Chen; Yu Tian; Yuyuan Liu

arxiv: 2506.01015 · v2 · pith:DJEFPVP2new · submitted 2025-06-01 · 💻 cs.CV

Yuyuan Liu, Yuanhong Chen, Chong Wang, Junlin Han, Junde Wu, Can Peng, Jingkun Chen, Yu Tian, Gustavo Carneiro

Pith reviewed 2026-05-19 11:11 UTC · model grok-4.3

classification 💻 cs.CV

keywords SAM2 · audio-visual segmentation · promptable segmentation · cross-modal fusion · feature pyramid · contrastive loss · video object segmentation

The pith

AuralSAM2 adds audio to SAM2 by propagating fused audio-visual prompts through the model's feature pyramid and an audio-guided contrastive loss.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces AuralSAM2 to let SAM2 use audio as an additional prompt type for video segmentation. Its AuralFuser module combines audio and visual features into sparse and dense prompts, which then travel through SAM2's existing feature pyramid layers. An audio-guided contrastive loss is added to keep the visual features attentive to the audio signal. If this works, users in interactive settings could guide video segmentation with sound cues, without first converting audio into box prompts and without a large slowdown of the model.

Core claim

AuralSAM2 integrates audio into SAM2 while largely preserving its promptable segmentation capability. Its core module, AuralFuser, fuses audio and visual features to generate sparse and dense prompts. Guided by audio and built upon SAM2's feature pyramid, these prompts propagate auditory cues across visual layers, reinforcing cross-modal influence. To further align modalities, an audio-guided contrastive loss emphasises auditory relevance in dominant visual features.
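The audio-guided contrastive loss is only described at a high level here, so the following is a minimal sketch of a generic InfoNCE-style loss (in the spirit of contrastive predictive coding [48]), not the paper's actual formulation. The function names, the split into positives and negatives, and the temperature value are all assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-12)


def audio_guided_info_nce(audio, positives, negatives, temperature=0.1):
    """Generic InfoNCE-style loss (hypothetical stand-in, not the paper's loss).

    audio     -- one audio embedding (list of floats)
    positives -- visual features assumed to lie on the sounding object
    negatives -- visual features assumed to lie on silent regions
    """
    neg_logits = [cosine(audio, n) / temperature for n in negatives]
    losses = []
    for p in positives:
        pos_logit = cosine(audio, p) / temperature
        logits = [pos_logit] + neg_logits
        # Numerically stable log-sum-exp over one positive and all negatives.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(x - m) for x in logits))
        losses.append(log_denom - pos_logit)
    return sum(losses) / len(losses)
```

The loss shrinks when the positive visual features point in the same direction as the audio embedding and the negatives do not, which is one plausible reading of "emphasising auditory relevance" in the visual features.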

What carries the argument

AuralFuser, which fuses audio and visual features on top of SAM2's feature pyramid to produce audio-guided sparse and dense prompts that propagate cross-modal cues through the network layers.
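The internals of AuralFuser are not given on this page, so the sketch below only illustrates the general idea of producing one sparse and one dense prompt per pyramid level from an audio embedding. It uses plain scaled dot-product attention as a stand-in; `fuse_level`, `fuse_pyramid`, and the additive dense-prompt rule are hypothetical and should not be read as the authors' design.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def fuse_level(audio, visual_tokens):
    """Fuse one pyramid level (illustrative assumption, not AuralFuser itself).

    Returns (sparse_prompt, dense_prompt).
    """
    dim = len(audio)
    scale = math.sqrt(dim)
    # The audio embedding acts as the query over the visual tokens.
    weights = softmax([dot(audio, v) / scale for v in visual_tokens])
    # Sparse prompt: a single token, the attention-weighted mean of visual tokens.
    sparse = [sum(w * v[d] for w, v in zip(weights, visual_tokens))
              for d in range(dim)]
    # Dense prompt: every visual token shifted towards the audio embedding
    # in proportion to its attention weight.
    dense = [[v[d] + w * audio[d] for d in range(dim)]
             for w, v in zip(weights, visual_tokens)]
    return sparse, dense


def fuse_pyramid(audio, pyramid):
    """Apply the same fusion at every level so the audio cue reaches each scale."""
    return [fuse_level(audio, level) for level in pyramid]
```

Running the fusion independently at each level is the simplest way to keep the audio cue present at every scale; a real module would likely use learned projections and share information across levels.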

If this is right

Notable accuracy gains on public audio-visual segmentation benchmarks.
Only minimal impact on the interactive efficiency of promptable segmentation.
Reduced audio prompt dilution compared to earlier adapter-based fusion methods.
Preserved ability to use visual prompts alone without performance loss.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The approach could be tested on longer video sequences where audio cues might help maintain object identity across cuts or occlusions.
Similar pyramid fusion might improve other promptable models when adding sound or other non-visual signals.
It opens a path to segmentation systems that switch between visual, audio, or combined prompts depending on which modality is clearest in a given frame.

Load-bearing premise

The pyramid propagation of auditory cues and the audio-guided contrastive loss will reinforce cross-modal influence without causing audio prompt dilution or harming SAM2's original generalization on visual prompts.

What would settle it

Running the method on a standard audio-visual segmentation benchmark and finding either no measurable accuracy gain over baseline SAM2 or a clear drop in frames-per-second during interactive prompting.
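The two quantities in this test can be sketched as follows: a mean Jaccard index (IoU) over predicted and ground-truth masks, and a simple throughput measurement. This is a minimal illustration under stated assumptions: masks are nested lists of 0/1, and `segment_fn` is a hypothetical callable standing in for whichever model is being timed.

```python
import time


def jaccard(pred, target):
    """IoU between two binary masks given as nested lists of 0/1."""
    inter = union = 0
    for prow, trow in zip(pred, target):
        for p, t in zip(prow, trow):
            inter += 1 if (p and t) else 0
            union += 1 if (p or t) else 0
    # Two empty masks agree perfectly.
    return 1.0 if union == 0 else inter / union


def mean_jaccard(preds, targets):
    """Average IoU over paired lists of masks."""
    scores = [jaccard(p, t) for p, t in zip(preds, targets)]
    return sum(scores) / len(scores)


def frames_per_second(segment_fn, frames, repeats=3):
    """Best-of-`repeats` throughput of `segment_fn` over `frames`."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        for frame in frames:
            segment_fn(frame)
        best = min(best, time.perf_counter() - start)
    # Guard against a zero elapsed time on very fast calls.
    return len(frames) / max(best, 1e-9)
```

Comparing both numbers for baseline SAM2 and for the audio-enabled model on the same frames would give the accuracy and speed evidence this section asks for.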

Figures

Figures reproduced from arXiv: 2506.01015 by Can Peng, Chong Wang, Gustavo Carneiro, Jingkun Chen, Junde Wu, Junlin Han, Yuanhong Chen, Yu Tian, Yuyuan Liu.

**Figure 1.** Prompt Engineering for Integrating Audio Signals in AVSBench (V1m) [67]. SAM2 (AVS) includes re-implemented adapter-based methods GAVS [58] and SAMA-AVS [33], along with AL-REF [18], which process audio signals to segment sounding objects. To simulate human-in-the-loop scenarios, SAM2 (Ensemble) combines the SAM2 (AVS) results with SAM2 outputs guided by visual prompts generated from ground truth. …

**Figure 2.** Illustration of our approach in a language-aided AVS dataset [59]. Audio WAV and text sentences are processed via VGGish [5] and RoBERTa [36], respectively, and then combined. Visual features are extracted from SAM2 [51] in a pyramid structure and processed through PatchEmbedding in Eq. (1) with varying patch sizes (equivalent to the Lateral Layer when k=3), then merged using Eq. (4). The visual and audio-…

**Figure 3.** Ablation studies on missing modalities in Ref-AVS (Seen subset) [59] using the Hiera-L backbone, evaluating the importance of the audio, language, and visual modalities.

**Figure 4.** Qualitative visualisations on the Ref-AVS [ …

**Figure 5.** ‘The object making a sound by being played by the woman.’ from Ref-AVS (seen) [ …

**Figure 6.** ‘The object producing sound under the manipulation of the individual on the left.’ from Ref-AVS (seen) [ …

**Figure 7.** ‘The object making the longest sound duration.’ from Ref-AVS (unseen) [ …

**Figure 8.** ‘The object that keeps making sound at all times.’ from Ref-AVS (unseen) [ …

**Figure 9.** Case (a) from AVSBench (V1s) [67]. Columns: Frame, Label, GAVS [58], SAMA [33], Ours (b+), Ours (l).

**Figure 10.** Case (b) from AVSBench (V1s) [67].

**Figure 11.** Case (a) from AVSBench (V1m) [67]. Columns: Frame, Label, GAVS [58], SAMA [33], Ours (b+), Ours (l).

**Figure 12.** Case (b) from AVSBench (V1m) [67].

**Figure 13.** Case (a) from AVSBench (V2) [68]. Columns: Frame, Label, GAVS [58], SAMA [33], Ours (b+), Ours (l).

**Figure 14.** Case (b) from AVSBench (V2) [68].

Original abstract

Segment Anything Model 2 (SAM2) exhibits strong generalisation for promptable segmentation in video clips; however, its integration with the audio modality remains underexplored. Existing approaches either convert audio into visual prompts (e.g., boxes) via foundation models, or inject adapters into the image encoder for audio-visual fusion. Yet both directions fall short in human-in-the-loop scenarios due to limited prompt accuracy and increased inference overhead. In particular, these adapter-based methods often suffer from audio prompt dilution, where the signal gradually weakens as it propagates through the network. In this work, we propose AuralSAM2, which integrates audio into SAM2 while largely preserving its promptable segmentation capability. Its core module, AuralFuser, fuses audio and visual features to generate sparse and dense prompts. Guided by audio and built upon SAM2's feature pyramid, these prompts propagate auditory cues across visual layers, reinforcing cross-modal influence. To further align modalities, we introduce an audio-guided contrastive loss that emphasises auditory relevance in dominant visual features. Our method achieves notable accuracy gains on public benchmarks with only minimal impact on the interactive efficiency of promptable segmentation. Our code is available at https://github.com/yyliu01/AuralSAM2.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

AuralSAM2 adds audio to SAM2 via pyramid fusion and contrastive loss but skips direct tests on whether visual-only prompt performance stays intact.

This paper's main move is to bolt audio onto SAM2 with a new AuralFuser module that pulls audio and visual features together in the pyramid to make sparse and dense prompts, plus an audio-guided contrastive loss to keep the modalities aligned. It targets the dilution issue in adapter approaches and the overhead of turning audio into visual prompts first, while claiming the changes keep interactive speed close to the original model and deliver accuracy gains on benchmarks. Code release helps here for anyone wanting to inspect the implementation.

Referee Report

1 major / 1 minor

Summary. The manuscript introduces AuralSAM2, an extension of SAM2 for audio-visual promptable segmentation in video. It proposes the AuralFuser module that fuses audio and visual features to generate sparse and dense prompts propagating auditory cues through SAM2's feature pyramid, together with an audio-guided contrastive loss to align modalities. The central claims are notable accuracy gains on public benchmarks, minimal impact on interactive efficiency, and largely preserved generalization on visual prompts.

Significance. If the accuracy gains hold without degrading visual-prompt performance, the work would meaningfully advance audio integration into promptable video segmentation models by mitigating prompt dilution while retaining SAM2's interactive strengths. The public code release is a positive factor for reproducibility.

major comments (1)

[Experiments] Experiments section: No ablation or side-by-side evaluation is reported comparing AuralSAM2 to unmodified SAM2 on standard visual-only prompt tasks (box/point prompts on SA-V or DAVIS). This directly undermines the claim that the pyramid propagation via AuralFuser and the contrastive loss 'largely preserve' SAM2's original generalization, as even small alterations to visual feature pathways could cause negative transfer.

minor comments (1)

[Abstract] Abstract: The phrase 'notable accuracy gains' is not accompanied by specific metrics, datasets, or baseline comparisons, making the headline claim harder to evaluate at a glance.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive comments and the recommendation for major revision. We address the major comment point by point below.

Referee: [Experiments] Experiments section: No ablation or side-by-side evaluation is reported comparing AuralSAM2 to unmodified SAM2 on standard visual-only prompt tasks (box/point prompts on SA-V or DAVIS). This directly undermines the claim that the pyramid propagation via AuralFuser and the contrastive loss 'largely preserve' SAM2's original generalization, as even small alterations to visual feature pathways could cause negative transfer.

Authors: We acknowledge this point. To strengthen the evidence that our modifications largely preserve SAM2's generalization on visual prompts, we will include in the revised manuscript additional experiments that directly compare AuralSAM2 to the unmodified SAM2 using box and point prompts on the SA-V and DAVIS datasets. These evaluations will be performed in a visual-only setting to demonstrate the absence of negative transfer.
Revision: yes
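The comparison promised in this exchange amounts to a per-benchmark regression check. The sketch below shows one way such a check could be tabulated; the function name, the tolerance, and the "higher is better" score convention are assumptions, and no scores from the paper are used.

```python
def negative_transfer_report(baseline, modified, tolerance=0.5):
    """Flag benchmarks where the modified model falls behind the baseline.

    baseline, modified -- dicts mapping benchmark name to a score where
                          higher is better (e.g. percentage points)
    tolerance          -- drop, in the same units, that is still treated as noise
    """
    report = {}
    for name, base_score in baseline.items():
        delta = modified[name] - base_score
        report[name] = {"delta": delta, "regressed": delta < -tolerance}
    return report
```

A visual-only run of both models on each dataset would supply the two dictionaries; any benchmark flagged as regressed would count against the "largely preserved" claim.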

Circularity Check

0 steps flagged

No circularity: novel AuralFuser and contrastive loss form independent derivation chain

The paper introduces AuralFuser as a new module that fuses audio-visual features to generate sparse/dense prompts on SAM2's pyramid, plus an audio-guided contrastive loss for modality alignment. These are presented as architectural additions rather than quantities derived from or fitted to prior outputs by the same authors. Claims of accuracy gains with preserved promptable segmentation rest on benchmark evaluations and the explicit design choices for cross-modal propagation, without self-definitional reductions, fitted-input predictions, or load-bearing self-citations. The derivation is self-contained against external SAM2 baselines and public datasets.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

The claim depends on the effectiveness of cross-modal fusion in the feature pyramid and the assumption that contrastive alignment improves auditory relevance without side effects; no major free parameters or invented physical entities are detailed beyond the new module itself.

free parameters (1)

contrastive loss weighting factor
Hyperparameter balancing the audio-guided contrastive loss against the main segmentation objective.
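
The role of this weighting factor can be written compactly. The symbols are illustrative assumptions rather than the paper's notation: $\mathcal{L}_{\text{seg}}$ stands for the main segmentation objective, $\mathcal{L}_{\text{con}}$ for the audio-guided contrastive loss, and $\lambda$ for the weighting hyperparameter.

```latex
\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{seg}} + \lambda \, \mathcal{L}_{\text{con}}, \qquad \lambda \ge 0
```

Setting $\lambda = 0$ recovers training on the segmentation objective alone, while larger values give the cross-modal alignment term more influence.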

axioms (1)

domain assumption: SAM2 feature pyramid layers can effectively propagate auditory cues to reinforce cross-modal influence
Invoked in the design of AuralFuser and prompt propagation across visual layers.

invented entities (1)

AuralFuser (no independent evidence)
purpose: Module to fuse audio and visual features into sparse and dense prompts
New component introduced to address audio prompt dilution.

pith-pipeline@v0.9.0 · 5778 in / 1242 out tokens · 40926 ms · 2026-05-19T11:11:23.071926+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Relation between the paper passage and the cited Recognition theorem.
AuralFuser ... fuses audio and visual features to generate sparse and dense prompts ... audio-guided contrastive loss

IndisputableMonolith/Foundation/AlexanderDuality.lean · alexander_duality_circle_linking · unclear
Relation between the paper passage and the cited Recognition theorem.
feature pyramid ... multi-scale feature fusion

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

From Failure to Feedback: Group Revision Unlocks Hard Cases in Object-Level Grounding
cs.CV · 2026-05 · unverdicted · novelty 6.0

A group-revision paradigm for GRPO-based RL fine-tuning of VLMs converts failure responses into improvement signals that refine rewards and advantages, yielding gains on referring segmentation, REC, and counting benchmarks.

Reference graph

Works this paper leans on

72 extracted references · 72 canonical work pages · cited by 1 Pith paper · 8 internal anchors

[1] Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. GPT-4 technical report. arXiv preprint arXiv:2303.08774,
[2] Tadas Baltrušaitis, Chaitanya Ahuja, and Louis-Philippe Morency. Multimodal machine learning: A survey and taxonomy. IEEE transactions on pattern analysis and machine intelligence, 41(2):423–443, 2018.
[3] Khaled Bayoudh, Raja Knani, Fayçal Hamdaoui, and Abdellatif Mtibaa. A survey on deep multimodal learning for computer vision: advances, trends, applications, and datasets. The Visual Computer, 38(8):2939–2970, 2022.
[4] Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerging properties in self-supervised vision transformers. In Proceedings of the IEEE/CVF international conference on computer vision, pages 9650–9660, 2021.
[5] Honglie Chen, Weidi Xie, Andrea Vedaldi, and Andrew Zisserman. Vggsound: A large-scale audio-visual dataset. In ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 721–725. IEEE, 2020.
[6] Honglie Chen, Weidi Xie, Triantafyllos Afouras, Arsha Nagrani, Andrea Vedaldi, and Andrew Zisserman. Localizing visual sounds the hard way. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16867–16876, 2021.
[7] Ke Chen, Xingjian Du, Bilei Zhu, Zejun Ma, Taylor Berg-Kirkpatrick, and Shlomo Dubnov. Zero-shot audio source separation through query-based learning from weakly-labeled data. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 4441–4449, 2022.
[8] Yuanhong Chen, Yuyuan Liu, Hu Wang, Fengbei Liu, Chong Wang, Helen Frazer, and Gustavo Carneiro. Unraveling instance associations: A closer look for audio-visual segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 26497–26507,
[9] Yuanhong Chen, Kazuki Shimada, Christian Simon, Yukara Ikemiya, Takashi Shibuya, and Yuki Mitsufuji. Ccstereo: Audio-visual contextual and contrastive learning for binaural audio generation. arXiv preprint arXiv:2501.02786, 2025.
[10] Yuanhong Chen, Chong Wang, Yuyuan Liu, Hu Wang, and Gustavo Carneiro. Cpm: Class-conditional prompting machine for audio-visual segmentation. In European Conference on Computer Vision, pages 438–456. Springer, 2025.
[11] Ruohan Gao and Kristen Grauman. 2.5D visual sound. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 324–333, 2019.
[12] Ruohan Gao and Kristen Grauman. Co-separating sounds of visual objects. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3879–3888,
[13] Shengyi Gao, Zhe Chen, Guo Chen, Wenhai Wang, and Tong Lu. Avsegformer: Audio-visual segmentation with transformer. arXiv preprint arXiv:2307.01146, 2023.
[14] Dawei Hao, Yuxin Mao, Bowen He, Xiaodong Han, Yuchao Dai, and Yiran Zhong. Improving audio-visual segmentation with bidirectional generation. arXiv preprint arXiv:2308.08288, 2023.
[15] Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross Girshick. Masked autoencoders are scalable vision learners. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 16000–16009, 2022.
[16] Jian Hu, Jiayi Lin, Junchi Yan, and Shaogang Gong. Leveraging hallucinations to reduce manual prompt dependency in promptable segmentation. arXiv preprint arXiv:2408.15205,
[17] Shaofei Huang, Han Li, Yuqing Wang, Hongji Zhu, Jiao Dai, Jizhong Han, Wenge Rong, and Si Liu. Discovering sounding objects by audio queries for audio visual segmentation. arXiv preprint arXiv:2309.09501, 2023.
[18] Shaofei Huang, Rui Ling, Hongyu Li, Tianrui Hui, Zongheng Tang, Xiaoming Wei, Jizhong Han, and Si Liu. Unleashing the temporal-spatial reasoning capacity of gpt for training-free audio and language referenced video object segmentation. arXiv preprint arXiv:2408.15876, 2024.
[19] Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc Le, Yun-Hsuan Sung, Zhen Li, and Tom Duerig. Scaling up visual and vision-language representation learning with noisy text supervision. In International conference on machine learning, pages 4904–4916. PMLR,
[20] Menglin Jia, Luming Tang, Bor-Chun Chen, Claire Cardie, Serge Belongie, Bharath Hariharan, and Ser-Nam Lim. Visual prompt tuning. In European Conference on Computer Vision, pages 709–727. Springer, 2022.
[21] Prannay Khosla, Piotr Teterwak, Chen Wang, Aaron Sarna, Yonglong Tian, Phillip Isola, Aaron Maschinot, Ce Liu, and Dilip Krishnan. Supervised contrastive learning. Advances in neural information processing systems, 33:18661–18673,
[22] Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C Berg, Wan-Yen Lo, et al. Segment anything. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4015–4026, 2023.
[23] Jiaxu Li, Songsong Yu, Yifan Wang, Lijun Wang, and Huchuan Lu. Selm: Selective mechanism based audio-visual segmentation. In Proceedings of the 32nd ACM International Conference on Multimedia, pages 3926–3935, 2024.
[24] Kexin Li, Zongxin Yang, Lei Chen, Yi Yang, and Jun Xun. Catr: Combinatorial-dependence audio-queried transformer for audio-visual video segmentation. arXiv preprint arXiv:2309.09709, 2023.
[25] Xiang Li, Jinglu Wang, Xiaohao Xu, Xiao Li, Bhiksha Raj, and Yan Lu. Robust referring video object segmentation with cyclic structural consensus. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 22236–22245, 2023.
[26] Yunheng Li, ZhongYu Li, Quansheng Zeng, Qibin Hou, and Ming-Ming Cheng. Cascade-clip: Cascaded vision-language embeddings alignment for zero-shot semantic segmentation. arXiv preprint arXiv:2406.00670, 2024.
[27] Tsung-Yi Lin, Piotr Dollár, Ross Girshick, Kaiming He, Bharath Hariharan, and Serge Belongie. Feature pyramid networks for object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2117–2125, 2017.
[28] Yan-Bo Lin, Yi-Lin Sung, Jie Lei, Mohit Bansal, and Gedas Bertasius. Vision transformers are parameter-efficient audio-visual learners. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2299–2309, 2023.
[29] Chen Liu, Peike Li, Hu Zhang, Lincheng Li, Zi Huang, Dadong Wang, and Xin Yu. Bavs: Bootstrapping audio-visual segmentation by integrating foundation knowledge. arXiv preprint arXiv:2308.10175, 2023.
[30] Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 26296–26306, 2024.
[31] Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. Advances in neural information processing systems, 36, 2024.
[32] Hanchao Liu, Wenyuan Xue, Yifei Chen, Dapeng Chen, Xiutian Zhao, Ke Wang, Liping Hou, Rongjun Li, and Wei Peng. A survey on hallucination in large vision-language models. arXiv preprint arXiv:2402.00253, 2024.
[33] Jinxiang Liu, Yu Wang, Chen Ju, Chaofan Ma, Ya Zhang, and Weidi Xie. Annotation-free audio-visual segmentation. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 5604–5614, 2024.
[34] Shilong Liu, Zhaoyang Zeng, Tianhe Ren, Feng Li, Hao Zhang, Jie Yang, Qing Jiang, Chunyuan Li, Jianwei Yang, Hang Su, et al. Grounding dino: Marrying dino with grounded pre-training for open-set object detection. In European Conference on Computer Vision, pages 38–55. Springer, 2024.
[35] Xubo Liu, Qiuqiang Kong, Yan Zhao, Haohe Liu, Yi Yuan, Yuzhuo Liu, Rui Xia, Yuxuan Wang, Mark D Plumbley, and Wenwu Wang. Separate anything you describe. IEEE/ACM Transactions on Audio, Speech, and Language Processing,
[36] Yinhan Liu. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692, 364, 2019.
[37] Yunze Liu, Qingnan Fan, Shanghang Zhang, Hao Dong, Thomas Funkhouser, and Li Yi. Contrastive multimodal fusion with tupleinfonce. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 754–763,
[38] Yuyuan Liu, Choubo Ding, Yu Tian, Guansong Pang, Vasileios Belagiannis, Ian Reid, and Gustavo Carneiro. Residual pattern learning for pixel-wise out-of-distribution detection in semantic segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 1151–1161, 2023.
[39] Yuyuan Liu, Yuanhong Chen, Hu Wang, Vasileios Belagiannis, Ian Reid, and Gustavo Carneiro. Ittakestwo: Leveraging peer representations for semi-supervised lidar semantic segmentation. In European Conference on Computer Vision, pages 81–99. Springer, 2024.
[40] I Loshchilov. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017.
[41] Haoyu Lu, Wen Liu, Bo Zhang, Bingxuan Wang, Kai Dong, Bo Liu, Jingxiang Sun, Tongzheng Ren, Zhuoshu Li, Hao Yang, et al. Deepseek-vl: towards real-world vision-language understanding. arXiv preprint arXiv:2403.05525,
[42] Juncheng Ma, Peiwen Sun, Yaoting Wang, and Di Hu. Stepping stones: A progressive training strategy for audio-visual semantic segmentation. IEEE European Conference on Computer Vision (ECCV), 2024.
[43] Yuxin Mao, Jing Zhang, Mochu Xiang, Yiran Zhong, and Yuchao Dai. Multimodal variational auto-encoder based audio-visual segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 954–965, 2023.
[44] Shentong Mo and Pedro Morgado. A closer look at weakly-supervised audio-visual source localization. arXiv preprint arXiv:2209.09634, 2022.
[45] Shentong Mo and Pedro Morgado. Localizing visual sounds the easy way. In Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXVII, pages 218–234. Springer, 2022.
[46] Shentong Mo and Bhiksha Raj. Weakly-supervised audio-visual segmentation. Advances in Neural Information Processing Systems, 36:17208–17221, 2023.
[47] Shentong Mo and Yapeng Tian. Av-sam: Segment anything model meets audio-visual localization and segmentation. arXiv preprint arXiv:2305.01836, 2023.
[48] Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748, 2018.
[49]

DINOv2: Learning Robust Visual Features without Supervision

Maxime Oquab, Timoth ´ee Darcet, Th ´eo Moutakanni, Huy V o, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. Dinov2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193, 2023. 1, 2

work page internal anchor Pith review Pith/arXiv arXiv 2023
[50]

Learning 10 transferable visual models from natural language supervi- sion

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning 10 transferable visual models from natural language supervi- sion. In International conference on machine learning, pages 8748–8763. PMLR, 2021. 1, 2

work page 2021
[51]

SAM 2: Segment Anything in Images and Videos

Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu, Ronghang Hu, Chaitanya Ryali, Tengyu Ma, Haitham Khedr, Roman R¨adle, Chloe Rolland, Laura Gustafson, et al. Sam 2: Segment anything in images and videos. arXiv preprint arXiv:2408.00714, 2024. 1, 2, 4, 5, 6, 8, 12, 13

work page internal anchor Pith review Pith/arXiv arXiv 2024
[52]

Hi- era: A hierarchical vision transformer without the bells-and- whistles

Chaitanya Ryali, Yuan-Ting Hu, Daniel Bolya, Chen Wei, Haoqi Fan, Po-Yao Huang, Vaibhav Aggarwal, Arkabandhu Chowdhury, Omid Poursaeed, Judy Hoffman, et al. Hi- era: A hierarchical vision transformer without the bells-and- whistles. In International Conference on Machine Learning, pages 29441–29454. PMLR, 2023. 3, 4, 7

[53]

Extending segment anything model into auditory and temporal dimensions for audio-visual segmentation

Juhyeong Seon, Woobin Im, Sebin Lee, Jumin Lee, and Sung-Eui Yoon. Extending segment anything model into auditory and temporal dimensions for audio-visual segmentation. arXiv preprint arXiv:2406.06163, 2024. 2, 3

[54]

Long-tail learning with foundation model: Heavy fine-tuning hurts

Jiang-Xin Shi, Tong Wei, Zhi Zhou, Jie-Jing Shao, Xin-Yan Han, and Yu-Feng Li. Long-tail learning with foundation model: Heavy fine-tuning hurts. arXiv preprint arXiv:2309.10019, 2023. 2

[55]

Bioclip: A vision foundation model for the tree of life

Samuel Stevens, Jiaman Wu, Matthew J Thompson, Elizabeth G Campolongo, Chan Hee Song, David Edward Carlyn, Li Dong, Wasila M Dahdul, Charles Stewart, Tanya Berger-Wolf, et al. Bioclip: A vision foundation model for the tree of life. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19412–19424,

[56]

Exploring cross-image pixel contrast for semantic segmentation

Wenguan Wang, Tianfei Zhou, Fisher Yu, Jifeng Dai, Ender Konukoglu, and Luc Van Gool. Exploring cross-image pixel contrast for semantic segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 7303–7313, 2021. 2, 3, 6, 12

[57]

Pvt v2: Improved baselines with pyramid vision transformer

Wenhai Wang, Enze Xie, Xiang Li, Deng-Ping Fan, Kaitao Song, Ding Liang, Tong Lu, Ping Luo, and Ling Shao. Pvt v2: Improved baselines with pyramid vision transformer. Computational Visual Media, 8(3):415–424, 2022. 12

[58]

Prompting segmentation with sound is generalizable audio-visual source localizer

Yaoting Wang, Weisong Liu, Guangyao Li, Jian Ding, Di Hu, and Xi Li. Prompting segmentation with sound is generalizable audio-visual source localizer. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 5669–5677, 2024. 1, 2, 3, 6, 7, 8, 12, 13, 14, 15, 16, 17, 18

[59]

Ref-avs: Refer and segment objects in audio-visual scenes

Yaoting Wang, Peiwen Sun, Dongzhan Zhou, Guangyao Li, Honggang Zhang, and Di Hu. Ref-avs: Refer and segment objects in audio-visual scenes. In European Conference on Computer Vision, pages 196–213. Springer, 2025. 1, 3, 4, 5, 6, 7, 8, 12, 13, 14, 15

[60]

Language as queries for referring video object segmentation

Jiannan Wu, Yi Jiang, Peize Sun, Zehuan Yuan, and Ping Luo. Language as queries for referring video object segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4974–4984, 2022. 6

[61]

Multimodal learning with transformers: A survey

Peng Xu, Xiatian Zhu, and David A Clifton. Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(10):12113–12132, 2023. 1

[62]

Visually informed binaural audio generation without binaural audios

Xudong Xu, Hang Zhou, Ziwei Liu, Bo Dai, Xiaogang Wang, and Dahua Lin. Visually informed binaural audio generation without binaural audios. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15485–15494, 2021. 3

[63]

Cooperation does matter: Exploring multi-order bilateral relations for audio-visual segmentation, 2023

Qi Yang, Xing Nie, Tong Li, Pengfei Gao, Ying Guo, Cheng Zhen, Pengfei Yan, and Shiming Xiang. Cooperation does matter: Exploring multi-order bilateral relations for audio-visual segmentation, 2023. 2, 6

[64]

How can contrastive pre-training benefit audio-visual segmentation? a study from supervised and zero-shot perspectives

Jiarui Yu, Haoran Li, Yanbin Hao, Jinmeng Wu, Tong Xu, Shuo Wang, and Xiangnan He. How can contrastive pre-training benefit audio-visual segmentation? a study from supervised and zero-shot perspectives. In BMVC, pages 367–374, 2023. 2, 3, 6

[65]

Multimodal contrastive training for visual representation learning

Xin Yuan, Zhe Lin, Jason Kuen, Jianming Zhang, Yilin Wang, Michael Maire, Ajinkya Kale, and Baldo Faieta. Multimodal contrastive training for visual representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6995–7004,

[66]

Pyramid scene parsing network

Hengshuang Zhao, Jianping Shi, Xiaojuan Qi, Xiaogang Wang, and Jiaya Jia. Pyramid scene parsing network. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2881–2890, 2017. 5

[67]

Audio–visual segmentation

Jinxing Zhou, Jianyuan Wang, Jiayi Zhang, Weixuan Sun, Jing Zhang, Stan Birchfield, Dan Guo, Lingpeng Kong, Meng Wang, and Yiran Zhong. Audio–visual segmentation. In Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXVII, pages 386–403. Springer, 2022. 1, 2, 3, 5, 6, 7, 8, 12, 13, 16, 17

[68]

Audio-visual segmentation with semantics

Jinxing Zhou, Xuyang Shen, Jianyuan Wang, Jiayi Zhang, Weixuan Sun, Jing Zhang, Stan Birchfield, Dan Guo, Lingpeng Kong, Meng Wang, et al. Audio-visual segmentation with semantics. arXiv preprint arXiv:2301.13190, 2023. 1, 3, 6, 7, 12, 13, 18

[69]

Deep audio-visual learning: A survey

Hao Zhu, Man-Di Luo, Rui Wang, Ai-Hua Zheng, and Ran He. Deep audio-visual learning: A survey. International Journal of Automation and Computing, 18(3):351–376, 2021. 3

AuralSAM2: Enabling SAM2 Hear Through Pyramid Audio-Visual Feature Prompting (Supplementary Material)

[70]

More Implementation Details. 6.1. Hyper-parameter Configuration. Our method is based on SAM2 [51], using the Hiera base+ and Hiera large backbones within the PyTorch framework, both of which remain frozen during training. We employ a batch size of one, where each batch consists of 5 frames for the V1s and V1m subsets in AVSBench [67], and 10 frames fo...

[71]

In the first four rows, we report the visual prompt results for SAM2, including four uniformly generated points and boxes derived from the ground truth mask

Prompting Engineering. We provide additional details on prompt engineering based on the Hiera base+ backbone in AVSBench (V1m) [67], as shown in Tab. 7. In the first four rows, we report the visual prompt results for SAM2, including four uniformly generated points and boxes derived from the ground truth mask. Since pixel-level labeled masks are challengin...

[72]

Visualisations. In this section, we present qualitative visualization results comparing our method with other adapter-based approaches, GAVS [58] and SAMA-AVS [33]. Specifically, Figures 5 and 6 illustrate the outputs in multimodal scenarios involving audio, language, and visual modalities within the Ref-AVS (seen) [59] subset, while Figures 7 and ...


