AuralSAM2: Enabling SAM2 Hear Through Pyramid Audio-Visual Feature Prompting
Pith reviewed 2026-05-19 11:11 UTC · model grok-4.3
The pith
AuralSAM2 adds audio to SAM2 by propagating fused audio-visual prompts through the model's feature pyramid and by training with an audio-guided contrastive loss.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
AuralSAM2 integrates audio into SAM2 while largely preserving its promptable segmentation capability. Its core module, AuralFuser, fuses audio and visual features to generate sparse and dense prompts. Guided by audio and built upon SAM2's feature pyramid, these prompts propagate auditory cues across visual layers, reinforcing cross-modal influence. To further align modalities, an audio-guided contrastive loss emphasises auditory relevance in dominant visual features.
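One way to picture the alignment step is an InfoNCE-style objective (in the sense of contrastive predictive coding, reference [48]) with the audio embedding as the anchor. The sketch below is an illustration under stated assumptions, not the paper's loss: the function names, the temperature value, and the split of visual features into "positives" (assumed to lie on the sounding object) and "negatives" (assumed background) are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-12)


def audio_guided_info_nce(audio, positives, negatives, temperature=0.1):
    """InfoNCE-style loss with the audio embedding as anchor (illustrative only).

    positives: visual feature vectors assumed to belong to the sounding object.
    negatives: visual feature vectors assumed to be silent background.
    """
    neg_logits = [cosine(audio, n) / temperature for n in negatives]
    losses = []
    for p in positives:
        pos_logit = cosine(audio, p) / temperature
        logits = [pos_logit] + neg_logits
        # Numerically stable log-sum-exp over the positive and all negatives.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(x - m) for x in logits))
        losses.append(log_denom - pos_logit)
    return sum(losses) / len(losses)


audio = [1.0, 0.0]
aligned = audio_guided_info_nce(audio, [[0.9, 0.1]], [[0.0, 1.0], [-1.0, 0.0]])
misaligned = audio_guided_info_nce(audio, [[0.0, 1.0]], [[0.9, 0.1], [-1.0, 0.0]])
print(aligned < misaligned)
```

The loss is small when the features marked as positive point in the same direction as the audio embedding and large when a background feature matches the audio better, which is the behaviour the "emphasises auditory relevance" wording suggests.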
What carries the argument
AuralFuser, which fuses audio and visual features on top of SAM2's feature pyramid to produce audio-guided sparse and dense prompts that propagate cross-modal cues through the network layers.
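As a rough mental model of "fusing audio into each pyramid level to obtain a sparse and a dense prompt", the toy sketch below applies one attention step per level. It is a sketch under heavy assumptions and not the AuralFuser architecture: `fuse_level`, `fuse_pyramid`, the `gate` parameter, and the choice of scaled dot-product attention are all hypothetical, and the audio vector is assumed to be already projected to the same width as the visual tokens.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def fuse_level(audio, tokens, gate=0.5):
    """Fuse one pyramid level with an audio vector (illustrative only).

    Returns (sparse_prompt, dense_prompt):
      sparse_prompt: one audio-attended summary vector of the visual tokens.
      dense_prompt:  one vector per token, with audio added in proportion
                     to that token's attention weight.
    """
    dim = len(audio)
    scale = 1.0 / math.sqrt(dim)
    attn = softmax([dot(audio, t) * scale for t in tokens])
    sparse = [sum(w * t[d] for w, t in zip(attn, tokens)) for d in range(dim)]
    dense = [[t[d] + gate * w * audio[d] for d in range(dim)]
             for w, t in zip(attn, tokens)]
    return sparse, dense


def fuse_pyramid(audio, pyramid, gate=0.5):
    """Apply the same fusion at every level so the audio cue reaches all scales."""
    return [fuse_level(audio, tokens, gate) for tokens in pyramid]


# Two hypothetical pyramid levels with 2-dimensional tokens.
pyramid = [[[1.0, 0.0], [0.0, 1.0]], [[1.0, 1.0]]]
for sparse, dense in fuse_pyramid([1.0, 0.0], pyramid):
    print(sparse, dense)
```

Applying the fusion independently at every level, rather than once at the input, is one simple way to avoid the audio signal having to survive a long chain of visual layers, which is the dilution problem the abstract attributes to adapter-based methods.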
If this is right
- Notable accuracy gains on public audio-visual segmentation benchmarks.
- Only minimal impact on the interactive efficiency of promptable segmentation.
- Reduced audio prompt dilution compared to earlier adapter-based fusion methods.
- Preserved ability to use visual prompts alone without performance loss.
Where Pith is reading between the lines
- The approach could be tested on longer video sequences where audio cues might help maintain object identity across cuts or occlusions.
- Similar pyramid fusion might improve other promptable models when adding sound or other non-visual signals.
- It opens a path to segmentation systems that switch between visual, audio, or combined prompts depending on which modality is clearest in a given frame.
Load-bearing premise
The pyramid propagation of auditory cues and the audio-guided contrastive loss will reinforce cross-modal influence without causing audio prompt dilution or harming SAM2's original generalization on visual prompts.
What would settle it
Running the method on a standard audio-visual segmentation benchmark and finding either no measurable accuracy gain over baseline SAM2 or a clear drop in frames-per-second during interactive prompting.
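The check described above can be sketched as two measurements and one decision rule. The code below is a minimal sketch, assuming mean IoU as the accuracy metric and flat 0/1 lists as masks; the helper names and the thresholds `min_gain` and `max_fps_drop` are hypothetical choices, not values taken from the paper.

```python
import time


def iou(pred, gt):
    """Intersection over union of two flat binary masks."""
    inter = sum(1 for p, g in zip(pred, gt) if p and g)
    union = sum(1 for p, g in zip(pred, gt) if p or g)
    return inter / union if union else 1.0


def mean_iou(preds, gts):
    return sum(iou(p, g) for p, g in zip(preds, gts)) / len(gts)


def frames_per_second(segment_fn, frames, repeats=3):
    """Best-of-N throughput of a per-frame segmentation callable."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        for frame in frames:
            segment_fn(frame)
        best = min(best, time.perf_counter() - start)
    return len(frames) / best if best > 0 else float("inf")


def settles_against(base_miou, new_miou, base_fps, new_fps,
                    min_gain=0.0, max_fps_drop=0.1):
    """True if the result would count against the paper's claim.

    min_gain and max_fps_drop are hypothetical thresholds.
    """
    no_gain = (new_miou - base_miou) <= min_gain
    too_slow = new_fps < (1.0 - max_fps_drop) * base_fps
    return no_gain or too_slow


print(iou([1, 1, 0, 0], [1, 0, 0, 0]))
print(settles_against(0.70, 0.75, 30.0, 29.0))
```

In practice the two `segment_fn` callables would wrap baseline SAM2 and AuralSAM2 on the same frames and hardware, so that any difference in throughput is attributable to the added fusion path.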
Original abstract
Segment Anything Model 2 (SAM2) exhibits strong generalisation for promptable segmentation in video clips; however, its integration with the audio modality remains underexplored. Existing approaches either convert audio into visual prompts (e.g., boxes) via foundation models, or inject adapters into the image encoder for audio-visual fusion. Yet both directions fall short in human-in-the-loop scenarios due to limited prompt accuracy and increased inference overhead. In particular, these adapter-based methods often suffer from audio prompt dilution, where the signal gradually weakens as it propagates through the network. In this work, we propose AuralSAM2, which integrates audio into SAM2 while largely preserving its promptable segmentation capability. Its core module, AuralFuser, fuses audio and visual features to generate sparse and dense prompts. Guided by audio and built upon SAM2's feature pyramid, these prompts propagate auditory cues across visual layers, reinforcing cross-modal influence. To further align modalities, we introduce an audio-guided contrastive loss that emphasises auditory relevance in dominant visual features. Our method achieves notable accuracy gains on public benchmarks with only minimal impact on the interactive efficiency of promptable segmentation. Our code is available at https://github.com/yyliu01/AuralSAM2.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces AuralSAM2, an extension of SAM2 for audio-visual promptable segmentation in video. It proposes the AuralFuser module that fuses audio and visual features to generate sparse and dense prompts propagating auditory cues through SAM2's feature pyramid, together with an audio-guided contrastive loss to align modalities. The central claims are notable accuracy gains on public benchmarks, minimal impact on interactive efficiency, and largely preserved generalization on visual prompts.
Significance. If the accuracy gains hold without degrading visual-prompt performance, the work would meaningfully advance audio integration into promptable video segmentation models by mitigating prompt dilution while retaining SAM2's interactive strengths. The public code release is a positive factor for reproducibility.
major comments (1)
- [Experiments] Experiments section: No ablation or side-by-side evaluation is reported comparing AuralSAM2 to unmodified SAM2 on standard visual-only prompt tasks (box/point prompts on SA-V or DAVIS). This directly undermines the claim that the pyramid propagation via AuralFuser and the contrastive loss 'largely preserve' SAM2's original generalization, as even small alterations to visual feature pathways could cause negative transfer.
minor comments (1)
- [Abstract] Abstract: The phrase 'notable accuracy gains' is not accompanied by specific metrics, datasets, or baseline comparisons, making the headline claim harder to evaluate at a glance.
Simulated Author's Rebuttal
We thank the referee for the constructive comments and the recommendation for major revision. We address the major comment point by point below.
- Referee: [Experiments] Experiments section: No ablation or side-by-side evaluation is reported comparing AuralSAM2 to unmodified SAM2 on standard visual-only prompt tasks (box/point prompts on SA-V or DAVIS). This directly undermines the claim that the pyramid propagation via AuralFuser and the contrastive loss 'largely preserve' SAM2's original generalization, as even small alterations to visual feature pathways could cause negative transfer.
  Authors: We acknowledge this point. To strengthen the evidence that our modifications largely preserve SAM2's generalization on visual prompts, we will include in the revised manuscript additional experiments that directly compare AuralSAM2 to the unmodified SAM2 using box and point prompts on the SA-V and DAVIS datasets. These evaluations will be performed in a visual-only setting to demonstrate the absence of negative transfer.
  Revision: yes
Circularity Check
No circularity: novel AuralFuser and contrastive loss form independent derivation chain
The paper introduces AuralFuser as a new module that fuses audio-visual features to generate sparse/dense prompts on SAM2's pyramid, plus an audio-guided contrastive loss for modality alignment. These are presented as architectural additions rather than quantities derived from or fitted to prior outputs by the same authors. Claims of accuracy gains with preserved promptable segmentation rest on benchmark evaluations and the explicit design choices for cross-modal propagation, without self-definitional reductions, fitted-input predictions, or load-bearing self-citations. The derivation is self-contained against external SAM2 baselines and public datasets.
Axiom & Free-Parameter Ledger
free parameters (1)
- contrastive loss weighting factor
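A weighting factor of this kind usually enters as a scalar on the auxiliary term of the training objective. As a sketch only, with all symbols assumed for illustration (the paper's notation is not reproduced on this page): here \mathcal{L}_{\text{seg}} stands for the segmentation loss, \mathcal{L}_{\text{con}} for the audio-guided contrastive loss, and \lambda for the free parameter listed above.

```latex
\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{seg}} + \lambda \, \mathcal{L}_{\text{con}}, \qquad \lambda \ge 0
```

Setting \lambda = 0 would recover training with the segmentation loss alone, which is why the value chosen for \lambda counts as a fitted quantity in this ledger.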
axioms (1)
- domain assumption: SAM2 feature pyramid layers can effectively propagate auditory cues to reinforce cross-modal influence
invented entities (1)
- AuralFuser: no independent evidence
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel (relation between the paper passage and the cited Recognition theorem: unclear)
  Paper passage: AuralFuser ... fuses audio and visual features to generate sparse and dense prompts ... audio-guided contrastive loss
- IndisputableMonolith/Foundation/AlexanderDuality.lean, alexander_duality_circle_linking (relation between the paper passage and the cited Recognition theorem: unclear)
  Paper passage: feature pyramid ... multi-scale feature fusion
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 1 Pith paper
- From Failure to Feedback: Group Revision Unlocks Hard Cases in Object-Level Grounding. A group-revision paradigm for GRPO-based RL fine-tuning of VLMs converts failure responses into improvement signals that refine rewards and advantages, yielding gains on referring segmentation, REC, and counting benchmarks.
Reference graph
Works this paper leans on
- [1] Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774.
- [2] Tadas Baltrušaitis, Chaitanya Ahuja, and Louis-Philippe Morency. Multimodal machine learning: A survey and taxonomy. IEEE transactions on pattern analysis and machine intelligence, 41(2):423–443, 2018.
- [3] Khaled Bayoudh, Raja Knani, Fayçal Hamdaoui, and Abdellatif Mtibaa. A survey on deep multimodal learning for computer vision: advances, trends, applications, and datasets. The Visual Computer, 38(8):2939–2970, 2022.
- [4] Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerging properties in self-supervised vision transformers. In Proceedings of the IEEE/CVF international conference on computer vision, pages 9650–9660, 2021.
- [5] Honglie Chen, Weidi Xie, Andrea Vedaldi, and Andrew Zisserman. Vggsound: A large-scale audio-visual dataset. In ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 721–725. IEEE, 2020.
- [6] Honglie Chen, Weidi Xie, Triantafyllos Afouras, Arsha Nagrani, Andrea Vedaldi, and Andrew Zisserman. Localizing visual sounds the hard way. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16867–16876, 2021.
- [7] Ke Chen, Xingjian Du, Bilei Zhu, Zejun Ma, Taylor Berg-Kirkpatrick, and Shlomo Dubnov. Zero-shot audio source separation through query-based learning from weakly-labeled data. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 4441–4449, 2022.
- [8] Yuanhong Chen, Yuyuan Liu, Hu Wang, Fengbei Liu, Chong Wang, Helen Frazer, and Gustavo Carneiro. Unraveling instance associations: A closer look for audio-visual segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 26497–26507.
- [9] Yuanhong Chen, Kazuki Shimada, Christian Simon, Yukara Ikemiya, Takashi Shibuya, and Yuki Mitsufuji. Ccstereo: Audio-visual contextual and contrastive learning for binaural audio generation. arXiv preprint arXiv:2501.02786, 2025.
- [10] Yuanhong Chen, Chong Wang, Yuyuan Liu, Hu Wang, and Gustavo Carneiro. Cpm: Class-conditional prompting machine for audio-visual segmentation. In European Conference on Computer Vision, pages 438–456. Springer, 2025.
- [11] Ruohan Gao and Kristen Grauman. 2.5D visual sound. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 324–333, 2019.
- [12] Ruohan Gao and Kristen Grauman. Co-separating sounds of visual objects. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3879–3888.
- [13] Shengyi Gao, Zhe Chen, Guo Chen, Wenhai Wang, and Tong Lu. Avsegformer: Audio-visual segmentation with transformer. arXiv preprint arXiv:2307.01146, 2023.
- [14] Dawei Hao, Yuxin Mao, Bowen He, Xiaodong Han, Yuchao Dai, and Yiran Zhong. Improving audio-visual segmentation with bidirectional generation. arXiv preprint arXiv:2308.08288, 2023.
- [15] Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross Girshick. Masked autoencoders are scalable vision learners. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 16000–16009, 2022.
- [16] Jian Hu, Jiayi Lin, Junchi Yan, and Shaogang Gong. Leveraging hallucinations to reduce manual prompt dependency in promptable segmentation. arXiv preprint arXiv:2408.15205.
- [17] Shaofei Huang, Han Li, Yuqing Wang, Hongji Zhu, Jiao Dai, Jizhong Han, Wenge Rong, and Si Liu. Discovering sounding objects by audio queries for audio visual segmentation. arXiv preprint arXiv:2309.09501, 2023.
- [18] Shaofei Huang, Rui Ling, Hongyu Li, Tianrui Hui, Zongheng Tang, Xiaoming Wei, Jizhong Han, and Si Liu. Unleashing the temporal-spatial reasoning capacity of gpt for training-free audio and language referenced video object segmentation. arXiv preprint arXiv:2408.15876, 2024.
- [19] Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc Le, Yun-Hsuan Sung, Zhen Li, and Tom Duerig. Scaling up visual and vision-language representation learning with noisy text supervision. In International conference on machine learning, pages 4904–4916. PMLR.
- [20] Menglin Jia, Luming Tang, Bor-Chun Chen, Claire Cardie, Serge Belongie, Bharath Hariharan, and Ser-Nam Lim. Visual prompt tuning. In European Conference on Computer Vision, pages 709–727. Springer, 2022.
- [21] Prannay Khosla, Piotr Teterwak, Chen Wang, Aaron Sarna, Yonglong Tian, Phillip Isola, Aaron Maschinot, Ce Liu, and Dilip Krishnan. Supervised contrastive learning. Advances in neural information processing systems, 33:18661–18673.
- [22] Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C Berg, Wan-Yen Lo, et al. Segment anything. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4015–4026, 2023.
- [23] Jiaxu Li, Songsong Yu, Yifan Wang, Lijun Wang, and Huchuan Lu. Selm: Selective mechanism based audio-visual segmentation. In Proceedings of the 32nd ACM International Conference on Multimedia, pages 3926–3935, 2024.
- [24] Kexin Li, Zongxin Yang, Lei Chen, Yi Yang, and Jun Xun. Catr: Combinatorial-dependence audio-queried transformer for audio-visual video segmentation. arXiv preprint arXiv:2309.09709, 2023.
- [25] Xiang Li, Jinglu Wang, Xiaohao Xu, Xiao Li, Bhiksha Raj, and Yan Lu. Robust referring video object segmentation with cyclic structural consensus. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 22236–22245, 2023.
- [26] Yunheng Li, ZhongYu Li, Quansheng Zeng, Qibin Hou, and Ming-Ming Cheng. Cascade-clip: Cascaded vision-language embeddings alignment for zero-shot semantic segmentation. arXiv preprint arXiv:2406.00670, 2024.
- [27] Tsung-Yi Lin, Piotr Dollár, Ross Girshick, Kaiming He, Bharath Hariharan, and Serge Belongie. Feature pyramid networks for object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2117–2125, 2017.
- [28] Yan-Bo Lin, Yi-Lin Sung, Jie Lei, Mohit Bansal, and Gedas Bertasius. Vision transformers are parameter-efficient audio-visual learners. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2299–2309, 2023.
- [29] Chen Liu, Peike Li, Hu Zhang, Lincheng Li, Zi Huang, Dadong Wang, and Xin Yu. Bavs: Bootstrapping audio-visual segmentation by integrating foundation knowledge. arXiv preprint arXiv:2308.10175, 2023.
- [30] Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 26296–26306, 2024.
- [31] Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. Advances in neural information processing systems, 36, 2024.
- [32] Hanchao Liu, Wenyuan Xue, Yifei Chen, Dapeng Chen, Xiutian Zhao, Ke Wang, Liping Hou, Rongjun Li, and Wei Peng. A survey on hallucination in large vision-language models. arXiv preprint arXiv:2402.00253, 2024.
- [33] Jinxiang Liu, Yu Wang, Chen Ju, Chaofan Ma, Ya Zhang, and Weidi Xie. Annotation-free audio-visual segmentation. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 5604–5614, 2024.
- [34] Shilong Liu, Zhaoyang Zeng, Tianhe Ren, Feng Li, Hao Zhang, Jie Yang, Qing Jiang, Chunyuan Li, Jianwei Yang, Hang Su, et al. Grounding dino: Marrying dino with grounded pre-training for open-set object detection. In European Conference on Computer Vision, pages 38–55. Springer, 2024.
- [35] Xubo Liu, Qiuqiang Kong, Yan Zhao, Haohe Liu, Yi Yuan, Yuzhuo Liu, Rui Xia, Yuxuan Wang, Mark D Plumbley, and Wenwu Wang. Separate anything you describe. IEEE/ACM Transactions on Audio, Speech, and Language Processing.
- [36] Yinhan Liu. Roberta: A robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692, 364, 2019.
- [37] Yunze Liu, Qingnan Fan, Shanghang Zhang, Hao Dong, Thomas Funkhouser, and Li Yi. Contrastive multimodal fusion with tupleinfonce. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 754–763.
- [38] Yuyuan Liu, Choubo Ding, Yu Tian, Guansong Pang, Vasileios Belagiannis, Ian Reid, and Gustavo Carneiro. Residual pattern learning for pixel-wise out-of-distribution detection in semantic segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 1151–1161, 2023.
- [39] Yuyuan Liu, Yuanhong Chen, Hu Wang, Vasileios Belagiannis, Ian Reid, and Gustavo Carneiro. Ittakestwo: Leveraging peer representations for semi-supervised lidar semantic segmentation. In European Conference on Computer Vision, pages 81–99. Springer, 2024.
- [40] I Loshchilov. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017.
- [41] Haoyu Lu, Wen Liu, Bo Zhang, Bingxuan Wang, Kai Dong, Bo Liu, Jingxiang Sun, Tongzheng Ren, Zhuoshu Li, Hao Yang, et al. Deepseek-vl: towards real-world vision-language understanding. arXiv preprint arXiv:2403.05525.
- [42] Juncheng Ma, Peiwen Sun, Yaoting Wang, and Di Hu. Stepping stones: A progressive training strategy for audio-visual semantic segmentation. IEEE European Conference on Computer Vision (ECCV), 2024.
- [43] Yuxin Mao, Jing Zhang, Mochu Xiang, Yiran Zhong, and Yuchao Dai. Multimodal variational auto-encoder based audio-visual segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 954–965, 2023.
- [44] Shentong Mo and Pedro Morgado. A closer look at weakly-supervised audio-visual source localization. arXiv preprint arXiv:2209.09634, 2022.
- [45] Shentong Mo and Pedro Morgado. Localizing visual sounds the easy way. In Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXVII, pages 218–234. Springer, 2022.
- [46] Shentong Mo and Bhiksha Raj. Weakly-supervised audio-visual segmentation. Advances in Neural Information Processing Systems, 36:17208–17221, 2023.
- [47] Shentong Mo and Yapeng Tian. Av-sam: Segment anything model meets audio-visual localization and segmentation. arXiv preprint arXiv:2305.01836, 2023.
- [48] Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748, 2018.
- [49] Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. Dinov2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193, 2023.
- [50] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021.
- [51] Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu, Ronghang Hu, Chaitanya Ryali, Tengyu Ma, Haitham Khedr, Roman Rädle, Chloe Rolland, Laura Gustafson, et al. Sam 2: Segment anything in images and videos. arXiv preprint arXiv:2408.00714, 2024.
- [52] Chaitanya Ryali, Yuan-Ting Hu, Daniel Bolya, Chen Wei, Haoqi Fan, Po-Yao Huang, Vaibhav Aggarwal, Arkabandhu Chowdhury, Omid Poursaeed, Judy Hoffman, et al. Hiera: A hierarchical vision transformer without the bells-and-whistles. In International Conference on Machine Learning, pages 29441–29454. PMLR, 2023.
- [53] Juhyeong Seon, Woobin Im, Sebin Lee, Jumin Lee, and Sung-Eui Yoon. Extending segment anything model into auditory and temporal dimensions for audio-visual segmentation. arXiv preprint arXiv:2406.06163, 2024.
- [54] Jiang-Xin Shi, Tong Wei, Zhi Zhou, Jie-Jing Shao, Xin-Yan Han, and Yu-Feng Li. Long-tail learning with foundation model: Heavy fine-tuning hurts. arXiv preprint arXiv:2309.10019, 2023.
- [55] Samuel Stevens, Jiaman Wu, Matthew J Thompson, Elizabeth G Campolongo, Chan Hee Song, David Edward Carlyn, Li Dong, Wasila M Dahdul, Charles Stewart, Tanya Berger-Wolf, et al. Bioclip: A vision foundation model for the tree of life. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19412–19424.
- [56] Wenguan Wang, Tianfei Zhou, Fisher Yu, Jifeng Dai, Ender Konukoglu, and Luc Van Gool. Exploring cross-image pixel contrast for semantic segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 7303–7313, 2021.
- [57] Wenhai Wang, Enze Xie, Xiang Li, Deng-Ping Fan, Kaitao Song, Ding Liang, Tong Lu, Ping Luo, and Ling Shao. Pvt v2: Improved baselines with pyramid vision transformer. Computational Visual Media, 8(3):415–424, 2022.
- [58] Yaoting Wang, Weisong Liu, Guangyao Li, Jian Ding, Di Hu, and Xi Li. Prompting segmentation with sound is generalizable audio-visual source localizer. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 5669–5677, 2024.
- [59] Yaoting Wang, Peiwen Sun, Dongzhan Zhou, Guangyao Li, Honggang Zhang, and Di Hu. Ref-avs: Refer and segment objects in audio-visual scenes. In European Conference on Computer Vision, pages 196–213. Springer, 2025.
- [60] Jiannan Wu, Yi Jiang, Peize Sun, Zehuan Yuan, and Ping Luo. Language as queries for referring video object segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4974–4984, 2022.
- [61] Peng Xu, Xiatian Zhu, and David A Clifton. Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(10):12113–12132, 2023.
- [62] Xudong Xu, Hang Zhou, Ziwei Liu, Bo Dai, Xiaogang Wang, and Dahua Lin. Visually informed binaural audio generation without binaural audios. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15485–15494, 2021.
- [63] Qi Yang, Xing Nie, Tong Li, Pengfei Gao, Ying Guo, Cheng Zhen, Pengfei Yan, and Shiming Xiang. Cooperation does matter: Exploring multi-order bilateral relations for audio-visual segmentation, 2023.
- [64] Jiarui Yu, Haoran Li, Yanbin Hao, Jinmeng Wu, Tong Xu, Shuo Wang, and Xiangnan He. How can contrastive pre-training benefit audio-visual segmentation? a study from supervised and zero-shot perspectives. In BMVC, pages 367–374, 2023.
- [65] Xin Yuan, Zhe Lin, Jason Kuen, Jianming Zhang, Yilin Wang, Michael Maire, Ajinkya Kale, and Baldo Faieta. Multimodal contrastive training for visual representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6995–7004.
- [66] Hengshuang Zhao, Jianping Shi, Xiaojuan Qi, Xiaogang Wang, and Jiaya Jia. Pyramid scene parsing network. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2881–2890, 2017.
- [67] Jinxing Zhou, Jianyuan Wang, Jiayi Zhang, Weixuan Sun, Jing Zhang, Stan Birchfield, Dan Guo, Lingpeng Kong, Meng Wang, and Yiran Zhong. Audio–visual segmentation. In Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXVII, pages 386–403. Springer, 2022.
- [68] Jinxing Zhou, Xuyang Shen, Jianyuan Wang, Jiayi Zhang, Weixuan Sun, Jing Zhang, Stan Birchfield, Dan Guo, Lingpeng Kong, Meng Wang, et al. Audio-visual segmentation with semantics. arXiv preprint arXiv:2301.13190, 2023.
- [69] Hao Zhu, Man-Di Luo, Rui Wang, Ai-Hua Zheng, and Ran He. Deep audio-visual learning: A survey. International Journal of Automation and Computing, 18(3):351–376, 2021.
AuralSAM2: Enabling SAM2 Hear Through Pyramid Audio-Visual Feature Prompting (Supplementary Material)
- [70] More Implementation Details. 6.1. Hyper-parameter Configuration. Our method is based on SAM2 [51], utilizing the Hiera base+ and Hiera large backbones within the PyTorch framework, both of which remain frozen during training. We employ a batch size of one, where each batch consists of 5 frames for the V1s and V1m subsets in AVSBench [67], and 10 frames fo...
- [71] Prompting Engineering. We provide additional details on prompt engineering based on the Hiera base+ backbone in AVSBench (V1m) [67], as shown in Tab. 7. In the first four rows, we report the visual prompt results for SAM2, including four uniformly generated points and boxes derived from the ground truth mask. Since pixel-level labeled masks are challengin...
- [72] Visualisations. In this section, we present qualitative visualization results comparing our method with other adapter-based approaches, GAVS [58] and SAMA-AVS [33]. Specifically, Figures 5 and 6 illustrate the outputs in multimodal scenarios involving audio, language, and visual modalities within the Ref-AVS (seen) [59] subset, while Figures 7 and ...