SALMONN: Towards Generic Hearing Abilities for Large Language Models
Pith reviewed 2026-05-18 02:24 UTC · model grok-4.3
The pith
SALMONN integrates pre-trained speech and audio encoders with a large language model to enable direct processing and understanding of general audio inputs.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
SALMONN integrates a pre-trained text-based large language model with speech and audio encoders into a single multimodal model. The model directly processes general audio inputs consisting of speech, audio events, and music. It reaches competitive performance on trained tasks such as automatic speech recognition and translation, auditory question answering, emotion recognition, speaker verification, and music and audio captioning. It also exhibits emergent abilities unseen during training, such as speech translation to untrained languages, speech-based slot filling, spoken-query question answering, audio-based storytelling, and speech-audio co-reasoning. A novel few-shot activation tuning approach is proposed to activate such abilities.
What carries the argument
The architecture that feeds features from pre-trained speech and audio encoders into the LLM, together with few-shot activation tuning to bring out cross-modal emergent abilities.
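The data flow described above can be illustrated with a toy sketch. It assumes a Q-Former-style connector, reduced here to a single cross-attention step in which a fixed set of learned queries pools over the encoder frames, and it assumes the two encoder streams are time-aligned and fused by per-frame concatenation. All names (`connect`, `cross_attend`, `project`) and all sizes are hypothetical, and random numbers stand in for real encoder outputs and trained weights. This is not the authors' implementation.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, frames):
    """Single-head dot-product attention: every query pools over all frames."""
    dim = len(frames[0])
    pooled = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, f)) / math.sqrt(dim) for f in frames]
        weights = softmax(scores)
        pooled.append([sum(w * f[d] for w, f in zip(weights, frames)) for d in range(dim)])
    return pooled


def project(vec, weight):
    """Linear map; `weight` is a list of len(vec) rows, each of length out_dim."""
    out_dim = len(weight[0])
    return [sum(v * row[j] for v, row in zip(vec, weight)) for j in range(out_dim)]


def connect(speech_frames, audio_frames, queries, weight):
    """Fuse two encoder streams and emit a fixed number of LLM-sized vectors."""
    # Assumption: streams are time-aligned and fused by per-frame concatenation.
    fused = [s + a for s, a in zip(speech_frames, audio_frames)]
    return [project(p, weight) for p in cross_attend(queries, fused)]


def random_matrix(rng, rows, cols):
    """Random stand-in for encoder outputs or trained parameters."""
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
T, SPEECH_DIM, AUDIO_DIM, NUM_QUERIES, LLM_DIM = 6, 4, 3, 2, 5  # toy sizes
speech = random_matrix(rng, T, SPEECH_DIM)  # stand-in for speech-encoder frames
audio = random_matrix(rng, T, AUDIO_DIM)    # stand-in for audio-encoder frames
queries = random_matrix(rng, NUM_QUERIES, SPEECH_DIM + AUDIO_DIM)
weight = random_matrix(rng, SPEECH_DIM + AUDIO_DIM, LLM_DIM)

audio_tokens = connect(speech, audio, queries, weight)
print(len(audio_tokens), len(audio_tokens[0]))  # number of tokens, token width
```

The point of the sketch is the shape bookkeeping: however many frames the encoders emit, the connector hands the LLM a fixed number of vectors in the LLM's embedding width, which can then sit alongside the text-prompt embeddings.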
If this is right
- A single model can replace multiple task-specific audio systems for speech recognition, translation, emotion detection, and captioning.
- Emergent abilities such as translation to untrained languages and audio storytelling appear without explicit training for those tasks.
- Few-shot activation tuning can bring out cross-modal abilities that were not present after standard training.
- The model supports combined speech and audio inputs for co-reasoning tasks.
- General auditory inputs become usable for question answering and slot filling without additional modules.
Where Pith is reading between the lines
- The same encoder-integration pattern could be applied to add video or other sensory streams to create models with combined perceptual abilities.
- If the emergent abilities prove robust, future audio AI work may need far less task-specific labeled data once rich encoder features are available.
- Real deployments might allow AI systems to react to background sounds and environmental audio alongside spoken commands.
- Controlled tests with overlapping sounds or added noise would check whether the reported generic abilities survive outside clean training conditions.
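The added-noise test in the last bullet could be set up roughly as follows. This is a minimal sketch, not a procedure from the paper: `mix_at_snr` and `measured_snr_db` are hypothetical helper names, the clean signal is a synthetic sine tone, and the noise is seeded Gaussian noise. In a real test the clean input would be an evaluation utterance and the noise a recorded background sound.

```python
import math
import random


def mean_power(samples):
    """Average power of a list of samples."""
    return sum(v * v for v in samples) / len(samples)


def mix_at_snr(signal, noise, snr_db):
    """Scale `noise` so that signal power / noise power matches `snr_db`, then add."""
    target_noise_power = mean_power(signal) / (10 ** (snr_db / 10))
    scale = math.sqrt(target_noise_power / mean_power(noise))
    return [s + scale * n for s, n in zip(signal, noise)]


def measured_snr_db(signal, mixture):
    """Recover the achieved SNR from the clean signal and the mixture."""
    residual = [m - s for m, s in zip(mixture, signal)]
    return 10 * math.log10(mean_power(signal) / mean_power(residual))


rng = random.Random(0)
SAMPLE_RATE = 16000  # assumed sample rate for the toy signal
clean = [math.sin(2 * math.pi * 440 * n / SAMPLE_RATE) for n in range(1600)]
noise = [rng.gauss(0.0, 1.0) for _ in range(1600)]

# Sweep a few noise levels; each `noisy` list would be fed to the model under test.
for snr_db in (20, 10, 0):
    noisy = mix_at_snr(clean, noise, snr_db)
    print(snr_db, round(measured_snr_db(clean, noisy), 3))
```

Sweeping the SNR downward and re-scoring the same tasks at each level would show how quickly the reported abilities degrade outside clean conditions.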
Load-bearing premise
Features from the pre-trained speech and audio encoders, once integrated with the LLM through the chosen architecture and training, produce genuine understanding of general auditory information rather than task-specific pattern matching.
What would settle it
If the model continues to succeed on the trained audio tasks but fails on a new type of auditory reasoning task that has no overlap with the training distribution, such as identifying and describing an unfamiliar environmental sound event in context, the claim of generic hearing abilities would not hold.
Original abstract
Hearing is arguably an essential ability of artificial intelligence (AI) agents in the physical world, which refers to the perception and understanding of general auditory information consisting of at least three types of sounds: speech, audio events, and music. In this paper, we propose SALMONN, a speech audio language music open neural network, built by integrating a pre-trained text-based large language model (LLM) with speech and audio encoders into a single multimodal model. SALMONN enables the LLM to directly process and understand general audio inputs and achieve competitive performances on a number of speech and audio tasks used in training, such as automatic speech recognition and translation, auditory-information-based question answering, emotion recognition, speaker verification, and music and audio captioning etc. SALMONN also has a diverse set of emergent abilities unseen in the training, which includes but is not limited to speech translation to untrained languages, speech-based slot filling, spoken-query-based question answering, audio-based storytelling, and speech audio co-reasoning etc. The presence of cross-modal emergent abilities is studied, and a novel few-shot activation tuning approach is proposed to activate such abilities. To our knowledge, SALMONN is the first model of its type and can be regarded as a step towards AI with generic hearing abilities. The source code, model checkpoints and data are available at https://github.com/bytedance/SALMONN.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces SALMONN, a multimodal model that integrates pre-trained Whisper speech and BEATs audio encoders with a text-based LLM via a Q-Former connector. It uses multi-task instruction tuning on a mixture of speech and audio tasks followed by few-shot activation tuning to achieve competitive results on in-distribution tasks including ASR, speech translation, auditory QA, emotion recognition, speaker verification, and music/audio captioning, while also exhibiting emergent cross-modal abilities such as translation to untrained languages, speech-based slot filling, spoken-query QA, audio storytelling, and speech-audio co-reasoning.
Significance. If the quantitative results and emergent-ability demonstrations hold under rigorous scrutiny, the work provides a concrete step toward generic auditory understanding in LLMs, extending their utility to real-world audio inputs beyond text. The open release of code, model checkpoints, and training data is a clear strength that supports reproducibility and follow-on research. The proposed activation-tuning method offers a practical technique for eliciting cross-modal generalization that may generalize to other multimodal settings.
major comments (2)
- [Section 5] Section 5 (Experiments): the claim of 'competitive performances' on training tasks is not accompanied by per-task numerical tables comparing against strong baselines (e.g., Whisper-only, AudioLM, or prior multimodal LLMs); without these numbers and statistical significance tests, it is difficult to judge whether the observed gains are attributable to the architecture or to the scale of the training mixture.
- [Section 4.3] Section 4.3 (Few-shot Activation Tuning): the procedure for selecting the few-shot examples and the precise mechanism by which they 'activate' emergent abilities is described at a high level; an ablation removing the activation-tuning stage or varying the number of shots would be required to establish that this step is load-bearing for the reported emergent behaviors rather than an artifact of prompt engineering.
minor comments (2)
- [Figure 2] Figure 2 (model diagram): the flow from audio encoder outputs through the Q-Former to the LLM token embeddings is visually clear but lacks explicit dimension annotations on the connector layers, which would aid readers in reproducing the exact architecture.
- [Section 3.2] Section 3.2: the description of the multi-task data mixture would benefit from an explicit table listing the proportion or number of examples per task type to clarify how the training distribution was balanced.
Simulated Author's Rebuttal
We thank the referee for the positive assessment, the recommendation of minor revision, and the constructive comments on Sections 5 and 4.3. We address each point below and will update the manuscript accordingly to strengthen the presentation of results and the activation-tuning procedure.
Point-by-point responses
- Referee: [Section 5] Section 5 (Experiments): the claim of 'competitive performances' on training tasks is not accompanied by per-task numerical tables comparing against strong baselines (e.g., Whisper-only, AudioLM, or prior multimodal LLMs); without these numbers and statistical significance tests, it is difficult to judge whether the observed gains are attributable to the architecture or to the scale of the training mixture.
  Authors: We thank the referee for this observation. The current manuscript reports competitive results on the training tasks but presents them primarily through figures and selected comparisons rather than exhaustive per-task tables. We agree that adding systematic numerical tables against strong baselines (Whisper, AudioLM, and prior multimodal LLMs) together with statistical significance tests would make the claims more rigorous and help isolate the contribution of the proposed architecture from training data scale. In the revised version we will include such tables for ASR, speech translation, auditory QA, emotion recognition, speaker verification, and music/audio captioning, reporting means, standard deviations, and p-values or confidence intervals where appropriate.
  Revision: yes
- Referee: [Section 4.3] Section 4.3 (Few-shot Activation Tuning): the procedure for selecting the few-shot examples and the precise mechanism by which they 'activate' emergent abilities is described at a high level; an ablation removing the activation-tuning stage or varying the number of shots would be required to establish that this step is load-bearing for the reported emergent behaviors rather than an artifact of prompt engineering.
  Authors: We agree that the description of few-shot example selection and the activation mechanism is currently high-level and that ablations are needed to demonstrate the contribution of this stage. In the revision we will expand Section 4.3 with a more detailed account of how the few-shot examples are chosen (including criteria and examples) and the hypothesized activation process. We will also add the requested ablations: (i) performance with the activation-tuning stage removed entirely, and (ii) performance when varying the number of shots (0, 1, 5, and 10). These experiments will help confirm that the reported emergent abilities depend on the activation-tuning step rather than prompt engineering alone.
  Revision: yes
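The reporting promised in these responses (a sweep over 0, 1, 5, and 10 shots, with means, standard deviations, and confidence intervals) could be organized along the following lines. This is a sketch under stated assumptions: `run_ablation`, `summarize`, and `placeholder_evaluate` are hypothetical names, the seeds are arbitrary, and the placeholder returns dummy values that depend only on the seed so that it implies nothing about actual results. A real harness would replace it with a function that trains and scores the model.

```python
import math
import statistics


def summarize(scores, z=1.96):
    """Mean, sample standard deviation, and a normal-approximation 95% interval."""
    mean = statistics.fmean(scores)
    std = statistics.stdev(scores) if len(scores) > 1 else 0.0
    half_width = z * std / math.sqrt(len(scores))
    return {"mean": mean, "std": std, "ci95": (mean - half_width, mean + half_width)}


def run_ablation(evaluate, shot_counts=(0, 1, 5, 10), seeds=(0, 1, 2)):
    """Score every shot count over several seeds and summarize each setting."""
    return {k: summarize([evaluate(k, seed) for seed in seeds]) for k in shot_counts}


def placeholder_evaluate(num_shots, seed):
    """Stand-in for a real train-and-score run; NOT a result from the paper."""
    # The dummy value ignores num_shots on purpose so it suggests no trend.
    return seed / 10


results = run_ablation(placeholder_evaluate)
for shots, stats in results.items():
    print(shots, stats)
```

With only a handful of seeds, the normal approximation (`z=1.96`) understates the interval width; a Student t quantile or a bootstrap would be a more careful choice for the final tables.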
Circularity Check
No significant circularity; empirical model construction
full rationale
The paper presents SALMONN as an engineering integration of pre-trained Whisper and BEATs encoders with an LLM via Q-Former and instruction tuning, followed by empirical evaluation on speech/audio tasks and observation of emergent behaviors. No mathematical derivation, first-principles equations, or claimed predictions are offered that could reduce to fitted inputs or self-citations by construction. All results follow from standard training and testing procedures on external datasets; the manuscript remains self-contained without invoking load-bearing self-referential steps of the enumerated kinds.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Pre-trained speech and audio encoders capture features sufficient for general auditory understanding when connected to an LLM.
Forward citations
Cited by 17 Pith papers
- Character Beyond Speech: Leveraging Role-Playing Evaluation in Audio Large Language Models via Reinforcement Learning
  RoleJudge is a multidimensional evaluation framework for speech-character alignment in audio LLMs, backed by the RoleChat dataset and multi-stage RL training with standard alignment to reduce reward issues.
- The World is Not Mono: Enabling Spatial Understanding in Large Audio-Language Models
  TWNM framework equips audio-language models with spatial scene analysis via FOA simulation and metadata-grounded training, reaching 70.8% accuracy on a new ASA benchmark.
- M³KG-RAG: Multi-hop Multimodal Knowledge Graph-enhanced Retrieval-Augmented Generation
  M³KG-RAG improves multimodal reasoning in large language models by constructing multi-hop knowledge graphs and selectively pruning retrieved context with GRASP.
- Protecting Bystander Privacy via Selective Hearing in Audio LLMs
  Audio LLMs leak bystander speech; SH-Bench benchmark and BPFT fine-tuning raise selective accuracy by 47% and selective efficacy by 16% over Gemini 2.5 Pro.
- WorldSense: Evaluating Real-world Omnimodal Understanding for Multimodal LLMs
  WorldSense provides the first benchmark requiring synergistic audio-video-text understanding on 1,662 real-world videos and 3,172 QA pairs, where the best current multimodal LLM reaches only 65.1% accuracy.
- VoiceBench: Benchmarking LLM-Based Voice Assistants
  VoiceBench is the first benchmark for multi-faceted evaluation of LLM voice assistants using real and synthetic spoken instructions with speaker, environmental, and content variations.
- HeadRouter: Dynamic Head-Weight Routing for Task-Adaptive Audio Token Pruning in Large Audio Language Models
  HeadRouter prunes audio tokens more effectively by dynamically routing based on per-head importance for semantic versus acoustic tasks, exceeding baseline performance at 70% token retention on Qwen2.5-Omni models.
- Noise-Aware In-Context Learning for Hallucination Mitigation in ALLMs
  NAICL reduces hallucination rates in ALLMs from 26.53% to 16.98% via noise priors in context and introduces the Clotho-1K benchmark with four hallucination types.
- QoS-QoE Translation with Large Language Model
  A new QoS-QoE Translation dataset is constructed from multimedia literature and fine-tuned LLMs demonstrate strong performance on bidirectional continuous and discrete QoS-QoE predictions.
- MCAT: Scaling Many-to-Many Speech-to-Text Translation with MLLMs to 70 Languages
  MCAT scales MLLMs to many-to-many speech translation across 70 languages via curriculum learning and a 30-token speech adapter, surpassing prior SOTA on FLEURS while improving speed.
- Step-Audio 2 Technical Report
  Step-Audio 2 integrates a latent audio encoder, reasoning-centric reinforcement learning, and discrete audio token generation into language modeling to deliver state-of-the-art performance on audio understanding and c...
- Spatial-MLLM: Boosting MLLM Capabilities in Visual-based Spatial Intelligence
  Spatial-MLLM boosts MLLM spatial intelligence from 2D inputs via dual encoders initialized from geometry models plus space-aware sampling, claiming state-of-the-art results.
- Audio-Cogito: Towards Deep Audio Reasoning in Large Audio Language Models
  Audio-Cogito is an open-source LALM using Cogito-pipe data curation and self-distillation to achieve leading open-source performance on audio reasoning benchmarks.
- Interactive ASR: Towards Human-Like Interaction and Semantic Coherence Evaluation for Agentic Speech Recognition
  The authors introduce LLM-based semantic judgment and an agentic interaction loop that improves semantic fidelity and enables iterative corrections in automatic speech recognition beyond traditional WER.
- Direct Simultaneous Translation Activation for Large Audio-Language Models
  Augmenting standard offline training data with only 1% randomly truncated simultaneous examples activates real-time translation output in large audio-language models with no architecture or decoding changes.
- Kimi-Audio Technical Report
  Kimi-Audio is an open-source audio foundation model that achieves state-of-the-art results on speech recognition, audio understanding, question answering, and conversation after pre-training on more than 13 million ho...
- Robust Audio-Text Retrieval via Cross-Modal Attention and Hybrid Loss
  A cross-modal attention refinement module plus hybrid loss improves robustness of audio-text retrieval on noisy and long-form audio.