arXiv: 2310.13289 · v2 · pith:ZPTH6N7Znew · submitted 2023-10-20 · 💻 cs.SD · cs.CL · eess.AS

SALMONN: Towards Generic Hearing Abilities for Large Language Models

Changli Tang, Wenyi Yu, Guangzhi Sun, Xianzhao Chen, Tian Tan, Wei Li, Lu Lu, Zejun Ma, Chao Zhang

Pith reviewed 2026-05-18 02:24 UTC · model grok-4.3

classification 💻 cs.SD · cs.CL · eess.AS

keywords: SALMONN, multimodal LLM, audio understanding, speech processing, emergent abilities, music captioning, hearing abilities, audio encoders

The pith

SALMONN integrates pre-trained speech and audio encoders with a large language model to enable direct processing and understanding of general audio inputs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper builds SALMONN by connecting speech and audio encoders to a text-based LLM in one multimodal model. This setup lets the model handle trained tasks such as automatic speech recognition, emotion recognition, speaker verification, and music captioning at competitive levels. It also produces emergent abilities on tasks never seen in training, including translation to untrained languages, spoken-query question answering, and audio-based storytelling. A reader would care because the work targets the basic ability of AI agents to perceive and reason about sounds in the physical world rather than relying on separate specialized tools for each audio type.

Core claim

SALMONN integrates a pre-trained text-based large language model with speech and audio encoders into a single multimodal model. The model directly processes general audio inputs consisting of speech, audio events, and music. It reaches competitive performance on trained tasks such as automatic speech recognition and translation, auditory question answering, emotion recognition, speaker verification, and music and audio captioning. It also exhibits emergent abilities unseen during training, such as speech translation to untrained languages, speech-based slot filling, spoken-query question answering, audio-based storytelling, and speech-audio co-reasoning. A novel few-shot activation tuning approach is proposed to activate such abilities.

What carries the argument

The architecture that feeds features from pre-trained speech and audio encoders into the LLM, together with few-shot activation tuning to bring out cross-modal emergent abilities.
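
The data flow named here can be illustrated with a toy sketch. It is not the paper's implementation: it assumes that frame-aligned speech and audio features are concatenated per frame and that a small set of learned query vectors attends over the fused frames, a single-head cross-attention standing in for the Q-Former connector that the editorial notes on this page mention. Key/value projections and the projection into the LLM embedding space are omitted, and all names, sizes, and random weights (`fuse_frames`, `cross_attend`, `T`, `N_QUERIES`) are hypothetical.

```python
import math
import random

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def fuse_frames(speech_feats, audio_feats):
    """Concatenate the two encoders' feature vectors frame by frame."""
    if len(speech_feats) != len(audio_feats):
        raise ValueError("encoders must emit the same number of frames")
    return [s + a for s, a in zip(speech_feats, audio_feats)]

def cross_attend(queries, frames):
    """Single-head cross-attention: each query yields a weighted mean of the frames."""
    dim = len(frames[0])
    scale = math.sqrt(dim)
    outputs = []
    for q in queries:
        weights = softmax([dot(q, f) / scale for f in frames])
        outputs.append([sum(w * f[d] for w, f in zip(weights, frames)) for d in range(dim)])
    return outputs

# Hypothetical toy sizes and random stand-ins for encoder outputs and learned queries.
rng = random.Random(0)
T, D_SPEECH, D_AUDIO, N_QUERIES = 6, 4, 3, 2
speech = [[rng.gauss(0, 1) for _ in range(D_SPEECH)] for _ in range(T)]
audio = [[rng.gauss(0, 1) for _ in range(D_AUDIO)] for _ in range(T)]
queries = [[rng.gauss(0, 1) for _ in range(D_SPEECH + D_AUDIO)] for _ in range(N_QUERIES)]

frames = fuse_frames(speech, audio)
llm_inputs = cross_attend(queries, frames)
print(len(llm_inputs), len(llm_inputs[0]))  # 2 7
```

The point of the sketch is the shape change: a variable number of audio frames is reduced to a fixed number of vectors that a text LLM could consume alongside its token embeddings.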

If this is right

A single model can replace multiple task-specific audio systems for speech recognition, translation, emotion detection, and captioning.
Emergent abilities such as translation to untrained languages and audio storytelling appear without explicit training for those tasks.
Few-shot activation tuning can bring out cross-modal abilities that were not present after standard training.
The model supports combined speech and audio inputs for co-reasoning tasks.
General auditory inputs become usable for question answering and slot filling without additional modules.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same encoder-integration pattern could be applied to add video or other sensory streams to create models with combined perceptual abilities.
If the emergent abilities prove robust, future audio AI work may need far less task-specific labeled data once rich encoder features are available.
Real deployments might allow AI systems to react to background sounds and environmental audio alongside spoken commands.
Controlled tests with overlapping sounds or added noise would check whether the reported generic abilities survive outside clean training conditions.
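
The added-noise test suggested in the last item can be made concrete with a short sketch that mixes noise into a clean signal at a chosen signal-to-noise ratio. It relies only on the standard definition SNR_dB = 10 log10(P_clean / P_noise); the sine tone, the Gaussian noise, and the function names are synthetic stand-ins chosen for illustration, not material from the paper.

```python
import math
import random

def power(signal):
    """Mean squared amplitude of a sequence of samples."""
    return sum(x * x for x in signal) / len(signal)

def mix_at_snr(clean, noise, snr_db):
    """Scale `noise` so the clean-to-noise power ratio equals `snr_db`, then add it."""
    if len(clean) != len(noise):
        raise ValueError("clean and noise must have the same length")
    noise_power = power(noise)
    if noise_power == 0:
        raise ValueError("noise must not be silent")
    # SNR_dB = 10 * log10(P_clean / P_noise)  =>  P_noise = P_clean / 10^(SNR_dB / 10)
    target_noise_power = power(clean) / (10 ** (snr_db / 10))
    gain = math.sqrt(target_noise_power / noise_power)
    scaled_noise = [gain * n for n in noise]
    mixture = [c + n for c, n in zip(clean, scaled_noise)]
    return mixture, scaled_noise

# Synthetic stand-ins: a 440 Hz tone as "clean" audio and Gaussian noise.
SAMPLE_RATE = 16000
rng = random.Random(0)
clean = [math.sin(2 * math.pi * 440 * t / SAMPLE_RATE) for t in range(1600)]
noise = [rng.gauss(0, 1) for _ in range(1600)]

mixture, scaled_noise = mix_at_snr(clean, noise, snr_db=5.0)
achieved = 10 * math.log10(power(clean) / power(scaled_noise))
print(round(achieved, 6))  # 5.0
```

Sweeping `snr_db` over a range of values and re-running an evaluation on each mixture would give a robustness curve rather than a single clean-condition score.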

Load-bearing premise

Features from the pre-trained speech and audio encoders, once integrated with the LLM through the chosen architecture and training, produce genuine understanding of general auditory information rather than task-specific pattern matching.

What would settle it

If the model continues to succeed on the trained audio tasks but fails on a new type of auditory reasoning task that has no overlap with the training distribution, such as identifying and describing an unfamiliar environmental sound event in context, the claim of generic hearing abilities would not hold.

Original abstract

Hearing is arguably an essential ability of artificial intelligence (AI) agents in the physical world, which refers to the perception and understanding of general auditory information consisting of at least three types of sounds: speech, audio events, and music. In this paper, we propose SALMONN, a speech audio language music open neural network, built by integrating a pre-trained text-based large language model (LLM) with speech and audio encoders into a single multimodal model. SALMONN enables the LLM to directly process and understand general audio inputs and achieve competitive performances on a number of speech and audio tasks used in training, such as automatic speech recognition and translation, auditory-information-based question answering, emotion recognition, speaker verification, and music and audio captioning etc. SALMONN also has a diverse set of emergent abilities unseen in the training, which includes but is not limited to speech translation to untrained languages, speech-based slot filling, spoken-query-based question answering, audio-based storytelling, and speech audio co-reasoning etc. The presence of cross-modal emergent abilities is studied, and a novel few-shot activation tuning approach is proposed to activate such abilities. To our knowledge, SALMONN is the first model of its type and can be regarded as a step towards AI with generic hearing abilities. The source code, model checkpoints and data are available at https://github.com/bytedance/SALMONN.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

SALMONN shows a workable way to add general audio handling to an LLM using off-the-shelf encoders plus a few-shot tuning step that surfaces some cross-modal behaviors.

Dear colleague,

The main point with this SALMONN paper is that it shows you can connect speech and general audio encoders to a large language model and get it to handle a range of audio inputs, including some abilities that pop up without being directly trained for. They use Whisper for speech and BEATs for audio events, connect them through a Q-Former to the LLM, and do multi-task instruction tuning on a bunch of speech and audio tasks. Then they add this few-shot activation tuning to bring out the emergent stuff.

The new part is really the overall setup for generic hearing and that tuning trick to activate cross-modal behaviors like speech translation to untrained languages or audio-based storytelling. It does well by reporting competitive numbers on the trained tasks such as ASR, translation, emotion recognition, speaker verification, and captioning. Releasing the code, models, and data is a plus for reproducibility. The architecture is straightforward and the results line up with what you'd expect from combining those pieces.

The softer part is that the emergent abilities are illustrated with selected examples rather than thorough quantitative tests across many cases. It would be good to see more on how consistent those are or comparisons to simpler baselines. Since the encoders are pre-trained and the connector is a standard Q-Former, the core advance is in the training mixture and the activation method rather than inventing new modules.

This paper is aimed at researchers working on multimodal extensions of LLMs, particularly those interested in audio and speech integration. A reader looking for practical ways to add hearing to language models would get value from the implementation details and the observed behaviors. It deserves peer review because it presents a concrete, open model that advances the idea of general audio understanding in LLMs, even if some claims need more backing in the evaluation.

Cheers,

Referee Report

2 major / 2 minor

Summary. The paper introduces SALMONN, a multimodal model that integrates pre-trained Whisper speech and BEATs audio encoders with a text-based LLM via a Q-Former connector. It uses multi-task instruction tuning on a mixture of speech and audio tasks followed by few-shot activation tuning to achieve competitive results on in-distribution tasks including ASR, speech translation, auditory QA, emotion recognition, speaker verification, and music/audio captioning, while also exhibiting emergent cross-modal abilities such as translation to untrained languages, speech-based slot filling, spoken-query QA, audio storytelling, and speech-audio co-reasoning.

Significance. If the quantitative results and emergent-ability demonstrations hold under rigorous scrutiny, the work provides a concrete step toward generic auditory understanding in LLMs, extending their utility to real-world audio inputs beyond text. The open release of code, model checkpoints, and training data is a clear strength that supports reproducibility and follow-on research. The proposed activation-tuning method offers a practical technique for eliciting cross-modal generalization that may generalize to other multimodal settings.

major comments (2)

[Section 5] Section 5 (Experiments): the claim of 'competitive performances' on training tasks is not accompanied by per-task numerical tables comparing against strong baselines (e.g., Whisper-only, AudioLM, or prior multimodal LLMs); without these numbers and statistical significance tests, it is difficult to judge whether the observed gains are attributable to the architecture or to the scale of the training mixture.
[Section 4.3] Section 4.3 (Few-shot Activation Tuning): the procedure for selecting the few-shot examples and the precise mechanism by which they 'activate' emergent abilities is described at a high level; an ablation removing the activation-tuning stage or varying the number of shots would be required to establish that this step is load-bearing for the reported emergent behaviors rather than an artifact of prompt engineering.
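
The statistical significance testing requested in the first major comment could take the form of a paired bootstrap over per-utterance scores. The sketch below is one common choice, not a procedure taken from the paper: it resamples utterances with replacement and counts how often one system's mean score is strictly lower than the other's. The function name and the synthetic scores are hypothetical, and for corpus-level metrics such as WER a real test would recompute the metric from resampled error and word counts rather than averaging per-utterance values.

```python
import random

def paired_bootstrap(scores_a, scores_b, n_resamples=2000, seed=0):
    """Fraction of resamples in which system A's mean score is strictly lower than B's.

    Scores are paired per utterance and lower is better (e.g. an error rate).
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("scores must be paired per utterance")
    rng = random.Random(seed)
    n = len(scores_a)
    wins = 0
    for _ in range(n_resamples):
        # Draw the same utterance indices for both systems to keep the pairing.
        idx = [rng.randrange(n) for _ in range(n)]
        mean_a = sum(scores_a[i] for i in idx) / n
        mean_b = sum(scores_b[i] for i in idx) / n
        if mean_a < mean_b:
            wins += 1
    return wins / n_resamples

# Synthetic per-utterance error rates, for illustration only.
rng = random.Random(1)
baseline = [rng.uniform(0.05, 0.30) for _ in range(50)]
candidate = [b - 0.02 for b in baseline]  # uniformly better by construction

print(paired_bootstrap(candidate, baseline))  # 1.0
print(paired_bootstrap(baseline, baseline))   # 0.0
```

The two printed cases are deliberately degenerate; with real, noisy scores the win rate falls between 0 and 1, and a value near 0.5 indicates no reliable difference.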

minor comments (2)

[Figure 2] Figure 2 (model diagram): the flow from audio encoder outputs through the Q-Former to the LLM token embeddings is visually clear but lacks explicit dimension annotations on the connector layers, which would aid readers in reproducing the exact architecture.
[Section 3.2] Section 3.2: the description of the multi-task data mixture would benefit from an explicit table listing the proportion or number of examples per task type to clarify how the training distribution was balanced.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the positive assessment, the recommendation of minor revision, and the constructive comments on Sections 5 and 4.3. We address each point below and will update the manuscript accordingly to strengthen the presentation of results and the activation-tuning procedure.

Referee: [Section 5] Section 5 (Experiments): the claim of 'competitive performances' on training tasks is not accompanied by per-task numerical tables comparing against strong baselines (e.g., Whisper-only, AudioLM, or prior multimodal LLMs); without these numbers and statistical significance tests, it is difficult to judge whether the observed gains are attributable to the architecture or to the scale of the training mixture.

Authors: We thank the referee for this observation. The current manuscript reports competitive results on the training tasks but presents them primarily through figures and selected comparisons rather than exhaustive per-task tables. We agree that adding systematic numerical tables against strong baselines (Whisper, AudioLM, and prior multimodal LLMs) together with statistical significance tests would make the claims more rigorous and help isolate the contribution of the proposed architecture from training data scale. In the revised version we will include such tables for ASR, speech translation, auditory QA, emotion recognition, speaker verification, and music/audio captioning, reporting means, standard deviations, and p-values or confidence intervals where appropriate.
revision: yes

Referee: [Section 4.3] Section 4.3 (Few-shot Activation Tuning): the procedure for selecting the few-shot examples and the precise mechanism by which they 'activate' emergent abilities is described at a high level; an ablation removing the activation-tuning stage or varying the number of shots would be required to establish that this step is load-bearing for the reported emergent behaviors rather than an artifact of prompt engineering.

Authors: We agree that the description of few-shot example selection and the activation mechanism is currently high-level and that ablations are needed to demonstrate the contribution of this stage. In the revision we will expand Section 4.3 with a more detailed account of how the few-shot examples are chosen (including criteria and examples) and the hypothesized activation process. We will also add the requested ablations: (i) performance with the activation-tuning stage removed entirely, and (ii) performance when varying the number of shots (0, 1, 5, and 10). These experiments will help confirm that the reported emergent abilities depend on the activation-tuning step rather than prompt engineering alone.
revision: yes
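
The ablation plan in this response can be laid out as an explicit run list. The sketch below only enumerates conditions and trains nothing. It assumes that "shots" counts the examples used in activation tuning, so the "removed entirely" condition carries no shot setting; whether the 0-shot condition coincides with it is left open, as in the response. The task names come from the emergent abilities listed on this page, and the dictionary keys are hypothetical.

```python
SHOT_COUNTS = (0, 1, 5, 10)  # values listed in the authors' response
TASKS = (
    "speech translation to untrained languages",
    "audio-based storytelling",
    "speech-audio co-reasoning",
)

def ablation_conditions():
    """One condition without activation tuning, plus one per shot count."""
    conditions = [{"activation_tuning": False, "shots": None}]
    conditions += [{"activation_tuning": True, "shots": k} for k in SHOT_COUNTS]
    return conditions

def ablation_runs(tasks):
    """Cross every task with every ablation condition."""
    return [dict(task=task, **cond) for task in tasks for cond in ablation_conditions()]

runs = ablation_runs(TASKS)
print(len(runs))  # 15
```

Writing the grid down this way makes it easy to check that every emergent task is evaluated under every condition before any results are compared.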

Circularity Check

0 steps flagged

No significant circularity; empirical model construction

The paper presents SALMONN as an engineering integration of pre-trained Whisper and BEATs encoders with an LLM via Q-Former and instruction tuning, followed by empirical evaluation on speech/audio tasks and observation of emergent behaviors. No mathematical derivation, first-principles equations, or claimed predictions are offered that could reduce to fitted inputs or self-citations by construction. All results follow from standard training and testing procedures on external datasets; the manuscript remains self-contained without invoking load-bearing self-referential steps of the enumerated kinds.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the assumption that pre-trained encoders supply usable auditory features and that joint training plus few-shot tuning produces genuine cross-modal understanding and emergence rather than superficial correlations.

axioms (1)

domain assumption: Pre-trained speech and audio encoders capture features sufficient for general auditory understanding when connected to an LLM.
Invoked when claiming the model can directly process and understand general audio inputs.

Forward citations

Cited by 16 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Character Beyond Speech: Leveraging Role-Playing Evaluation in Audio Large Language Models via Reinforcement Learning
cs.LG 2026-04 unverdicted novelty 7.0
RoleJudge is a multidimensional evaluation framework for speech-character alignment in audio LLMs, backed by the RoleChat dataset and multi-stage RL training with standard alignment to reduce reward issues.

The World is Not Mono: Enabling Spatial Understanding in Large Audio-Language Models
cs.SD 2026-01 unverdicted novelty 7.0
TWNM framework equips audio-language models with spatial scene analysis via FOA simulation and metadata-grounded training, reaching 70.8% accuracy on a new ASA benchmark.

M$^3$KG-RAG: Multi-hop Multimodal Knowledge Graph-enhanced Retrieval-Augmented Generation
cs.CL 2025-12 unverdicted novelty 7.0
M³KG-RAG improves multimodal reasoning in large language models by constructing multi-hop knowledge graphs and selectively pruning retrieved context with GRASP.

Protecting Bystander Privacy via Selective Hearing in Audio LLMs
cs.SD 2025-12 conditional novelty 7.0
Audio LLMs leak bystander speech; SH-Bench benchmark and BPFT fine-tuning raise selective accuracy by 47% and selective efficacy by 16% over Gemini 2.5 Pro.

WorldSense: Evaluating Real-world Omnimodal Understanding for Multimodal LLMs
cs.CV 2025-02 unverdicted novelty 7.0
WorldSense provides the first benchmark requiring synergistic audio-video-text understanding on 1,662 real-world videos and 3,172 QA pairs, where the best current multimodal LLM reaches only 65.1% accuracy.

VoiceBench: Benchmarking LLM-Based Voice Assistants
cs.CL 2024-10 unverdicted novelty 7.0
VoiceBench is the first benchmark for multi-faceted evaluation of LLM voice assistants using real and synthetic spoken instructions with speaker, environmental, and content variations.

HeadRouter: Dynamic Head-Weight Routing for Task-Adaptive Audio Token Pruning in Large Audio Language Models
cs.SD 2026-04 unverdicted novelty 6.0
HeadRouter prunes audio tokens more effectively by dynamically routing based on per-head importance for semantic versus acoustic tasks, exceeding baseline performance at 70% token retention on Qwen2.5-Omni models.

Noise-Aware In-Context Learning for Hallucination Mitigation in ALLMs
cs.SD 2026-04 unverdicted novelty 6.0
NAICL reduces hallucination rates in ALLMs from 26.53% to 16.98% via noise priors in context and introduces the Clotho-1K benchmark with four hallucination types.

QoS-QoE Translation with Large Language Model
cs.MM 2026-04 unverdicted novelty 6.0
A new QoS-QoE Translation dataset is constructed from multimedia literature and fine-tuned LLMs demonstrate strong performance on bidirectional continuous and discrete QoS-QoE predictions.

MCAT: Scaling Many-to-Many Speech-to-Text Translation with MLLMs to 70 Languages
cs.CL 2025-12 conditional novelty 6.0
MCAT scales MLLMs to many-to-many speech translation across 70 languages via curriculum learning and a 30-token speech adapter, surpassing prior SOTA on FLEURS while improving speed.

Step-Audio 2 Technical Report
cs.CL 2025-07 unverdicted novelty 6.0
Step-Audio 2 integrates a latent audio encoder, reasoning-centric reinforcement learning, and discrete audio token generation into language modeling to deliver state-of-the-art performance on audio understanding and c...

Spatial-MLLM: Boosting MLLM Capabilities in Visual-based Spatial Intelligence
cs.CV 2025-05 unverdicted novelty 6.0
Spatial-MLLM boosts MLLM spatial intelligence from 2D inputs via dual encoders initialized from geometry models plus space-aware sampling, claiming state-of-the-art results.

Audio-Cogito: Towards Deep Audio Reasoning in Large Audio Language Models
eess.AS 2026-04 unverdicted novelty 5.0
Audio-Cogito is an open-source LALM using Cogito-pipe data curation and self-distillation to achieve leading open-source performance on audio reasoning benchmarks.

Interactive ASR: Towards Human-Like Interaction and Semantic Coherence Evaluation for Agentic Speech Recognition
cs.CL 2026-04 unverdicted novelty 5.0
The authors introduce LLM-based semantic judgment and an agentic interaction loop that improves semantic fidelity and enables iterative corrections in automatic speech recognition beyond traditional WER.

Kimi-Audio Technical Report
eess.AS 2025-04 unverdicted novelty 5.0
Kimi-Audio is an open-source audio foundation model that achieves state-of-the-art results on speech recognition, audio understanding, question answering, and conversation after pre-training on more than 13 million ho...

Robust Audio-Text Retrieval via Cross-Modal Attention and Hybrid Loss
cs.CL 2026-04 unverdicted novelty 4.0
A cross-modal attention refinement module plus hybrid loss improves robustness of audio-text retrieval on noisy and long-form audio.

Reference graph

Works this paper leans on

118 extracted references · 118 canonical work pages · cited by 16 Pith papers · 13 internal anchors

[2]

Flamingo: a visual language model for few-shot learning

Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, et al. Flamingo: a visual language model for few-shot learning. In Proc. NeurIPS, New Orleans, 2022

work page 2022
[3]

PaLM 2 Technical Report

Rohan Anil, Andrew M Dai, Orhan Firat, Melvin Johnson, Dmitry Lepikhin, Alexandre Passos, Siamak Shakeri, Emanuel Taropa, Paige Bailey, Zhifeng Chen, et al. PaLM 2 technical report. arXiv:2305.10403, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[4]

SLURP : A spoken language understanding resource package

Emanuele Bastianelli, Andrea Vanzo, Pawel Swietojanski, and Verena Rieser. SLURP : A spoken language understanding resource package. In Proc. EMNLP, 2020

work page 2020
[5]

Language models are few-shot learners

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. In Proc. NeurIPS, New Orleans, 2020

work page 2020
[6]

IEMOCAP : Interactive emotional dyadic motion capture database

Carlos Busso, Murtaza Bulut, Chi-Chun Lee, Abe Kazemzadeh, Emily Mower, Samuel Kim, Jeannette N Chang, Sungbok Lee, and Shrikanth S Narayanan. IEMOCAP : Interactive emotional dyadic motion capture database. Language resources and evaluation, 42: 0 335--359, 2008

work page 2008
[7]

X-LLM : B ootstrapping advanced large language models by treating multi-modalities as foreign languages

Feilong Chen, Minglun Han, Haozhi Zhao, Qingyang Zhang, Jing Shi, Shuang Xu, and Bo Xu. X-LLM : B ootstrapping advanced large language models by treating multi-modalities as foreign languages. arXiv:2305.04160, 2023 a

work page arXiv 2023
[8]

VideoLLM : Modeling video sequence with large language models

Guo Chen, Yin-Dong Zheng, Jiahao Wang, Jilan Xu, Yifei Huang, Junting Pan, Yi Wang, Yali Wang, Yu Qiao, Tong Lu, et al. VideoLLM : Modeling video sequence with large language models. arXiv:2305.13292, 2023 b

work page arXiv 2023
[9]

GigaSpeech : A n evolving, multi-domain ASR corpus with 10,000 hours of transcribed audio

Guoguo Chen, Shuzhou Chai, Guanbo Wang, Jiayu Du, Wei-Qiang Zhang, Chao Weng, Dan Su, Daniel Povey, Jan Trmal, Junbo Zhang, et al. GigaSpeech : A n evolving, multi-domain ASR corpus with 10,000 hours of transcribed audio. In Proc. Interspeech, Brno, 2021

work page 2021
[10]

WavLM : Large-scale self-supervised pre-training for full stack speech processing

Sanyuan Chen, Chengyi Wang, Zhengyang Chen, Yu Wu, Shujie Liu, Zhuo Chen, Jinyu Li, Naoyuki Kanda, Takuya Yoshioka, Xiong Xiao, et al. WavLM : Large-scale self-supervised pre-training for full stack speech processing. IEEE Journal of Selected Topics in Signal Processing, 16 0 (6): 0 1505--1518, 2022

work page 2022
[11]

BEATs : Audio pre-training with acoustic tokenizers

Sanyuan Chen, Yu Wu, Chengyi Wang, Shujie Liu, Daniel Tompkins, Zhuo Chen, and Furu Wei. BEATs : Audio pre-training with acoustic tokenizers. In Proc. ICML, Honolulu, 2023 c

work page 2023
[12]

Gonzalez, Ion Stoica, and Eric P

Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E. Gonzalez, Ion Stoica, and Eric P. Xing. Vicuna: An open-source chatbot impressing GPT-4 with 90\ ChatGPT quality, March 2023. URL https://lmsys.org/blog/2023-03-30-vicuna/

work page 2023
[14]

LibriMix : An open-source dataset for generalizable speech separation

Joris Cosentino, Manuel Pariente, Samuele Cornell, Antoine Deleforge, and Emmanuel Vincent. LibriMix : An open-source dataset for generalizable speech separation. arXiv:2005.11262, 2020

work page arXiv 2005
[15]

InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning

Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, et al. InstructBLIP : Towards general-purpose vision-language models with instruction tuning. arXiv:2305.06500, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[16]

LP-MusicCaps : LLM -based pseudo music captioning

SeungHeon Doh, Keunwoo Choi, Jongpil Lee, and Juhan Nam. LP-MusicCaps : LLM -based pseudo music captioning. arXiv:2307.16372, 2023

work page arXiv 2023
[17]

Clotho: An audio captioning dataset

Konstantinos Drossos, Samuel Lipping, and Tuomas Virtanen. Clotho: An audio captioning dataset. In Proc. ICASSP, Barcelona, 2020

work page 2020
[18]

GLM : General language model pretraining with autoregressive blank infilling

Zhengxiao Du, Yujie Qian, Xiao Liu, Ming Ding, Jiezhong Qiu, Zhilin Yang, and Jie Tang. GLM : General language model pretraining with autoregressive blank infilling. In Proc. ACL, Dublin, Ireland, 2022

work page 2022
[20]

Whisper-AT : Noise-robust automatic speech recognizers are also strong general audio event taggers

Yuan Gong, Sameer Khurana, Leonid Karlinsky, and James Glass. Whisper-AT : Noise-robust automatic speech recognizers are also strong general audio event taggers. In Proc. Interspeech, Dublin, Ireland, 2023 a

work page 2023
[22]

Soong, Lei He, and Lei Xie

Haohan Guo, Shaofei Zhang, Frank K. Soong, Lei He, and Lei Xie. Conversational end-to-end TTS for voice agents. In Proc. SLT, 2021

work page 2021
[23]

LoRA : Low-Rank Adaptation of large language models

Edward J Hu, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen, et al. LoRA : Low-Rank Adaptation of large language models. In Proc. ICLR, 2022

work page 2022
[24]

Dynamic-SUPERB : T owards a dynamic, collaborative, and comprehensive instruction-tuning benchmark for speech

Chien-yu Huang, Ke-Han Lu, Shih-Heng Wang, Chi-Yuan Hsiao, Chun-Yi Kuan, Haibin Wu, Siddhant Arora, Kai-Wei Chang, Jiatong Shi, Yifan Peng, et al. Dynamic-SUPERB : T owards a dynamic, collaborative, and comprehensive instruction-tuning benchmark for speech. arXiv:2309.09510, 2023 a

work page arXiv 2023
[25]

AudioGPT : U nderstanding and generating speech, music, sound, and talking head

Rongjie Huang, Mingze Li, Dongchao Yang, Jiatong Shi, Xuankai Chang, Zhenhui Ye, Yuning Wu, Zhiqing Hong, Jiawei Huang, Jinglin Liu, et al. AudioGPT : U nderstanding and generating speech, music, sound, and talking head. arXiv:2304.12995, 2023 b

work page arXiv 2023
[26]

Adapting self-supervised models to multi-talker speech recognition using speaker embeddings

Zili Huang, Desh Raj, Paola Garc \' a, and Sanjeev Khudanpur. Adapting self-supervised models to multi-talker speech recognition using speaker embeddings. In Proc. ICASSP, Rhodes, Greek, 2023 c

work page 2023
[27]

Improved automatic keyword extraction given more linguistic knowledge

Anette Hulth. Improved automatic keyword extraction given more linguistic knowledge. In Proc. EMNLP, Sapporo, Japan, 2003

work page 2003
[28]

AudioCaps : G enerating captions for audios in the wild

Chris Dongjoo Kim, Byeongchang Kim, Hyunmin Lee, and Gunhee Kim. AudioCaps : G enerating captions for audios in the wild. In Proc. NAACL-HLT, Minneapolis, 2019

work page 2019
[29]

The L ombard sign and the role of hearing in speech

Harlen Lane and Bernard Tranel. The L ombard sign and the role of hearing in speech. Journal of Speech Hearing Research, 14: 0 677--709, 1971

work page 1971
[30]

BLIP-2 : Bootstrapping language-image pre-training with frozen image encoders and large language models

Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. BLIP-2 : Bootstrapping language-image pre-training with frozen image encoders and large language models. In Proc. ICML, Hawaii, 2023 a

work page 2023
[31]

MERT : A coustic music understanding model with large-scale self-supervised training

Yizhi Li, Ruibin Yuan, Ge Zhang, Yinghao Ma, Xingran Chen, Hanzhi Yin, Chenghua Lin, Anton Ragni, Emmanouil Benetos, Norbert Gyenge, et al. MERT : A coustic music understanding model with large-scale self-supervised training. arXiv:2306.00107, 2023 b

work page arXiv 2023
[32]

Music understanding LLaMA : A dvancing text-to-music generation with question answering and captioning

Shansong Liu, Atin Sakkeer Hussain, Chenshuo Sun, and Ying Shan. Music understanding LLaMA : A dvancing text-to-music generation with question answering and captioning. arXiv:2308.11276, 2023

work page arXiv 2023
[33]

Macaw-LLM : Multi-modal language modeling with image, audio, video, and text integration

Chenyang Lyu, Minghao Wu, Longyue Wang, Xinting Huang, Bingshuai Liu, et al. Macaw-LLM : Multi-modal language modeling with image, audio, video, and text integration. arXiv:2306.09093, 2023

work page arXiv 2023
[34]

Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models

Muhammad Maaz, Hanoona Rasheed, Salman Khan, and Fahad Shahbaz Khan. Video-ChatGPT : Towards detailed video understanding via large vision and language models. arXiv:2306.05424, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[35]

WavCaps : A ChatGPT -assisted weakly-labelled audio captioning dataset for audio-language multimodal research

Xinhao Mei, Chutong Meng, Haohe Liu, Qiuqiang Kong, Tom Ko, Chengqi Zhao, Mark D Plumbley, Yuexian Zou, and Wenwu Wang. WavCaps : A ChatGPT -assisted weakly-labelled audio captioning dataset for audio-language multimodal research. arXiv:2303.17395, 2023

work page arXiv 2023
[36]

Voxceleb: Large-scale speaker verification in the wild

Arsha Nagrani, Joon Son Chung, Weidi Xie, and Andrew Zisserman. Voxceleb: Large-scale speaker verification in the wild. Computer Speech & Language, 60: 0 101027, 2019

work page 2019
[37]

Joint speech recognition and audio captioning

Chaitanya Narisetty, Emiru Tsunoo, Xuankai Chang, Yosuke Kashiwagi, Michael Hentschel, and Shinji Watanabe. Joint speech recognition and audio captioning. In Proc. ICASSP, Singapore, 2022

work page 2022
[38]

GPT-4 Technical Report

OpenAI. GPT-4 technical report. arXiv:2303.08774, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[39]

Comparing acoustic and textual representations of previous linguistic context for improving text-to-speech

Pilar Oplustil-Gallegos, Johannah O'Mahony, and Simon King. Comparing acoustic and textual representations of previous linguistic context for improving text-to-speech. In Proc. SSW, 2021

work page 2021
[40]

Training language models to follow instructions with human feedback

Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. In Proc. NeurIPS, New Orleans, 2022

work page 2022
[41]

Librispeech: An ASR corpus based on public domain audio books

Vassil Panayotov, Guoguo Chen, Daniel Povey, and Sanjeev Khudanpur. Librispeech: An ASR corpus based on public domain audio books. In Proc. ICASSP, South Brisbane, 2015

work page 2015
[42]

Instruction Tuning with GPT-4

Baolin Peng, Chunyuan Li, Pengcheng He, Michel Galley, and Jianfeng Gao. Instruction tuning with GPT-4 . arXiv:2304.03277, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[43]

Robust speech recognition via large-scale weak supervision

Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, and Ilya Sutskever. Robust speech recognition via large-scale weak supervision. In Proc. ICML, Honolulu, 2023

work page 2023
[45]

PandaGPT: One Model To Instruction-Follow Them All

Yixuan Su, Tian Lan, Huayang Li, Jialu Xu, Yan Wang, and Deng Cai. PandaGPT : One model to instruction-follow them all. arXiv:2305.16355, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[47]

Learning features of music from scratch

John Thickstun, Zaid Harchaoui, and Sham Kakade. Learning features of music from scratch. In Proc. ICLR, Toulon, France, 2017

work page 2017
[48]

LLaMA: Open and Efficient Foundation Language Models

Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, et al. LLaMA: Open and efficient foundation language models. arXiv:2302.13971, 2023

[49]

Attention is all you need

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Proc. NeurIPS, Long Beach, 2017

[50]

CoVoST 2 and massively multilingual speech translation

Changhan Wang, Anne Wu, and Juan Pino. CoVoST 2 and massively multilingual speech translation. In Proc. Interspeech, Brno, Czech Republic, 2021

[51]

Finetuned language models are zero-shot learners

Jason Wei, Maarten Bosma, Vincent Y Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M Dai, and Quoc V Le. Finetuned language models are zero-shot learners. In Proc. ICLR, 2022a

[52]

Emergent abilities of large language models

Jason Wei, Yi Tay, Rishi Bommasani, Colin Raffel, Barret Zoph, Sebastian Borgeaud, Dani Yogatama, Maarten Bosma, Denny Zhou, Donald Metzler, et al. Emergent abilities of large language models. Transactions on Machine Learning Research, 2022b

[54]

Emotion recognition by fusing time synchronous and time asynchronous representations

Wen Wu, Chao Zhang, and Philip C Woodland. Emotion recognition by fusing time synchronous and time asynchronous representations. In Proc. ICASSP, Toronto, Canada, 2021

[55]

Improving prosody modelling with cross-utterance BERT embeddings for end-to-end speech synthesis

Guanghui Xu, Wei Song, Zhengchen Zhang, Chao Zhang, Xiaodong He, and Bowen Zhou. Improving prosody modelling with cross-utterance BERT embeddings for end-to-end speech synthesis. In Proc. ICASSP, 2021

[56]

WikiQA: A challenge dataset for open-domain question answering

Yi Yang, Wen-tau Yih, and Christopher Meek. WikiQA: A challenge dataset for open-domain question answering. In Proc. EMNLP, Lisbon, Portugal, 2015

[58]

SpeechGPT: Empowering large language models with intrinsic cross-modal conversational abilities

Dong Zhang, Shimin Li, Xin Zhang, Jun Zhan, Pengyu Wang, Yaqian Zhou, and Xipeng Qiu. SpeechGPT: Empowering large language models with intrinsic cross-modal conversational abilities. arXiv:2305.11000, 2023a

[59]

Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding

Hang Zhang, Xin Li, and Lidong Bing. Video-LLaMA: An instruction-tuned audio-visual language model for video understanding. arXiv:2306.02858, 2023b

[60]

Prosody modelling with pre-trained cross-utterance representations for improved speech synthesis

Ya-Jie Zhang, Chao Zhang, Wei Song, Zhengchen Zhang, Yonghui Wu, and Xiaodong He. Prosody modelling with pre-trained cross-utterance representations for improved speech synthesis. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 31:2812–2823, 2023c

[61]

Learning video representations from large language models

Yue Zhao, Ishan Misra, Philipp Krähenbühl, and Rohit Girdhar. Learning video representations from large language models. In Proc. CVPR, New Orleans, 2022

[62]

MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models

Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, and Mohamed Elhoseiny. MiniGPT-4: Enhancing vision-language understanding with advanced large language models. arXiv:2304.10592, 2023

[63]

Language models are few-shot learners. In Proc. NeurIPS

[64]

Xinhao Mei, Chutong Meng, Haohe Liu, Qiuqiang Kong, Tom Ko, Chengqi Zhao, Mark D Plumbley, Yuexian Zou, and Wenwu Wang.

[65]

MusicLM: Generating Music From Text

Agostinelli, Andrea and Denk, Timo I and Borsos, Zal. arXiv:2301.11325

[66]

SeungHeon Doh, Keunwoo Choi, Jongpil Lee, and Juhan Nam.

[67]

Improved automatic keyword extraction given more linguistic knowledge. In Proc. EMNLP

[68]

Fine-grained Audio-Visual Joint Representations for Multimodal Large Language Models. arXiv:2310.05863

[69]

Yi Yang, Wen-tau Yih, and Christopher Meek.

[70]

Emanuele Bastianelli, Andrea Vanzo, Pawel Swietojanski, and Verena Rieser.

[71]

Learning features of music from scratch. In Proc. ICLR

[72]

Arsha Nagrani, Joon Son Chung, Weidi Xie, and Andrew Zisserman. Computer Speech & Language, 2019

[73]

Joris Cosentino, Manuel Pariente, Samuele Cornell, Antoine Deleforge, and Emmanuel Vincent.

[74]

Carlos Busso, Murtaza Bulut, Chi-Chun Lee, Abe Kazemzadeh, Emily Mower, Samuel Kim, Jeannette N Chang, Sungbok Lee, and Shrikanth S Narayanan. 2008

[75]

Changhan Wang, Anne Wu, and Juan Pino.

[76]

Emergent Abilities of Large Language Models. Transactions on Machine Learning Research

[77]

Chien-yu Huang, Ke-Han Lu, Shih-Heng Wang, Chi-Yuan Hsiao, Chun-Yi Kuan, Haibin Wu, Siddhant Arora, Kai-Wei Chang, Jiatong Shi, Yifan Peng, et al.

[78]

Finetuned language models are zero-shot learners. In Proc. ICLR

[79]

Konstantinos Drossos, Samuel Lipping, and Tuomas Virtanen. Clotho:

[80]

Chris Dongjoo Kim, Byeongchang Kim, Hyunmin Lee, and Gunhee Kim.

[81]

Librispeech:

Vassil Panayotov, Guoguo Chen, Daniel Povey, and Sanjeev Khudanpur. Librispeech:

[82]

Guoguo Chen, Shuzhou Chai, Guanbo Wang, Jiayu Du, Wei-Qiang Zhang, Chao Weng, Dan Su, Daniel Povey, Jan Trmal, Junbo Zhang, et al.

[83]

Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E. Gonzalez, Ion Stoica, and Eric P. Xing. Vicuna: An Open-Source Chatbot Impressing

[84]

Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, et al.

[85]

Visual Instruction Tuning

Visual Instruction Tuning. arXiv:2304.08485

[86]

Scaling Instruction-Finetuned Language Models

Scaling instruction-finetuned language models. arXiv:2210.11416

[87]

Training language models to follow instructions with human feedback. In Proc. NeurIPS

[88]

Instruction tuning with

Baolin Peng, Chunyuan Li, Pengcheng He, Michel Galley, and Jianfeng Gao. Instruction tuning with

