pith. machine review for the scientific record.

arxiv: 2310.13289 · v2 · pith:ZPTH6N7Znew · submitted 2023-10-20 · 💻 cs.SD · cs.CL · eess.AS

SALMONN: Towards Generic Hearing Abilities for Large Language Models

Pith reviewed 2026-05-18 02:24 UTC · model grok-4.3

keywords SALMONN · multimodal LLM · audio understanding · speech processing · emergent abilities · music captioning · hearing abilities · audio encoders

The pith

SALMONN integrates pre-trained speech and audio encoders with a large language model to enable direct processing and understanding of general audio inputs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper builds SALMONN by connecting speech and audio encoders to a text-based LLM in one multimodal model. This setup lets the model handle trained tasks such as automatic speech recognition, emotion recognition, speaker verification, and music captioning at competitive levels. It also produces emergent abilities on tasks never seen in training, including translation to untrained languages, spoken-query question answering, and audio-based storytelling. A reader would care because the work targets the basic ability of AI agents to perceive and reason about sounds in the physical world rather than relying on separate specialized tools for each audio type.

Core claim

SALMONN integrates a pre-trained text-based large language model with speech and audio encoders into a single multimodal model. The model directly processes general audio inputs consisting of speech, audio events, and music. It reaches competitive performance on trained tasks such as automatic speech recognition and translation, auditory question answering, emotion recognition, speaker verification, and music and audio captioning. It also exhibits emergent abilities unseen during training, such as speech translation to untrained languages, speech-based slot filling, spoken-query question answering, audio-based storytelling, and speech-audio co-reasoning. A novel few-shot activation tuning approach is proposed to activate these abilities.

What carries the argument

The architecture that feeds features from pre-trained speech and audio encoders into the LLM, together with few-shot activation tuning to bring out cross-modal emergent abilities.

If this is right

  • A single model can replace multiple task-specific audio systems for speech recognition, translation, emotion detection, and captioning.
  • Emergent abilities such as translation to untrained languages and audio storytelling appear without explicit training for those tasks.
  • Few-shot activation tuning can bring out cross-modal abilities that were not present after standard training.
  • The model supports combined speech and audio inputs for co-reasoning tasks.
  • General auditory inputs become usable for question answering and slot filling without additional modules.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same encoder-integration pattern could be applied to add video or other sensory streams to create models with combined perceptual abilities.
  • If the emergent abilities prove robust, future audio AI work may need far less task-specific labeled data once rich encoder features are available.
  • Real deployments might allow AI systems to react to background sounds and environmental audio alongside spoken commands.
  • Controlled tests with overlapping sounds or added noise would check whether the reported generic abilities survive outside clean training conditions.

Load-bearing premise

Features from the pre-trained speech and audio encoders, once integrated with the LLM through the chosen architecture and training, produce genuine understanding of general auditory information rather than task-specific pattern matching.

What would settle it

If the model continues to succeed on the trained audio tasks but fails on a new type of auditory reasoning task that has no overlap with the training distribution, such as identifying and describing an unfamiliar environmental sound event in context, the claim of generic hearing abilities would not hold.

Original abstract

Hearing is arguably an essential ability of artificial intelligence (AI) agents in the physical world, which refers to the perception and understanding of general auditory information consisting of at least three types of sounds: speech, audio events, and music. In this paper, we propose SALMONN, a speech audio language music open neural network, built by integrating a pre-trained text-based large language model (LLM) with speech and audio encoders into a single multimodal model. SALMONN enables the LLM to directly process and understand general audio inputs and achieve competitive performances on a number of speech and audio tasks used in training, such as automatic speech recognition and translation, auditory-information-based question answering, emotion recognition, speaker verification, and music and audio captioning etc. SALMONN also has a diverse set of emergent abilities unseen in the training, which includes but is not limited to speech translation to untrained languages, speech-based slot filling, spoken-query-based question answering, audio-based storytelling, and speech audio co-reasoning etc. The presence of cross-modal emergent abilities is studied, and a novel few-shot activation tuning approach is proposed to activate such abilities. To our knowledge, SALMONN is the first model of its type and can be regarded as a step towards AI with generic hearing abilities. The source code, model checkpoints and data are available at https://github.com/bytedance/SALMONN.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces SALMONN, a multimodal model that integrates pre-trained Whisper speech and BEATs audio encoders with a text-based LLM via a Q-Former connector. It uses multi-task instruction tuning on a mixture of speech and audio tasks followed by few-shot activation tuning to achieve competitive results on in-distribution tasks including ASR, speech translation, auditory QA, emotion recognition, speaker verification, and music/audio captioning, while also exhibiting emergent cross-modal abilities such as translation to untrained languages, speech-based slot filling, spoken-query QA, audio storytelling, and speech-audio co-reasoning.
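The connector idea in the summary can be illustrated with a toy example. The sketch below is not the paper's implementation: it assumes the two encoders emit the same number of frames, fuses them by concatenation, and uses a single cross-attention head with a small set of queries in place of a full Q-Former. All dimensions, weight matrices, and function names are hypothetical.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def rand_matrix(rows, cols, rng):
    """Random Gaussian matrix standing in for features or learned weights."""
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def connect_audio_to_llm(speech_feats, audio_feats, queries, w_k, w_v, w_out):
    """Map two encoder outputs to a fixed number of LLM-sized vectors."""
    # Assumption: frame-aligned encoders, fused by feature concatenation.
    fused = [s + a for s, a in zip(speech_feats, audio_feats)]  # (T, d_s + d_a)
    keys = matmul(fused, w_k)                                   # (T, d_q)
    values = matmul(fused, w_v)                                 # (T, d_q)
    keys_t = [list(col) for col in zip(*keys)]                  # (d_q, T)
    scale = math.sqrt(len(queries[0]))
    scores = matmul(queries, keys_t)                            # (N, T)
    attn = [softmax([v / scale for v in row]) for row in scores]
    pooled = matmul(attn, values)                               # (N, d_q)
    return matmul(pooled, w_out), attn                          # (N, d_llm)


rng = random.Random(0)
T, d_s, d_a, d_q, d_llm, n_queries = 20, 6, 5, 4, 8, 3  # toy sizes only
speech = rand_matrix(T, d_s, rng)
audio = rand_matrix(T, d_a, rng)
queries = rand_matrix(n_queries, d_q, rng)
w_k = rand_matrix(d_s + d_a, d_q, rng)
w_v = rand_matrix(d_s + d_a, d_q, rng)
w_out = rand_matrix(d_q, d_llm, rng)

tokens, attn = connect_audio_to_llm(speech, audio, queries, w_k, w_v, w_out)
print(len(tokens), len(tokens[0]))  # 3 8
```

The point of the sketch is the shape contract: however many frames the encoders produce, the LLM receives a fixed number of vectors in its own embedding dimension.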

Significance. If the quantitative results and emergent-ability demonstrations hold under rigorous scrutiny, the work provides a concrete step toward generic auditory understanding in LLMs, extending their utility to real-world audio inputs beyond text. The open release of code, model checkpoints, and training data is a clear strength that supports reproducibility and follow-on research. The proposed activation-tuning method offers a practical technique for eliciting cross-modal generalization that may transfer to other multimodal settings.

major comments (2)
  1. [Section 5] Section 5 (Experiments): the claim of 'competitive performances' on training tasks is not accompanied by per-task numerical tables comparing against strong baselines (e.g., Whisper-only, AudioLM, or prior multimodal LLMs); without these numbers and statistical significance tests, it is difficult to judge whether the observed gains are attributable to the architecture or to the scale of the training mixture.
  2. [Section 4.3] Section 4.3 (Few-shot Activation Tuning): the procedure for selecting the few-shot examples and the precise mechanism by which they 'activate' emergent abilities is described at a high level; an ablation removing the activation-tuning stage or varying the number of shots would be required to establish that this step is load-bearing for the reported emergent behaviors rather than an artifact of prompt engineering.
minor comments (2)
  1. [Figure 2] Figure 2 (model diagram): the flow from audio encoder outputs through the Q-Former to the LLM token embeddings is visually clear but lacks explicit dimension annotations on the connector layers, which would aid readers in reproducing the exact architecture.
  2. [Section 3.2] Section 3.2: the description of the multi-task data mixture would benefit from an explicit table listing the proportion or number of examples per task type to clarify how the training distribution was balanced.
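The significance testing requested in major comment 1 could take several forms. One common option, shown here as a sketch rather than as the procedure the authors or the referee have in mind, is a paired percentile bootstrap over per-utterance scores from two systems evaluated on the same test set. The function name and the example scores are hypothetical and do not come from the paper.

```python
import random
import statistics


def paired_bootstrap_ci(scores_a, scores_b, n_resamples=2000, alpha=0.05, seed=0):
    """Mean paired difference, its standard deviation, and a percentile CI."""
    if len(scores_a) != len(scores_b):
        raise ValueError("both systems must be scored on the same items")
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    resampled_means = []
    for _ in range(n_resamples):
        # Resample items with replacement, keeping each (a, b) pair together.
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        resampled_means.append(statistics.fmean(sample))
    resampled_means.sort()
    lo_idx = int((alpha / 2) * n_resamples)
    hi_idx = min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))
    interval = (resampled_means[lo_idx], resampled_means[hi_idx])
    return statistics.fmean(diffs), statistics.stdev(diffs), interval


# Made-up per-utterance error rates, for illustration only.
system_a = [0.12, 0.08, 0.30, 0.05, 0.22, 0.10, 0.17, 0.09]
system_b = [0.10, 0.09, 0.25, 0.04, 0.20, 0.11, 0.15, 0.07]
mean_diff, sd_diff, ci = paired_bootstrap_ci(system_a, system_b)
print(f"mean diff={mean_diff:.4f}, sd={sd_diff:.4f}, 95% CI=({ci[0]:.4f}, {ci[1]:.4f})")
```

If the interval excludes zero, the difference between the two systems is unlikely to be a sampling artifact of the test set. That still does not separate the effect of the architecture from the effect of training data scale, which needs a controlled comparison.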

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the positive assessment, the recommendation of minor revision, and the constructive comments on Sections 5 and 4.3. We address each point below and will update the manuscript accordingly to strengthen the presentation of results and the activation-tuning procedure.

  1. Referee: [Section 5] Section 5 (Experiments): the claim of 'competitive performances' on training tasks is not accompanied by per-task numerical tables comparing against strong baselines (e.g., Whisper-only, AudioLM, or prior multimodal LLMs); without these numbers and statistical significance tests, it is difficult to judge whether the observed gains are attributable to the architecture or to the scale of the training mixture.

    Authors: We thank the referee for this observation. The current manuscript reports competitive results on the training tasks but presents them primarily through figures and selected comparisons rather than exhaustive per-task tables. We agree that adding systematic numerical tables against strong baselines (Whisper, AudioLM, and prior multimodal LLMs) together with statistical significance tests would make the claims more rigorous and help isolate the contribution of the proposed architecture from training data scale. In the revised version we will include such tables for ASR, speech translation, auditory QA, emotion recognition, speaker verification, and music/audio captioning, reporting means, standard deviations, and p-values or confidence intervals where appropriate.

    revision: yes

  2. Referee: [Section 4.3] Section 4.3 (Few-shot Activation Tuning): the procedure for selecting the few-shot examples and the precise mechanism by which they 'activate' emergent abilities is described at a high level; an ablation removing the activation-tuning stage or varying the number of shots would be required to establish that this step is load-bearing for the reported emergent behaviors rather than an artifact of prompt engineering.

    Authors: We agree that the description of few-shot example selection and the activation mechanism is currently high-level and that ablations are needed to demonstrate the contribution of this stage. In the revision we will expand Section 4.3 with a more detailed account of how the few-shot examples are chosen (including criteria and examples) and the hypothesized activation process. We will also add the requested ablations: (i) performance with the activation-tuning stage removed entirely, and (ii) performance when varying the number of shots (0, 1, 5, and 10). These experiments will help confirm that the reported emergent abilities depend on the activation-tuning step rather than prompt engineering alone.

    revision: yes
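The ablation promised in response 2 amounts to a small grid: one condition with activation tuning removed, plus one condition per shot count. The sketch below only enumerates that grid and collects scores. The `evaluate` callback is a hypothetical stand-in for whatever training and scoring pipeline would actually be used, and the placeholder returns NaN so that no number can be mistaken for a result.

```python
from typing import Callable, Dict, List, Optional, Sequence, Tuple

# (activation_tuning_enabled, number_of_shots); shots is None when tuning is removed.
Condition = Tuple[bool, Optional[int]]


def build_conditions(shot_counts: Sequence[int] = (0, 1, 5, 10)) -> List[Condition]:
    """List the ablation conditions: stage removed, then one per shot count."""
    conditions: List[Condition] = [(False, None)]
    conditions += [(True, k) for k in shot_counts]
    return conditions


def run_ablation(
    tasks: Sequence[str],
    evaluate: Callable[[str, bool, Optional[int]], float],
    shot_counts: Sequence[int] = (0, 1, 5, 10),
) -> Dict[Tuple[str, bool, Optional[int]], float]:
    """Score every task under every condition using the supplied callback."""
    results = {}
    for task in tasks:
        for tuned, shots in build_conditions(shot_counts):
            results[(task, tuned, shots)] = evaluate(task, tuned, shots)
    return results


def placeholder_evaluate(task: str, tuned: bool, shots: Optional[int]) -> float:
    """Hypothetical stand-in; a real version would train and score a model."""
    return float("nan")


results = run_ablation(["audio storytelling", "spoken-query QA"], placeholder_evaluate)
for (task, tuned, shots), score in results.items():
    print(f"{task:20s} tuned={tuned!s:5s} shots={shots} score={score}")
```

Keeping the removed-stage condition separate from the 0-shot condition matters if the tuning stage changes anything besides the few-shot examples, such as the prompt format or the training schedule.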

Circularity Check

0 steps flagged

No significant circularity; empirical model construction


The paper presents SALMONN as an engineering integration of pre-trained Whisper and BEATs encoders with an LLM via Q-Former and instruction tuning, followed by empirical evaluation on speech/audio tasks and observation of emergent behaviors. No mathematical derivation, first-principles equations, or claimed predictions are offered that could reduce to fitted inputs or self-citations by construction. All results follow from standard training and testing procedures on external datasets; the manuscript remains self-contained without invoking load-bearing self-referential steps of the enumerated kinds.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the assumption that pre-trained encoders supply usable auditory features and that joint training plus few-shot tuning produces genuine cross-modal understanding and emergence rather than superficial correlations.

axioms (1)
  • Domain assumption: Pre-trained speech and audio encoders capture features sufficient for general auditory understanding when connected to an LLM.
    Invoked when claiming the model can directly process and understand general audio inputs.

pith-pipeline@v0.9.0 · 5807 in / 1204 out tokens · 43861 ms · 2026-05-18T02:24:06.864395+00:00 · methodology


Forward citations

Cited by 16 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Character Beyond Speech: Leveraging Role-Playing Evaluation in Audio Large Language Models via Reinforcement Learning

    cs.LG 2026-04 unverdicted novelty 7.0

    RoleJudge is a multidimensional evaluation framework for speech-character alignment in audio LLMs, backed by the RoleChat dataset and multi-stage RL training with standard alignment to reduce reward issues.

  2. The World is Not Mono: Enabling Spatial Understanding in Large Audio-Language Models

    cs.SD 2026-01 unverdicted novelty 7.0

    TWNM framework equips audio-language models with spatial scene analysis via FOA simulation and metadata-grounded training, reaching 70.8% accuracy on a new ASA benchmark.

  3. M$^3$KG-RAG: Multi-hop Multimodal Knowledge Graph-enhanced Retrieval-Augmented Generation

    cs.CL 2025-12 unverdicted novelty 7.0

    M³KG-RAG improves multimodal reasoning in large language models by constructing multi-hop knowledge graphs and selectively pruning retrieved context with GRASP.

  4. Protecting Bystander Privacy via Selective Hearing in Audio LLMs

    cs.SD 2025-12 conditional novelty 7.0

    Audio LLMs leak bystander speech; SH-Bench benchmark and BPFT fine-tuning raise selective accuracy by 47% and selective efficacy by 16% over Gemini 2.5 Pro.

  5. WorldSense: Evaluating Real-world Omnimodal Understanding for Multimodal LLMs

    cs.CV 2025-02 unverdicted novelty 7.0

    WorldSense provides the first benchmark requiring synergistic audio-video-text understanding on 1,662 real-world videos and 3,172 QA pairs, where the best current multimodal LLM reaches only 65.1% accuracy.

  6. VoiceBench: Benchmarking LLM-Based Voice Assistants

    cs.CL 2024-10 unverdicted novelty 7.0

    VoiceBench is the first benchmark for multi-faceted evaluation of LLM voice assistants using real and synthetic spoken instructions with speaker, environmental, and content variations.

  7. HeadRouter: Dynamic Head-Weight Routing for Task-Adaptive Audio Token Pruning in Large Audio Language Models

    cs.SD 2026-04 unverdicted novelty 6.0

    HeadRouter prunes audio tokens more effectively by dynamically routing based on per-head importance for semantic versus acoustic tasks, exceeding baseline performance at 70% token retention on Qwen2.5-Omni models.

  8. Noise-Aware In-Context Learning for Hallucination Mitigation in ALLMs

    cs.SD 2026-04 unverdicted novelty 6.0

    NAICL reduces hallucination rates in ALLMs from 26.53% to 16.98% via noise priors in context and introduces the Clotho-1K benchmark with four hallucination types.

  9. QoS-QoE Translation with Large Language Model

    cs.MM 2026-04 unverdicted novelty 6.0

    A new QoS-QoE Translation dataset is constructed from multimedia literature and fine-tuned LLMs demonstrate strong performance on bidirectional continuous and discrete QoS-QoE predictions.

  10. MCAT: Scaling Many-to-Many Speech-to-Text Translation with MLLMs to 70 Languages

    cs.CL 2025-12 conditional novelty 6.0

    MCAT scales MLLMs to many-to-many speech translation across 70 languages via curriculum learning and a 30-token speech adapter, surpassing prior SOTA on FLEURS while improving speed.

  11. Step-Audio 2 Technical Report

    cs.CL 2025-07 unverdicted novelty 6.0

    Step-Audio 2 integrates a latent audio encoder, reasoning-centric reinforcement learning, and discrete audio token generation into language modeling to deliver state-of-the-art performance on audio understanding and c...

  12. Spatial-MLLM: Boosting MLLM Capabilities in Visual-based Spatial Intelligence

    cs.CV 2025-05 unverdicted novelty 6.0

    Spatial-MLLM boosts MLLM spatial intelligence from 2D inputs via dual encoders initialized from geometry models plus space-aware sampling, claiming state-of-the-art results.

  13. Audio-Cogito: Towards Deep Audio Reasoning in Large Audio Language Models

    eess.AS 2026-04 unverdicted novelty 5.0

    Audio-Cogito is an open-source LALM using Cogito-pipe data curation and self-distillation to achieve leading open-source performance on audio reasoning benchmarks.

  14. Interactive ASR: Towards Human-Like Interaction and Semantic Coherence Evaluation for Agentic Speech Recognition

    cs.CL 2026-04 unverdicted novelty 5.0

    The authors introduce LLM-based semantic judgment and an agentic interaction loop that improves semantic fidelity and enables iterative corrections in automatic speech recognition beyond traditional WER.

  15. Kimi-Audio Technical Report

    eess.AS 2025-04 unverdicted novelty 5.0

    Kimi-Audio is an open-source audio foundation model that achieves state-of-the-art results on speech recognition, audio understanding, question answering, and conversation after pre-training on more than 13 million ho...

  16. Robust Audio-Text Retrieval via Cross-Modal Attention and Hybrid Loss

    cs.CL 2026-04 unverdicted novelty 4.0

    A cross-modal attention refinement module plus hybrid loss improves robustness of audio-text retrieval on noisy and long-form audio.

Reference graph

Works this paper leans on

118 extracted references · 118 canonical work pages · cited by 16 Pith papers · 13 internal anchors

  1. [2]

    Flamingo: a visual language model for few-shot learning

    Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, et al. Flamingo: a visual language model for few-shot learning. In Proc. NeurIPS, New Orleans, 2022

  2. [3]

    PaLM 2 Technical Report

    Rohan Anil, Andrew M Dai, Orhan Firat, Melvin Johnson, Dmitry Lepikhin, Alexandre Passos, Siamak Shakeri, Emanuel Taropa, Paige Bailey, Zhifeng Chen, et al. PaLM 2 technical report. arXiv:2305.10403, 2023

  3. [4]

    SLURP : A spoken language understanding resource package

    Emanuele Bastianelli, Andrea Vanzo, Pawel Swietojanski, and Verena Rieser. SLURP : A spoken language understanding resource package. In Proc. EMNLP, 2020

  4. [5]

    Language models are few-shot learners

    Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. In Proc. NeurIPS, New Orleans, 2020

  5. [6]

    IEMOCAP : Interactive emotional dyadic motion capture database

    Carlos Busso, Murtaza Bulut, Chi-Chun Lee, Abe Kazemzadeh, Emily Mower, Samuel Kim, Jeannette N Chang, Sungbok Lee, and Shrikanth S Narayanan. IEMOCAP : Interactive emotional dyadic motion capture database. Language resources and evaluation, 42: 0 335--359, 2008

  6. [7]

    X-LLM : B ootstrapping advanced large language models by treating multi-modalities as foreign languages

    Feilong Chen, Minglun Han, Haozhi Zhao, Qingyang Zhang, Jing Shi, Shuang Xu, and Bo Xu. X-LLM : B ootstrapping advanced large language models by treating multi-modalities as foreign languages. arXiv:2305.04160, 2023 a

  7. [8]

    VideoLLM : Modeling video sequence with large language models

    Guo Chen, Yin-Dong Zheng, Jiahao Wang, Jilan Xu, Yifei Huang, Junting Pan, Yi Wang, Yali Wang, Yu Qiao, Tong Lu, et al. VideoLLM : Modeling video sequence with large language models. arXiv:2305.13292, 2023 b

  8. [9]

    GigaSpeech : A n evolving, multi-domain ASR corpus with 10,000 hours of transcribed audio

    Guoguo Chen, Shuzhou Chai, Guanbo Wang, Jiayu Du, Wei-Qiang Zhang, Chao Weng, Dan Su, Daniel Povey, Jan Trmal, Junbo Zhang, et al. GigaSpeech : A n evolving, multi-domain ASR corpus with 10,000 hours of transcribed audio. In Proc. Interspeech, Brno, 2021

  9. [10]

    WavLM : Large-scale self-supervised pre-training for full stack speech processing

    Sanyuan Chen, Chengyi Wang, Zhengyang Chen, Yu Wu, Shujie Liu, Zhuo Chen, Jinyu Li, Naoyuki Kanda, Takuya Yoshioka, Xiong Xiao, et al. WavLM : Large-scale self-supervised pre-training for full stack speech processing. IEEE Journal of Selected Topics in Signal Processing, 16 0 (6): 0 1505--1518, 2022

  10. [11]

    BEATs : Audio pre-training with acoustic tokenizers

    Sanyuan Chen, Yu Wu, Chengyi Wang, Shujie Liu, Daniel Tompkins, Zhuo Chen, and Furu Wei. BEATs : Audio pre-training with acoustic tokenizers. In Proc. ICML, Honolulu, 2023 c

  11. [12]

    Gonzalez, Ion Stoica, and Eric P

    Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E. Gonzalez, Ion Stoica, and Eric P. Xing. Vicuna: An open-source chatbot impressing GPT-4 with 90\ ChatGPT quality, March 2023. URL https://lmsys.org/blog/2023-03-30-vicuna/

  12. [14]

    LibriMix : An open-source dataset for generalizable speech separation

    Joris Cosentino, Manuel Pariente, Samuele Cornell, Antoine Deleforge, and Emmanuel Vincent. LibriMix : An open-source dataset for generalizable speech separation. arXiv:2005.11262, 2020

  13. [15]

    InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning

    Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, et al. InstructBLIP : Towards general-purpose vision-language models with instruction tuning. arXiv:2305.06500, 2023

  14. [16]

    LP-MusicCaps : LLM -based pseudo music captioning

    SeungHeon Doh, Keunwoo Choi, Jongpil Lee, and Juhan Nam. LP-MusicCaps : LLM -based pseudo music captioning. arXiv:2307.16372, 2023

  15. [17]

    Clotho: An audio captioning dataset

    Konstantinos Drossos, Samuel Lipping, and Tuomas Virtanen. Clotho: An audio captioning dataset. In Proc. ICASSP, Barcelona, 2020

  16. [18]

    GLM : General language model pretraining with autoregressive blank infilling

    Zhengxiao Du, Yujie Qian, Xiao Liu, Ming Ding, Jiezhong Qiu, Zhilin Yang, and Jie Tang. GLM : General language model pretraining with autoregressive blank infilling. In Proc. ACL, Dublin, Ireland, 2022

  17. [20]

    Whisper-AT : Noise-robust automatic speech recognizers are also strong general audio event taggers

    Yuan Gong, Sameer Khurana, Leonid Karlinsky, and James Glass. Whisper-AT : Noise-robust automatic speech recognizers are also strong general audio event taggers. In Proc. Interspeech, Dublin, Ireland, 2023 a

  18. [22]

    Soong, Lei He, and Lei Xie

    Haohan Guo, Shaofei Zhang, Frank K. Soong, Lei He, and Lei Xie. Conversational end-to-end TTS for voice agents. In Proc. SLT, 2021

  19. [23]

    LoRA : Low-Rank Adaptation of large language models

    Edward J Hu, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen, et al. LoRA : Low-Rank Adaptation of large language models. In Proc. ICLR, 2022

  20. [24]

    Dynamic-SUPERB : T owards a dynamic, collaborative, and comprehensive instruction-tuning benchmark for speech

    Chien-yu Huang, Ke-Han Lu, Shih-Heng Wang, Chi-Yuan Hsiao, Chun-Yi Kuan, Haibin Wu, Siddhant Arora, Kai-Wei Chang, Jiatong Shi, Yifan Peng, et al. Dynamic-SUPERB : T owards a dynamic, collaborative, and comprehensive instruction-tuning benchmark for speech. arXiv:2309.09510, 2023 a

  21. [25]

    AudioGPT : U nderstanding and generating speech, music, sound, and talking head

    Rongjie Huang, Mingze Li, Dongchao Yang, Jiatong Shi, Xuankai Chang, Zhenhui Ye, Yuning Wu, Zhiqing Hong, Jiawei Huang, Jinglin Liu, et al. AudioGPT : U nderstanding and generating speech, music, sound, and talking head. arXiv:2304.12995, 2023 b

  22. [26]

    Adapting self-supervised models to multi-talker speech recognition using speaker embeddings

    Zili Huang, Desh Raj, Paola Garc \' a, and Sanjeev Khudanpur. Adapting self-supervised models to multi-talker speech recognition using speaker embeddings. In Proc. ICASSP, Rhodes, Greek, 2023 c

  23. [27]

    Improved automatic keyword extraction given more linguistic knowledge

    Anette Hulth. Improved automatic keyword extraction given more linguistic knowledge. In Proc. EMNLP, Sapporo, Japan, 2003

  24. [28]

    AudioCaps : G enerating captions for audios in the wild

    Chris Dongjoo Kim, Byeongchang Kim, Hyunmin Lee, and Gunhee Kim. AudioCaps : G enerating captions for audios in the wild. In Proc. NAACL-HLT, Minneapolis, 2019

  25. [29]

    The L ombard sign and the role of hearing in speech

    Harlen Lane and Bernard Tranel. The L ombard sign and the role of hearing in speech. Journal of Speech Hearing Research, 14: 0 677--709, 1971

  26. [30]

    BLIP-2 : Bootstrapping language-image pre-training with frozen image encoders and large language models

    Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. BLIP-2 : Bootstrapping language-image pre-training with frozen image encoders and large language models. In Proc. ICML, Hawaii, 2023 a

  27. [31]

    MERT : A coustic music understanding model with large-scale self-supervised training

    Yizhi Li, Ruibin Yuan, Ge Zhang, Yinghao Ma, Xingran Chen, Hanzhi Yin, Chenghua Lin, Anton Ragni, Emmanouil Benetos, Norbert Gyenge, et al. MERT : A coustic music understanding model with large-scale self-supervised training. arXiv:2306.00107, 2023 b

  28. [32]

    Music understanding LLaMA : A dvancing text-to-music generation with question answering and captioning

    Shansong Liu, Atin Sakkeer Hussain, Chenshuo Sun, and Ying Shan. Music understanding LLaMA : A dvancing text-to-music generation with question answering and captioning. arXiv:2308.11276, 2023

  29. [33]

    Macaw-LLM : Multi-modal language modeling with image, audio, video, and text integration

    Chenyang Lyu, Minghao Wu, Longyue Wang, Xinting Huang, Bingshuai Liu, et al. Macaw-LLM : Multi-modal language modeling with image, audio, video, and text integration. arXiv:2306.09093, 2023

  30. [34]

    Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models

    Muhammad Maaz, Hanoona Rasheed, Salman Khan, and Fahad Shahbaz Khan. Video-ChatGPT : Towards detailed video understanding via large vision and language models. arXiv:2306.05424, 2023

  31. [35]

    WavCaps : A ChatGPT -assisted weakly-labelled audio captioning dataset for audio-language multimodal research

    Xinhao Mei, Chutong Meng, Haohe Liu, Qiuqiang Kong, Tom Ko, Chengqi Zhao, Mark D Plumbley, Yuexian Zou, and Wenwu Wang. WavCaps : A ChatGPT -assisted weakly-labelled audio captioning dataset for audio-language multimodal research. arXiv:2303.17395, 2023

  32. [36]

    Voxceleb: Large-scale speaker verification in the wild

    Arsha Nagrani, Joon Son Chung, Weidi Xie, and Andrew Zisserman. Voxceleb: Large-scale speaker verification in the wild. Computer Speech & Language, 60: 0 101027, 2019

  33. [37]

    Joint speech recognition and audio captioning

    Chaitanya Narisetty, Emiru Tsunoo, Xuankai Chang, Yosuke Kashiwagi, Michael Hentschel, and Shinji Watanabe. Joint speech recognition and audio captioning. In Proc. ICASSP, Singapore, 2022

  34. [38]

    GPT-4 Technical Report

    OpenAI. GPT-4 technical report. arXiv:2303.08774, 2023

  35. [39]

    Comparing acoustic and textual representations of previous linguistic context for improving text-to-speech

    Pilar Oplustil-Gallegos, Johannah O'Mahony, and Simon King. Comparing acoustic and textual representations of previous linguistic context for improving text-to-speech. In Proc. SSW, 2021

  36. [40]

    Training language models to follow instructions with human feedback

    Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. In Proc. NeurIPS, New Orleans, 2022

  37. [41]

    Librispeech: An ASR corpus based on public domain audio books

    Vassil Panayotov, Guoguo Chen, Daniel Povey, and Sanjeev Khudanpur. Librispeech: An ASR corpus based on public domain audio books. In Proc. ICASSP, South Brisbane, 2015

  38. [42]

    Instruction Tuning with GPT-4

    Baolin Peng, Chunyuan Li, Pengcheng He, Michel Galley, and Jianfeng Gao. Instruction tuning with GPT-4 . arXiv:2304.03277, 2023

  39. [43]

    Robust speech recognition via large-scale weak supervision

    Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, and Ilya Sutskever. Robust speech recognition via large-scale weak supervision. In Proc. ICML, Honolulu, 2023

  40. [45]

    PandaGPT: One Model To Instruction-Follow Them All

    Yixuan Su, Tian Lan, Huayang Li, Jialu Xu, Yan Wang, and Deng Cai. PandaGPT : One model to instruction-follow them all. arXiv:2305.16355, 2023

  41. [47]

    Learning features of music from scratch

    John Thickstun, Zaid Harchaoui, and Sham Kakade. Learning features of music from scratch. In Proc. ICLR, Toulon, France, 2017

  42. [48]

    LLaMA: Open and Efficient Foundation Language Models

    Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, et al. LLaMA : Open and efficient foundation language models. arXiv:2302.13971, 2023

  43. [49]

    Attention is all you need

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, ukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Proc. NeurIPS, Long Beach, 2017

  44. [50]

    CoVoST 2 and massively multilingual speech translation

    Changhan Wang, Anne Wu, and Juan Pino. CoVoST 2 and massively multilingual speech translation. In Proc. Interspeech, Brno, Czech Republic, 2021

  45. [51]

    Finetuned language models are zero-shot learners

    Jason Wei, Maarten Bosma, Vincent Y Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M Dai, and Quoc V Le. Finetuned language models are zero-shot learners. In Proc. ICLR, 2022 a

  46. [52]

    Emergent abilities of large language models

    Jason Wei, Yi Tay, Rishi Bommasani, Colin Raffel, Barret Zoph, Sebastian Borgeaud, Dani Yogatama, Maarten Bosma, Denny Zhou, Donald Metzler, et al. Emergent abilities of large language models. Transactions on Machine Learning Research, 2022 b

  47. [54]

    Emotion recognition by fusing time synchronous and time asynchronous representations

    Wen Wu, Chao Zhang, and Philip C Woodland. Emotion recognition by fusing time synchronous and time asynchronous representations. In Proc. ICASSP, Toronto, Canada, 2021

  48. [55]

    Improving prosody modelling with cross-utterance BERT embeddings for end-to-end speech synthesis

    Guanghui Xu, Wei Song, Zhengchen Zhang, Chao ZHang, Xiaodong He, and Bowen Zhou. Improving prosody modelling with cross-utterance BERT embeddings for end-to-end speech synthesis. In Proc. ICASSP, 2021

  49. [56]

    WikiQA : A challenge dataset for open-domain question answering

    Yi Yang, Wen-tau Yih, and Christopher Meek. WikiQA : A challenge dataset for open-domain question answering. In Proc. EMNLP, Lisbon, Portugal, 2015

  50. [58]

    SpeechGPT : E mpowering large language models with intrinsic cross-modal conversational abilities

    Dong Zhang, Shimin Li, Xin Zhang, Jun Zhan, Pengyu Wang, Yaqian Zhou, and Xipeng Qiu. SpeechGPT : E mpowering large language models with intrinsic cross-modal conversational abilities. arXiv:2305.11000, 2023 a

  51. [59]

    Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding

    Hang Zhang, Xin Li, and Lidong Bing. Video-LLaMA: An instruction-tuned audio-visual language model for video understanding. arXiv:2306.02858, 2023b

  52. [60]

    Prosody modelling with pre-trained cross-utterance representations for improved speech synthesis

    Ya-Jie Zhang, Chao Zhang, Wei Song, Zhengchen Zhang, Yonghui Wu, and Xiaodong He. Prosody modelling with pre-trained cross-utterance representations for improved speech synthesis. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 31:2812-2823, 2023c

  53. [61]

    Learning video representations from large language models

    Yue Zhao, Ishan Misra, Philipp Krähenbühl, and Rohit Girdhar. Learning video representations from large language models. In Proc. CVPR, New Orleans, 2022

  54. [62]

    MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models

    Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, and Mohamed Elhoseiny. MiniGPT-4: Enhancing vision-language understanding with advanced large language models. arXiv:2304.10592, 2023

  55. [63]

    Language models are few-shot learners. In Proc. NeurIPS

  56. [64]

    Xinhao Mei, Chutong Meng, Haohe Liu, Qiuqiang Kong, Tom Ko, Chengqi Zhao, Mark D Plumbley, Yuexian Zou, and Wenwu Wang.

  57. [65]

    MusicLM: Generating Music From Text

    Andrea Agostinelli, Timo I Denk, Zalán Borsos, et al. MusicLM: Generating music from text. arXiv:2301.11325

  58. [66]

    SeungHeon Doh, Keunwoo Choi, Jongpil Lee, and Juhan Nam.

  59. [67]

    Improved automatic keyword extraction given more linguistic knowledge. In Proc. EMNLP

  60. [68]

    Fine-grained Audio-Visual Joint Representations for Multimodal Large Language Models

    Fine-grained audio-visual joint representations for multimodal large language models. arXiv:2310.05863

  61. [69]

    Yi Yang, Wen-tau Yih, and Christopher Meek.

  62. [70]

    Emanuele Bastianelli, Andrea Vanzo, Pawel Swietojanski, and Verena Rieser.

  63. [71]

    Learning features of music from scratch. In Proc. ICLR

  64. [72]

    Arsha Nagrani, Joon Son Chung, Weidi Xie, and Andrew Zisserman. Computer Speech & Language, 2019

  65. [73]

    Joris Cosentino, Manuel Pariente, Samuele Cornell, Antoine Deleforge, and Emmanuel Vincent.

  66. [74]

    Carlos Busso, Murtaza Bulut, Chi-Chun Lee, Abe Kazemzadeh, Emily Mower, Samuel Kim, Jeannette N Chang, Sungbok Lee, and Shrikanth S Narayanan. 2008

  67. [75]

    Changhan Wang, Anne Wu, and Juan Pino.

  68. [76]

    Emergent Abilities of Large Language Models

    Emergent abilities of large language models. Transactions on Machine Learning Research

  69. [77]

    Chien-yu Huang, Ke-Han Lu, Shih-Heng Wang, Chi-Yuan Hsiao, Chun-Yi Kuan, Haibin Wu, Siddhant Arora, Kai-Wei Chang, Jiatong Shi, Yifan Peng, et al.

  70. [78]

    Finetuned language models are zero-shot learners. In Proc. ICLR

  71. [79]

    Konstantinos Drossos, Samuel Lipping, and Tuomas Virtanen. Clotho:

  72. [80]

    Chris Dongjoo Kim, Byeongchang Kim, Hyunmin Lee, and Gunhee Kim.

  73. [81]

    Librispeech:

    Vassil Panayotov, Guoguo Chen, Daniel Povey, and Sanjeev Khudanpur. Librispeech:

  74. [82]

    Guoguo Chen, Shuzhou Chai, Guanbo Wang, Jiayu Du, Wei-Qiang Zhang, Chao Weng, Dan Su, Daniel Povey, Jan Trmal, Junbo Zhang, et al.

  75. [83]

    Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E. Gonzalez, Ion Stoica, and Eric P. Xing. Vicuna: An Open-Source Chatbot Impressing

  76. [84]

    Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, et al.

  77. [85]

    Visual Instruction Tuning

    Visual instruction tuning. arXiv:2304.08485

  78. [86]

    Scaling Instruction-Finetuned Language Models

    Scaling instruction-finetuned language models. arXiv:2210.11416

  79. [87]

    Training language models to follow instructions with human feedback. In Proc. NeurIPS

  80. [88]

    Instruction tuning with

    Baolin Peng, Chunyuan Li, Pengcheng He, Michel Galley, and Jianfeng Gao. Instruction tuning with

Showing first 80 references.