arxiv: 2405.08295 · v3 · pith:SJFBAEMTnew · submitted 2024-05-14 · cs.CL · cs.SD · eess.AS

SpeechVerse: A Large-scale Generalizable Audio Language Model

classification: cs.CL · cs.SD · eess.AS
keywords: tasks · model · language · models · speech · speechverse · audio · baselines

Large language models (LLMs) have shown incredible proficiency in performing tasks that require semantic understanding of natural language instructions. Recently, many works have further expanded this capability to perceive multimodal audio and text inputs, but their capabilities are often limited to specific fine-tuned tasks such as automatic speech recognition and translation. We therefore develop SpeechVerse, a robust multi-task training and curriculum learning framework that combines pre-trained speech and text foundation models via a small set of learnable parameters, while keeping the pre-trained models frozen during training. The models are instruction finetuned using continuous latent representations extracted from the speech foundation model to achieve optimal zero-shot performance on a diverse range of speech processing tasks using natural language instructions. We perform extensive benchmarking that includes comparing our model performance against traditional baselines across several datasets and tasks. Furthermore, we evaluate the model's capability for generalized instruction following by testing on out-of-domain datasets, novel prompts, and unseen tasks. Our empirical experiments reveal that our multi-task SpeechVerse model is even superior to conventional task-specific baselines on 9 out of the 11 tasks.

This paper has not been read by Pith yet.

Forward citations

Cited by 16 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. VITA-QinYu: Expressive Spoken Language Model for Role-Playing and Singing

    cs.CL 2026-05 unverdicted novelty 7.0

    VITA-QinYu is the first expressive end-to-end spoken language model supporting role-playing and singing alongside conversation, trained on 15.8K hours of data and outperforming prior models on expressiveness and conve...

  2. Reducing Prompt Sensitivity in LLM-based Speech Recognition Through Learnable Projection

    eess.AS 2026-01 conditional novelty 7.0

    A learnable prompt projector added to LLM-based ASR reduces prompt sensitivity, lowers performance variability, and beats the best fixed prompts on four datasets.

  3. PlanRAG-Audio: Planning and Retrieval Augmented Generation for Long-form Audio Understanding

    eess.AS 2026-05 unverdicted novelty 6.0

    PlanRAG-Audio introduces a planning-based retrieval-augmented generation approach that lets large audio language models handle long recordings by selectively retrieving query-relevant information rather than processin...

  4. SmolVLA: A Vision-Language-Action Model for Affordable and Efficient Robotics

    cs.LG 2025-06 unverdicted novelty 6.0

    SmolVLA is a small efficient VLA model that achieves performance comparable to 10x larger models while training on one GPU and deploying on consumer hardware via community data and chunked asynchronous action prediction.

  5. Step-Audio: Unified Understanding and Generation in Intelligent Speech Interaction

    cs.CL 2025-02 unverdicted novelty 6.0

    Step-Audio introduces a 130B-parameter unified speech-text model with open-sourced components for understanding, generation, affordable voice cloning, and dynamic control, claiming SOTA human evaluation results on a n...

  6. A Survey of Large Audio Language Models: Generalization, Trustworthiness, and Outlook

    cs.SD 2026-05 unverdicted novelty 5.0

    A survey of Large Audio Language Models that establishes a taxonomy of trustworthiness vulnerabilities and proposes a Defense-in-Depth roadmap for audio intelligence.

  7. AUHead: Realistic Emotional Talking Head Generation via Action Units Control

    cs.CV 2026-02 unverdicted novelty 5.0

    AUHead uses audio-language models to generate Action Unit sequences from speech and feeds them into a controllable diffusion model to synthesize realistic emotional talking-head videos.

  8. A Simple Method to Enhance Pre-trained Language Models with Speech Tokens for Classification

    cs.CL 2025-12 unverdicted novelty 5.0

    Lasso-selected speech tokens enhance text LLMs for multimodal classification by reducing long audio sequences to task-relevant features via self-supervised adaptation.

  9. Direct Simultaneous Translation Activation for Large Audio-Language Models

    cs.SD 2025-09 unverdicted novelty 5.0

    Augmenting standard offline training data with only 1% randomly truncated simultaneous examples activates real-time translation output in large audio-language models with no architecture or decoding changes.

  10. Enhancing Speech Large Language Models through Reinforced Behavior Alignment

    cs.CL 2025-08 unverdicted novelty 5.0

    Reinforced Behavior Alignment (RBA) uses self-synthesized data from a teacher LLM and reinforcement learning to close the instruction-following gap in SpeechLMs, outperforming distillation and reaching SOTA on spoken ...

  11. Qwen2.5-Omni Technical Report

    cs.CL 2025-03 conditional novelty 5.0

    Qwen2.5-Omni presents a multimodal model with block-wise encoders, TMRoPE position embeddings, and a Thinker-Talker architecture that enables simultaneous text and streaming speech generation while matching text perfo...

  12. Enhancing BEST-RQ Pseudo-Label Quality through Online Refinement for Automatic Speech Recognition

    cs.SD 2026-06 unverdicted novelty 4.0

    Three modifications to BEST-RQ quantization (PCA projection, iterative codebook refinement, codebook distillation) reduce WER from 10.1% to 8.8% on LibriSpeech test-other.

  13. PlanRAG-Audio: Planning and Retrieval Augmented Generation for Long-form Audio Understanding

    eess.AS 2026-05 unverdicted novelty 4.0

    PlanRAG-Audio introduces planning-based retrieval-augmented generation to improve accuracy and stability of long-form audio understanding in LALMs by decoupling model input from raw audio duration.

  14. Afrispeech Semantics: Evaluating Audio Semantic Reasoning in Spoken Language Models Across Domains and Accents

    cs.CL 2026-05 unverdicted novelty 4.0

    Audio language models are benchmarked on five semantic and paralinguistic reasoning tasks to reveal limitations in handling spoken audio evidence, accent variation, and domain shifts.

  15. Qwen2-Audio Technical Report

    eess.AS 2024-07 unverdicted novelty 4.0

    Qwen2-Audio is an open-source audio-language model that outperforms prior systems such as Gemini-1.5-pro on audio-centric instruction-following benchmarks after simplified prompt-based pre-training and expanded data.

  16. On The Landscape of Spoken Language Models: A Comprehensive Survey

    cs.CL 2025-04 unverdicted novelty 3.0

    A literature survey that organizes spoken language models by architecture, training, and evaluation choices and identifies key challenges and future directions.