TraceAV-Bench is the first benchmark for multi-hop trajectory reasoning over long audio-visual videos, showing top models reach only 51-68% accuracy with substantial room for improvement.
super hub Canonical reference
Qwen2-Audio Technical Report
Canonical reference. 75% of citing Pith papers cite this work as background.
abstract
We introduce the latest progress of Qwen-Audio, a large-scale audio-language model called Qwen2-Audio, which is capable of accepting various audio signal inputs and performing audio analysis or direct textual responses with regard to speech instructions. In contrast to complex hierarchical tags, we have simplified the pre-training process by utilizing natural language prompts for different data and tasks, and have further expanded the data volume. We have boosted the instruction-following capability of Qwen2-Audio and implemented two distinct audio interaction modes for voice chat and audio analysis. In the voice chat mode, users can freely engage in voice interactions with Qwen2-Audio without text input. In the audio analysis mode, users could provide audio and text instructions for analysis during the interaction. Note that we do not use any system prompts to switch between voice chat and audio analysis modes. Qwen2-Audio is capable of intelligently comprehending the content within audio and following voice commands to respond appropriately. For instance, in an audio segment that simultaneously contains sounds, multi-speaker conversations, and a voice command, Qwen2-Audio can directly understand the command and provide an interpretation and response to the audio. Additionally, DPO has optimized the model's performance in terms of factuality and adherence to desired behavior. According to the evaluation results from AIR-Bench, Qwen2-Audio outperformed previous SOTAs, such as Gemini-1.5-pro, in tests focused on audio-centric instruction-following capabilities. Qwen2-Audio is open-sourced with the aim of fostering the advancement of the multi-modal language community.
hub tools
citation-role summary
citation-polarity summary
claims ledger
- abstract We introduce the latest progress of Qwen-Audio, a large-scale audio-language model called Qwen2-Audio, which is capable of accepting various audio signal inputs and performing audio analysis or direct textual responses with regard to speech instructions. In contrast to complex hierarchical tags, we have simplified the pre-training process by utilizing natural language prompts for different data and tasks, and have further expanded the data volume. We have boosted the instruction-following capability of Qwen2-Audio and implemented two distinct audio interaction modes for voice chat and audio an
authors
co-cited works
representative citing papers
ReasonAudio benchmark reveals that state-of-the-art text-audio retrieval models struggle with reasoning tasks like negation and duration, and multimodal LLMs lose reasoning ability after contrastive fine-tuning.
HalluAudio is the first large-scale benchmark spanning speech, environmental sound, and music that uses human-verified QA pairs, adversarial prompts, and mixed-audio tests to measure hallucinations in large audio-language models.
DialBGM is a new benchmark dataset revealing that existing AI models fall far short of human performance when recommending fitting background music for open-domain conversations.
PolySpeech-100 is a new benchmark for native-level speech comprehension across 110 linguistic variants that evaluates 22 models and reports E2E advantages on dialects, robustness gaps on low-resource languages, and degradation from Chain-of-Thought prompting.
MusTBENCH evaluates temporal grounding in large audio-language models via five expert-validated tasks, and MusT improves performance through encoder adaptation, LLM adaptation, supervised fine-tuning, and RL optimization.
CodecAttack perturbs audio in codec latent space with multi-bitrate EoT to achieve 85.5% average ASR on Opus-compressed Audio LLMs versus under 26% for waveform baselines, with transfer to MP3 and AAC.
ToxiAlert-Bench dataset and dual-head neural network detect toxic speech by distinguishing textual versus paralinguistic sources, reporting 21.1% Macro-F1 and 13% accuracy gains over baselines.
SpurAudio benchmark shows state-of-the-art few-shot audio classifiers suffer large performance drops when background correlations are disrupted, even in large pretrained models.
NAACA uses a neuro-inspired oscillatory working memory to gate attention in audio language models, raising AudioQwen's average precision from 53.5% to 70.6% on XD-Violence while cutting unnecessary calls.
Channel fusion gives better semantic grounding and QA performance in full-duplex LLM dialogue but is vulnerable to context corruption during interruptions, while cross-attention routing is more robust at the cost of weaker integration.
MIST is a new synthetic speech-based tool-calling dataset for IoT devices that exposes performance gaps between open- and closed-weight multimodal LLMs.
VITA-QinYu is the first expressive end-to-end spoken language model supporting role-playing and singing alongside conversation, trained on 15.8K hours of data and outperforming prior models on expressiveness and conversational benchmarks.
Introduces the Indic-CodecFake dataset for Indic codec deepfakes and SATYAM, a novel hyperbolic ALM that outperforms baselines through dual-stage semantic-prosodic fusion using Bhattacharya distance.
AudioHijack generates imperceptible adversarial audio via gradient estimation, attention supervision, and reverberation blending to hijack 13 LALMs with 79-96% success on unseen contexts and real commercial agents.
RoleJudge is a multidimensional evaluation framework for speech-character alignment in audio LLMs, backed by the RoleChat dataset and multi-stage RL training with standard alignment to reduce reward issues.
HumDial-EIBench is a new benchmark using real human dialogues to evaluate audio language models on emotional intelligence tasks including multi-turn tracking, causal reasoning, empathy generation, and acoustic-semantic conflict resolution.
Ti-Audio is the first multi-dialectal end-to-end Speech-LLM for Tibetan that achieves state-of-the-art performance on ASR and speech translation benchmarks via a Dynamic Q-Former Adapter and cross-dialect cooperation.
Introduces UMUI task for fine-grained multimodal probabilistic inference and CLUE calibration method, where a 3B model matches larger baselines.
Jamendo-MT-QA is a new dataset and benchmark for multi-track comparative music question answering, constructed via an LLM-assisted pipeline from Creative Commons Jamendo tracks and used to evaluate audio-language models.
Speaker-Reasoner is an end-to-end speech LLM that iteratively analyzes audio structure, predicts temporal boundaries, and jointly models speaker identity, gender, timestamps, and transcription using a speaker-aware cache for long audio.
KoALa-Bench is a new public benchmark with six tasks that tests Korean speech recognition, translation, question answering, instruction following, and faithfulness in large audio language models.
Human preferences for the same semantic content show near-chance agreement between text and audio, with audio raters using narrower decision thresholds, less length bias, and more user-oriented criteria.
A learnable prompt projector added to LLM-based ASR reduces prompt sensitivity, lowers performance variability, and beats the best fixed prompts on four datasets.
citing papers explorer
-
TraceAV-Bench: Benchmarking Multi-Hop Trajectory Reasoning over Long Audio-Visual Videos
TraceAV-Bench is the first benchmark for multi-hop trajectory reasoning over long audio-visual videos, showing top models reach only 51-68% accuracy with substantial room for improvement.
-
ReasonAudio: A Benchmark for Evaluating Reasoning Beyond Matching in Text-Audio Retrieval
ReasonAudio benchmark reveals that state-of-the-art text-audio retrieval models struggle with reasoning tasks like negation and duration, and multimodal LLMs lose reasoning ability after contrastive fine-tuning.
-
HalluAudio: A Comprehensive Benchmark for Hallucination Detection in Large Audio-Language Models
HalluAudio is the first large-scale benchmark spanning speech, environmental sound, and music that uses human-verified QA pairs, adversarial prompts, and mixed-audio tests to measure hallucinations in large audio-language models.
-
DialBGM: A Benchmark for Background Music Recommendation from Everyday Multi-Turn Dialogues
DialBGM is a new benchmark dataset revealing that existing AI models fall far short of human performance when recommending fitting background music for open-domain conversations.
-
PolySpeech-100: A Large-Scale Benchmark for Speech Understanding Across 100+ Languages and Dialects
PolySpeech-100 is a new benchmark for native-level speech comprehension across 110 linguistic variants that evaluates 22 models and reports E2E advantages on dialects, robustness gaps on low-resource languages, and degradation from Chain-of-Thought prompting.
-
MusTBENCH: Benchmarking and Advancing Temporal Grounding in Music LLMs
MusTBENCH evaluates temporal grounding in large audio-language models via five expert-validated tasks, and MusT improves performance through encoder adaptation, LLM adaptation, supervised fine-tuning, and RL optimization.
-
Codec-Robust Attacks on Audio LLMs
CodecAttack perturbs audio in codec latent space with multi-bitrate EoT to achieve 85.5% average ASR on Opus-compressed Audio LLMs versus under 26% for waveform baselines, with transfer to MP3 and AAC.
-
Beyond Content: A Comprehensive Speech Toxicity Dataset and Detection Framework Incorporating Paralinguistic Cues
ToxiAlert-Bench dataset and dual-head neural network detect toxic speech by distinguishing textual versus paralinguistic sources, reporting 21.1% Macro-F1 and 13% accuracy gains over baselines.
-
SpurAudio: A Benchmark for Studying Shortcut Learning in Few-Shot Audio Classification
SpurAudio benchmark shows state-of-the-art few-shot audio classifiers suffer large performance drops when background correlations are disrupted, even in large pretrained models.
-
NAACA: Training-Free NeuroAuditory Attentive Cognitive Architecture with Oscillatory Working Memory for Salience-Driven Attention Gating
NAACA uses a neuro-inspired oscillatory working memory to gate attention in audio language models, raising AudioQwen's average precision from 53.5% to 70.6% on XD-Violence while cutting unnecessary calls.
-
How Should LLMs Listen While Speaking? A Study of User-Stream Routing in Full-Duplex Spoken Dialogue
Channel fusion gives better semantic grounding and QA performance in full-duplex LLM dialogue but is vulnerable to context corruption during interruptions, while cross-attention routing is more robust at the cost of weaker integration.
-
MIST: Multimodal Interactive Speech-based Tool-calling Conversational Assistants for Smart Homes
MIST is a new synthetic speech-based tool-calling dataset for IoT devices that exposes performance gaps between open- and closed-weight multimodal LLMs.
-
VITA-QinYu: Expressive Spoken Language Model for Role-Playing and Singing
VITA-QinYu is the first expressive end-to-end spoken language model supporting role-playing and singing alongside conversation, trained on 15.8K hours of data and outperforming prior models on expressiveness and conversational benchmarks.
-
Indic-CodecFake meets SATYAM: Towards Detecting Neural Audio Codec Synthesized Speech Deepfakes in Indic Languages
Introduces the Indic-CodecFake dataset for Indic codec deepfakes and SATYAM, a novel hyperbolic ALM that outperforms baselines through dual-stage semantic-prosodic fusion using Bhattacharya distance.
-
Hijacking Large Audio-Language Models via Context-Agnostic and Imperceptible Auditory Prompt Injection
AudioHijack generates imperceptible adversarial audio via gradient estimation, attention supervision, and reverberation blending to hijack 13 LALMs with 79-96% success on unseen contexts and real commercial agents.
-
Character Beyond Speech: Leveraging Role-Playing Evaluation in Audio Large Language Models via Reinforcement Learning
RoleJudge is a multidimensional evaluation framework for speech-character alignment in audio LLMs, backed by the RoleChat dataset and multi-stage RL training with standard alignment to reduce reward issues.
-
HumDial-EIBench: A Human-Recorded Multi-Turn Emotional Intelligence Benchmark for Audio Language Models
HumDial-EIBench is a new benchmark using real human dialogues to evaluate audio language models on emotional intelligence tasks including multi-turn tracking, causal reasoning, empathy generation, and acoustic-semantic conflict resolution.
-
Ti-Audio: The First Multi-Dialectal End-to-End Speech LLM for Tibetan
Ti-Audio is the first multi-dialectal end-to-end Speech-LLM for Tibetan that achieves state-of-the-art performance on ASR and speech translation benchmarks via a Dynamic Q-Former Adapter and cross-dialect cooperation.
-
Unified Multimodal Uncertain Inference
Introduces UMUI task for fine-grained multimodal probabilistic inference and CLUE calibration method, where a 3B model matches larger baselines.
-
Jamendo-MT-QA: A Benchmark for Multi-Track Comparative Music Question Answering
Jamendo-MT-QA is a new dataset and benchmark for multi-track comparative music question answering, constructed via an LLM-assisted pipeline from Creative Commons Jamendo tracks and used to evaluate audio-language models.
-
Speaker-Reasoner: Scaling Interaction Turns and Reasoning Patterns for Timestamped Speaker-Attributed ASR
Speaker-Reasoner is an end-to-end speech LLM that iteratively analyzes audio structure, predicts temporal boundaries, and jointly models speaker identity, gender, timestamps, and transcription using a speaker-aware cache for long audio.
-
KoALa-Bench: Evaluating Large Audio Language Models on Korean Speech Understanding and Faithfulness
KoALa-Bench is a new public benchmark with six tasks that tests Korean speech recognition, translation, question answering, instruction following, and faithfulness in large audio language models.
-
Same Words, Different Judgments: How Preferences Vary Across Modalities
Human preferences for the same semantic content show near-chance agreement between text and audio, with audio raters using narrower decision thresholds, less length bias, and more user-oriented criteria.
-
Reducing Prompt Sensitivity in LLM-based Speech Recognition Through Learnable Projection
A learnable prompt projector added to LLM-based ASR reduces prompt sensitivity, lowers performance variability, and beats the best fixed prompts on four datasets.
-
AQUA-Bench: Beyond Finding Answers to Knowing When There Are None in Audio Question Answering
AQUA-Bench evaluates audio QA models on three unanswerability scenarios: missing correct answers, mismatched choice sets, and questions irrelevant to the audio.
-
MCGA: A Multi-task Classical Chinese Literary Genre Audio Corpus
MCGA is a new 119-hour multi-task audio corpus for classical Chinese literary genres that shows current MLLMs face substantial challenges on its test set.
-
The World is Not Mono: Enabling Spatial Understanding in Large Audio-Language Models
TWNM framework equips audio-language models with spatial scene analysis via FOA simulation and metadata-grounded training, reaching 70.8% accuracy on a new ASA benchmark.
-
Hearing to Translate: The Effectiveness of Speech Modality Integration into LLMs
Cascaded systems remain the most reliable for speech translation overall, but recent SpeechLLMs match or outperform them in many conditions while standalone speech models lag.
-
ViBES: A Conversational Agent with Behaviorally-Intelligent 3D Virtual Body
ViBES introduces a speech-language-behavior model using modality-specific transformer experts that jointly generates dialogue and 3D body actions, showing gains over separate co-speech and text-to-motion baselines on multi-turn metrics.
-
SpeechLLM-as-Judges: Towards General and Interpretable Speech Quality Evaluation
SQ-LLM, trained on the new SpeechEval dataset of 32k multilingual clips with 128k annotations, enables LLMs to perform interpretable multi-task speech quality evaluation including assessment, comparison, improvement suggestions, and deepfake detection.
-
NUTSHELL: A Dataset for Abstract Generation from Scientific Talks
NUTSHELL is a new open dataset of ACL talks paired with abstracts, accompanied by baselines that demonstrate training benefits for speech-to-abstract generation while highlighting remaining challenges.
-
WorldSense: Evaluating Real-world Omnimodal Understanding for Multimodal LLMs
WorldSense provides the first benchmark requiring synergistic audio-video-text understanding on 1,662 real-world videos and 3,172 QA pairs, where the best current multimodal LLM reaches only 65.1% accuracy.
-
VoiceBench: Benchmarking LLM-Based Voice Assistants
VoiceBench is the first benchmark for multi-faceted evaluation of LLM voice assistants using real and synthetic spoken instructions with speaker, environmental, and content variations.
-
ProactiveLLM: Learning Active Interaction for Streaming Large Language Models
ProactiveLLM enables active interaction in streaming LLMs by learning semantic sufficiency cues from partial inputs through mask-based modeling and synchronized privileged self-distillation without external supervision.
-
Multimodal LLMs under Pairwise Modalities
A two-stage framework enables multimodal LLMs to learn shared latent representations from pairwise modality data and achieve cross-modal generation when incorporating new modalities.
-
Acoustic Interference: A New Paradigm Weaponizing Acoustic Latent Semantic for Universal Jailbreak against Large Audio Language Models
AIA generates universal interference audio infused with Acoustic Latent Semantics to bypass LALM safety alignment, achieving SOTA attack success rates on 10 models across five datasets.
-
Can Large Audio Language Models Ignore Multilingual Distractors? An Evaluation of Their Selective Auditory Attention Capabilities
Introduces the MUSA benchmark and evaluates LALMs showing that strong single-speaker performance fails to ensure robust selective attention under multilingual interference, with errors from source confusion and unresolved attribution after separation.
-
Mind the Pause: Disfluency-Aware Objective Tuning for Multilingual Speech Correction with LLMs
A sequence-tagger-guided LLM with contrastive objective corrects disfluencies in Hindi, Bengali, and Marathi ASR transcripts, outperforming removal-only baselines.
-
Towards Fine-Grained Multi-Dimensional Speech Understanding: Data Pipeline, Benchmark, and Model
A data pipeline, 14-dimension benchmark, and decoupled fine-tuning model are presented to advance fine-grained multi-dimensional speech understanding in LLMs.
-
ORACLE: Anticipating Scams from Partial Trajectories in Streaming App Usage
ORACLE is a new agentic framework using adaptive context consolidation and teacher-student distillation to detect emerging scam patterns from incomplete, long-horizon app usage streams across 12 scam types.
-
VocalParse: Towards Unified and Scalable Singing Voice Transcription with Large Audio Language Models
VocalParse applies interleaved and Chain-of-Thought prompting to a Large Audio Language Model to jointly transcribe lyrics, melody and word-note alignments, achieving state-of-the-art results on multiple singing datasets.
-
JASTIN: Aligning LLMs for Zero-Shot Audio and Speech Evaluation via Natural Language Instructions
JASTIN is an instruction-driven audio evaluation system that achieves state-of-the-art correlation with human ratings on speech, sound, music, and out-of-domain tasks without task-specific retraining.
-
When Audio-Language Models Fail to Leverage Multimodal Context for Dysarthric Speech Recognition
Current audio-language models fail to use clinical multimodal context for dysarthric speech recognition, but context-aware LoRA fine-tuning delivers large accuracy gains on the SAP dataset.
-
Mitigating Multimodal LLMs Hallucinations via Relevance Propagation at Inference Time
LIME reduces hallucinations in multimodal LLMs by using LRP to boost perceptual modality contributions through inference-time KV updates.
-
EmoMM: Benchmarking and Steering MLLM for Multimodal Emotion Recognition under Conflict and Missingness
EmoMM benchmark reveals Video Contribution Collapse in MLLMs for emotion recognition under modality conflict and missingness, mitigated by CHASE head-level attention steering.
-
All That Glitters Is Not Audio: Rethinking Text Priors and Audio Reliance in Audio-Language Evaluation
Audio-language models retain 60-72% of benchmark scores without audio, and most audio-dependent items can be solved from short fragments rather than full clips.
-
Beyond Acoustic Sparsity and Linguistic Bias: A Prompt-Free Paradigm for Mispronunciation Detection and Diagnosis
CROTTC-IF is a prompt-free MDD system with monotonic frame-level alignment and implicit knowledge transfer that reaches 71.77% F1 on L2-ARCTIC and 71.70% on Iqra'Eval2.
-
Hard to Be Heard: Phoneme-Level ASR Analysis of Phonologically Complex, Low-Resource Endangered Languages
Phoneme-level analysis of ASR on Archi and Rutul shows data scarcity explains recognition errors better than phonological complexity, with language-specific adaptations improving wav2vec2 performance.
-
VIBE: Voice-Induced open-ended Bias Evaluation for Large Audio-Language Models via Real-World Speech
VIBE evaluates generative biases in large audio-language models with real-world speech and open-ended tasks, showing that gender cues produce larger distributional shifts than accent cues across 11 tested models.
-
SpotSound: Enhancing Large Audio-Language Models with Fine-Grained Temporal Grounding
SpotSound adds a hallucination-suppressing objective and a needle-in-haystack benchmark to audio-language models, reaching state-of-the-art temporal grounding while keeping general task performance.