hub Canonical reference

Qwen-Audio: Advancing Universal Audio Understanding via Unified Large-Scale Audio-Language Models

Yunfei Chu, Jin Xu, Xiaohuan Zhou, Qian Yang, Shiliang Zhang, Zhijie Yan · 2023 · eess.AS · arXiv 2311.07919

Canonical reference. 89% of citing Pith papers cite this work as background.

52 Pith papers citing it

Background 89% of classified citations

open full Pith review browse 52 citing papers arXiv PDF

abstract

Recently, instruction-following audio-language models have received broad attention for audio interaction with humans. However, the absence of pre-trained audio models capable of handling diverse audio types and tasks has hindered progress in this field. Consequently, most existing works have only been able to support a limited range of interaction capabilities. In this paper, we develop the Qwen-Audio model and address this limitation by scaling up audio-language pre-training to cover over 30 tasks and various audio types, such as human speech, natural sounds, music, and songs, to facilitate universal audio understanding abilities. However, directly co-training all tasks and datasets can lead to interference issues, as the textual labels associated with different datasets exhibit considerable variations due to differences in task focus, language, granularity of annotation, and text structure. To overcome the one-to-many interference, we carefully design a multi-task training framework by conditioning on a sequence of hierarchical tags to the decoder for encouraging knowledge sharing and avoiding interference through shared and specified tags respectively. Remarkably, Qwen-Audio achieves impressive performance across diverse benchmark tasks without requiring any task-specific fine-tuning, surpassing its counterparts. Building upon the capabilities of Qwen-Audio, we further develop Qwen-Audio-Chat, which allows for input from various audios and text inputs, enabling multi-turn dialogues and supporting various audio-central scenarios.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 8 baseline 1

citation-polarity summary

background 8 baseline 1

representative citing papers

HalluAudio: A Comprehensive Benchmark for Hallucination Detection in Large Audio-Language Models

cs.SD · 2026-04-21 · unverdicted · novelty 8.0

HalluAudio is the first large-scale benchmark spanning speech, environmental sound, and music that uses human-verified QA pairs, adversarial prompts, and mixed-audio tests to measure hallucinations in large audio-language models.

Codec-Robust Attacks on Audio LLMs

cs.SD · 2026-05-19 · unverdicted · novelty 7.0 · 2 refs

CodecAttack perturbs audio in codec latent space with multi-bitrate EoT to achieve 85.5% average ASR on Opus-compressed Audio LLMs versus under 26% for waveform baselines, with transfer to MP3 and AAC.

AffectVerse: Emotional World Models for Multimodal Affective Computing

cs.CV · 2026-05-19 · unverdicted · novelty 7.0

AffectVerse improves multimodal emotion recognition by at least 2.57% on nine benchmarks through an Emotion World Module that performs short-horizon latent affective prediction via cross-modal temporal imagination and belief aggregation.

NAACA: Training-Free NeuroAuditory Attentive Cognitive Architecture with Oscillatory Working Memory for Salience-Driven Attention Gating

cs.SD · 2026-05-13 · unverdicted · novelty 7.0

NAACA uses a neuro-inspired oscillatory working memory to gate attention in audio language models, raising AudioQwen's average precision from 53.5% to 70.6% on XD-Violence while cutting unnecessary calls.

Polyphonia: Zero-Shot Timbre Transfer in Polyphonic Music with Acoustic-Informed Attention Calibration

cs.SD · 2026-05-11 · unverdicted · novelty 7.0

Polyphonia improves zero-shot stem-specific timbre transfer in polyphonic music by 15.5% target alignment via acoustic-informed attention calibration that uses probabilistic priors to set coarse boundaries.

MIST: Multimodal Interactive Speech-based Tool-calling Conversational Assistants for Smart Homes

cs.CL · 2026-05-07 · unverdicted · novelty 7.0

MIST is a new synthetic speech-based tool-calling dataset for IoT devices that exposes performance gaps between open- and closed-weight multimodal LLMs.

VITA-QinYu: Expressive Spoken Language Model for Role-Playing and Singing

cs.CL · 2026-05-07 · unverdicted · novelty 7.0

VITA-QinYu is the first expressive end-to-end spoken language model supporting role-playing and singing alongside conversation, trained on 15.8K hours of data and outperforming prior models on expressiveness and conversational benchmarks.

AffectGPT-RL: Revealing Roles of Reinforcement Learning in Open-Vocabulary Emotion Recognition

cs.HC · 2026-05-07 · unverdicted · novelty 7.0

AffectGPT-RL applies reinforcement learning to optimize non-differentiable emotion wheel metrics in open-vocabulary multimodal emotion recognition, yielding performance gains and state-of-the-art results on basic emotion recognition benchmarks.

ONOTE: Benchmarking Omnimodal Notation Processing for Expert-level Music Intelligence

cs.SD · 2026-04-22 · unverdicted · novelty 7.0

ONOTE is a multi-format benchmark that applies a deterministic pipeline to expose a disconnect between perceptual accuracy and music-theoretic comprehension in leading omnimodal AI models.

Indic-CodecFake meets SATYAM: Towards Detecting Neural Audio Codec Synthesized Speech Deepfakes in Indic Languages

eess.AS · 2026-04-21 · unverdicted · novelty 7.0

Introduces the Indic-CodecFake dataset for Indic codec deepfakes and SATYAM, a novel hyperbolic ALM that outperforms baselines through dual-stage semantic-prosodic fusion using Bhattacharya distance.

Hijacking Large Audio-Language Models via Context-Agnostic and Imperceptible Auditory Prompt Injection

cs.CR · 2026-04-16 · unverdicted · novelty 7.0

AudioHijack generates imperceptible adversarial audio via gradient estimation, attention supervision, and reverberation blending to hijack 13 LALMs with 79-96% success on unseen contexts and real commercial agents.

Character Beyond Speech: Leveraging Role-Playing Evaluation in Audio Large Language Models via Reinforcement Learning

cs.LG · 2026-04-15 · unverdicted · novelty 7.0

RoleJudge is a multidimensional evaluation framework for speech-character alignment in audio LLMs, backed by the RoleChat dataset and multi-stage RL training with standard alignment to reduce reward issues.

HumDial-EIBench: A Human-Recorded Multi-Turn Emotional Intelligence Benchmark for Audio Language Models

eess.AS · 2026-04-13 · unverdicted · novelty 7.0

HumDial-EIBench is a new benchmark using real human dialogues to evaluate audio language models on emotional intelligence tasks including multi-turn tracking, causal reasoning, empathy generation, and acoustic-semantic conflict resolution.

Speaker Attributed Automatic Speech Recognition Using Speech Aware LLMS

eess.AS · 2026-04-13 · unverdicted · novelty 7.0

Adapting speech-aware LLMs with speaker cluster identification tags and concatenated multi-speaker data yields superior speaker-attributed ASR performance versus sequential diarization-plus-ASR pipelines.

FoleyDesigner: Immersive Stereo Foley Generation with Precise Spatio-Temporal Alignment for Film Clips

cs.CV · 2026-04-07 · unverdicted · novelty 7.0

FoleyDesigner generates spatio-temporally aligned stereo Foley audio for film clips via multi-agent analysis, diffusion models on video cues, and LLM mixing, supported by the new FilmStereo dataset.

AQUA-Bench: Beyond Finding Answers to Knowing When There Are None in Audio Question Answering

eess.AS · 2026-01-18 · unverdicted · novelty 7.0

AQUA-Bench evaluates audio QA models on three unanswerability scenarios: missing correct answers, mismatched choice sets, and questions irrelevant to the audio.

MECAT: A Multi-Experts Constructed Benchmark for Fine-Grained Audio Understanding Tasks

eess.AS · 2025-07-31 · unverdicted · novelty 7.0

MECAT is a multi-expert benchmark for audio AI offering fine-grained captions and QA pairs generated via expert models and LLM reasoning, paired with the DATE metric that combines semantic similarity and cross-sample discriminability to favor detailed outputs.

Orak: A Foundational Benchmark for Training and Evaluating LLM Agents on Diverse Video Games

cs.AI · 2025-06-04 · unverdicted · novelty 7.0

Orak is a foundational benchmark providing training data, interfaces, and evaluation tools for LLM agents across diverse video game genres.

Sat2Sound: A Unified Framework for Zero-Shot Soundscape Mapping

cs.CV · 2025-05-19 · unverdicted · novelty 7.0

A contrastive multimodal framework augments satellite-audio datasets with vision-language model sound descriptions to learn shared soundscape concepts for zero-shot retrieval and synthesis.

WorldSense: Evaluating Real-world Omnimodal Understanding for Multimodal LLMs

cs.CV · 2025-02-06 · unverdicted · novelty 7.0

WorldSense provides the first benchmark requiring synergistic audio-video-text understanding on 1,662 real-world videos and 3,172 QA pairs, where the best current multimodal LLM reaches only 65.1% accuracy.

VoiceBench: Benchmarking LLM-Based Voice Assistants

cs.CL · 2024-10-22 · unverdicted · novelty 7.0

VoiceBench is the first benchmark for multi-faceted evaluation of LLM voice assistants using real and synthetic spoken instructions with speaker, environmental, and content variations.

SpeakerLLM: A Speaker-Specialized Audio-LLM for Speaker Understanding and Verification Reasoning

cs.SD · 2026-05-14 · unverdicted · novelty 6.0

SpeakerLLM unifies speaker profiling, recording-condition understanding, and structured verification reasoning in an audio-LLM via a hierarchical tokenizer and decision traces.

When Audio-Language Models Fail to Leverage Multimodal Context for Dysarthric Speech Recognition

cs.AI · 2026-05-04 · unverdicted · novelty 6.0

Current audio-language models fail to use clinical multimodal context for dysarthric speech recognition, but context-aware LoRA fine-tuning delivers large accuracy gains on the SAP dataset.

Omni-Embed-Audio: Leveraging Multimodal LLMs for Robust Audio-Text Retrieval

cs.SD · 2026-04-20 · unverdicted · novelty 6.0

Omni-Embed-Audio uses multimodal LLMs to match CLAP on standard audio retrieval while improving text-to-text retrieval by 22% relative and hard negative discrimination by 4.3 points HNSR@10 on user-intent queries.

citing papers explorer

Showing 50 of 52 citing papers.

HalluAudio: A Comprehensive Benchmark for Hallucination Detection in Large Audio-Language Models cs.SD · 2026-04-21 · unverdicted · none · ref 5 · internal anchor
HalluAudio is the first large-scale benchmark spanning speech, environmental sound, and music that uses human-verified QA pairs, adversarial prompts, and mixed-audio tests to measure hallucinations in large audio-language models.
Codec-Robust Attacks on Audio LLMs cs.SD · 2026-05-19 · unverdicted · none · ref 63 · 2 links · internal anchor
CodecAttack perturbs audio in codec latent space with multi-bitrate EoT to achieve 85.5% average ASR on Opus-compressed Audio LLMs versus under 26% for waveform baselines, with transfer to MP3 and AAC.
AffectVerse: Emotional World Models for Multimodal Affective Computing cs.CV · 2026-05-19 · unverdicted · none · ref 5 · internal anchor
AffectVerse improves multimodal emotion recognition by at least 2.57% on nine benchmarks through an Emotion World Module that performs short-horizon latent affective prediction via cross-modal temporal imagination and belief aggregation.
NAACA: Training-Free NeuroAuditory Attentive Cognitive Architecture with Oscillatory Working Memory for Salience-Driven Attention Gating cs.SD · 2026-05-13 · unverdicted · none · ref 46 · internal anchor
NAACA uses a neuro-inspired oscillatory working memory to gate attention in audio language models, raising AudioQwen's average precision from 53.5% to 70.6% on XD-Violence while cutting unnecessary calls.
Polyphonia: Zero-Shot Timbre Transfer in Polyphonic Music with Acoustic-Informed Attention Calibration cs.SD · 2026-05-11 · unverdicted · none · ref 79 · internal anchor
Polyphonia improves zero-shot stem-specific timbre transfer in polyphonic music by 15.5% target alignment via acoustic-informed attention calibration that uses probabilistic priors to set coarse boundaries.
MIST: Multimodal Interactive Speech-based Tool-calling Conversational Assistants for Smart Homes cs.CL · 2026-05-07 · unverdicted · none · ref 52 · internal anchor
MIST is a new synthetic speech-based tool-calling dataset for IoT devices that exposes performance gaps between open- and closed-weight multimodal LLMs.
VITA-QinYu: Expressive Spoken Language Model for Role-Playing and Singing cs.CL · 2026-05-07 · unverdicted · none · ref 60 · internal anchor
VITA-QinYu is the first expressive end-to-end spoken language model supporting role-playing and singing alongside conversation, trained on 15.8K hours of data and outperforming prior models on expressiveness and conversational benchmarks.
AffectGPT-RL: Revealing Roles of Reinforcement Learning in Open-Vocabulary Emotion Recognition cs.HC · 2026-05-07 · unverdicted · none · ref 7 · internal anchor
AffectGPT-RL applies reinforcement learning to optimize non-differentiable emotion wheel metrics in open-vocabulary multimodal emotion recognition, yielding performance gains and state-of-the-art results on basic emotion recognition benchmarks.
ONOTE: Benchmarking Omnimodal Notation Processing for Expert-level Music Intelligence cs.SD · 2026-04-22 · unverdicted · none · ref 11 · internal anchor
ONOTE is a multi-format benchmark that applies a deterministic pipeline to expose a disconnect between perceptual accuracy and music-theoretic comprehension in leading omnimodal AI models.
Indic-CodecFake meets SATYAM: Towards Detecting Neural Audio Codec Synthesized Speech Deepfakes in Indic Languages eess.AS · 2026-04-21 · unverdicted · none · ref 193 · internal anchor
Introduces the Indic-CodecFake dataset for Indic codec deepfakes and SATYAM, a novel hyperbolic ALM that outperforms baselines through dual-stage semantic-prosodic fusion using Bhattacharya distance.
Hijacking Large Audio-Language Models via Context-Agnostic and Imperceptible Auditory Prompt Injection cs.CR · 2026-04-16 · unverdicted · none · ref 37 · internal anchor
AudioHijack generates imperceptible adversarial audio via gradient estimation, attention supervision, and reverberation blending to hijack 13 LALMs with 79-96% success on unseen contexts and real commercial agents.
Character Beyond Speech: Leveraging Role-Playing Evaluation in Audio Large Language Models via Reinforcement Learning cs.LG · 2026-04-15 · unverdicted · none · ref 8 · internal anchor
RoleJudge is a multidimensional evaluation framework for speech-character alignment in audio LLMs, backed by the RoleChat dataset and multi-stage RL training with standard alignment to reduce reward issues.
HumDial-EIBench: A Human-Recorded Multi-Turn Emotional Intelligence Benchmark for Audio Language Models eess.AS · 2026-04-13 · unverdicted · none · ref 9 · internal anchor
HumDial-EIBench is a new benchmark using real human dialogues to evaluate audio language models on emotional intelligence tasks including multi-turn tracking, causal reasoning, empathy generation, and acoustic-semantic conflict resolution.
Speaker Attributed Automatic Speech Recognition Using Speech Aware LLMS eess.AS · 2026-04-13 · unverdicted · none · ref 14 · internal anchor
Adapting speech-aware LLMs with speaker cluster identification tags and concatenated multi-speaker data yields superior speaker-attributed ASR performance versus sequential diarization-plus-ASR pipelines.
FoleyDesigner: Immersive Stereo Foley Generation with Precise Spatio-Temporal Alignment for Film Clips cs.CV · 2026-04-07 · unverdicted · none · ref 5 · internal anchor
FoleyDesigner generates spatio-temporally aligned stereo Foley audio for film clips via multi-agent analysis, diffusion models on video cues, and LLM mixing, supported by the new FilmStereo dataset.
AQUA-Bench: Beyond Finding Answers to Knowing When There Are None in Audio Question Answering eess.AS · 2026-01-18 · unverdicted · none · ref 8 · internal anchor
AQUA-Bench evaluates audio QA models on three unanswerability scenarios: missing correct answers, mismatched choice sets, and questions irrelevant to the audio.
MECAT: A Multi-Experts Constructed Benchmark for Fine-Grained Audio Understanding Tasks eess.AS · 2025-07-31 · unverdicted · none · ref 23 · internal anchor
MECAT is a multi-expert benchmark for audio AI offering fine-grained captions and QA pairs generated via expert models and LLM reasoning, paired with the DATE metric that combines semantic similarity and cross-sample discriminability to favor detailed outputs.
Orak: A Foundational Benchmark for Training and Evaluating LLM Agents on Diverse Video Games cs.AI · 2025-06-04 · unverdicted · none · ref 93 · internal anchor
Orak is a foundational benchmark providing training data, interfaces, and evaluation tools for LLM agents across diverse video game genres.
Sat2Sound: A Unified Framework for Zero-Shot Soundscape Mapping cs.CV · 2025-05-19 · unverdicted · none · ref 6 · internal anchor
A contrastive multimodal framework augments satellite-audio datasets with vision-language model sound descriptions to learn shared soundscape concepts for zero-shot retrieval and synthesis.
WorldSense: Evaluating Real-world Omnimodal Understanding for Multimodal LLMs cs.CV · 2025-02-06 · unverdicted · none · ref 9 · internal anchor
WorldSense provides the first benchmark requiring synergistic audio-video-text understanding on 1,662 real-world videos and 3,172 QA pairs, where the best current multimodal LLM reaches only 65.1% accuracy.
VoiceBench: Benchmarking LLM-Based Voice Assistants cs.CL · 2024-10-22 · unverdicted · none · ref 63 · internal anchor
VoiceBench is the first benchmark for multi-faceted evaluation of LLM voice assistants using real and synthetic spoken instructions with speaker, environmental, and content variations.
SpeakerLLM: A Speaker-Specialized Audio-LLM for Speaker Understanding and Verification Reasoning cs.SD · 2026-05-14 · unverdicted · none · ref 5 · internal anchor
SpeakerLLM unifies speaker profiling, recording-condition understanding, and structured verification reasoning in an audio-LLM via a hierarchical tokenizer and decision traces.
When Audio-Language Models Fail to Leverage Multimodal Context for Dysarthric Speech Recognition cs.AI · 2026-05-04 · unverdicted · none · ref 2 · internal anchor
Current audio-language models fail to use clinical multimodal context for dysarthric speech recognition, but context-aware LoRA fine-tuning delivers large accuracy gains on the SAP dataset.
Omni-Embed-Audio: Leveraging Multimodal LLMs for Robust Audio-Text Retrieval cs.SD · 2026-04-20 · unverdicted · none · ref 40 · internal anchor
Omni-Embed-Audio uses multimodal LLMs to match CLAP on standard audio retrieval while improving text-to-text retrieval by 22% relative and hard negative discrimination by 4.3 points HNSR@10 on user-intent queries.
SpotSound: Enhancing Large Audio-Language Models with Fine-Grained Temporal Grounding cs.SD · 2026-04-14 · unverdicted · none · ref 7 · internal anchor
SpotSound adds a hallucination-suppressing objective and a needle-in-haystack benchmark to audio-language models, reaching state-of-the-art temporal grounding while keeping general task performance.
TW-Sound580K: A Regional Audio-Text Dataset with Verification-Guided Curation for Localized Audio-Language Modeling cs.SD · 2026-03-05 · unverdicted · none · ref 16 · internal anchor
TW-Sound580K dataset plus Tai-LALM model with dynamic Dual-ASR arbitration lifts localized Taiwanese audio-language accuracy to 49.1% on the TAU benchmark.
MCAT: Scaling Many-to-Many Speech-to-Text Translation with MLLMs to 70 Languages cs.CL · 2025-12-01 · conditional · none · ref 4 · internal anchor
MCAT scales MLLMs to many-to-many speech translation across 70 languages via curriculum learning and a 30-token speech adapter, surpassing prior SOTA on FLEURS while improving speed.
Investigating Modality Contribution in Audio LLMs for Music cs.LG · 2025-09-25 · unverdicted · none · ref 11 · internal anchor
Adapts MM-SHAP to quantify modality contributions in two Audio LLMs on MuChoMusic, showing text dominance alongside limited audio localization of key events.
Qwen3-Omni Technical Report cs.CL · 2025-09-22 · unverdicted · none · ref 4 · internal anchor
Qwen3-Omni is a unified multimodal model that achieves open-source SOTA on 32 of 36 audio and audio-visual benchmarks and overall SOTA on 22 without degrading performance on text, image, or video relative to single-modal Qwen counterparts.
GraphMend: Code Transformations for Fixing Graph Breaks in PyTorch 2 cs.PL · 2025-09-17 · conditional · none · ref 21 · internal anchor
GraphMend uses two Jaseci-based code transformations to eliminate dynamic-control-flow and side-effect graph breaks in PyTorch 2, reducing breaks to zero in six of eight Hugging Face models and yielding up to 75% latency reduction on RTX 3090 and A40 GPUs.
Step-Audio 2 Technical Report cs.CL · 2025-07-22 · unverdicted · none · ref 12 · internal anchor
Step-Audio 2 integrates a latent audio encoder, reasoning-centric reinforcement learning, and discrete audio token generation into language modeling to deliver state-of-the-art performance on audio understanding and conversational benchmarks.
VLA-RL: Towards Masterful and General Robotic Manipulation with Scalable Reinforcement Learning cs.RO · 2025-05-24 · conditional · none · ref 14 · internal anchor
VLA-RL applies online RL to pretrained VLAs, yielding a 4.5% gain over strong baselines on 40 LIBERO manipulation tasks and matching commercial models like π₀-FAST.
LLaDA-V: Large Language Diffusion Models with Visual Instruction Tuning cs.LG · 2025-05-22 · conditional · none · ref 8 · internal anchor
LLaDA-V is a diffusion-based multimodal large language model that reaches competitive or state-of-the-art results on visual instruction tasks while using a non-autoregressive architecture.
GLM-4-Voice: Towards Intelligent and Human-Like End-to-End Spoken Chatbot cs.CL · 2024-12-03 · conditional · none · ref 10 · internal anchor
GLM-4-Voice builds an end-to-end spoken chatbot by deriving a 175bps single-codebook tokenizer from ASR, synthesizing interleaved speech-text data, and continuing pre-training of GLM-4-9B on up to 1 trillion tokens before fine-tuning on conversational speech.
Academic Text-to-Music Grand Challenge: Datasets, Baselines, and Evaluation Methods cs.SD · 2026-05-20 · accept · none · ref 9 · internal anchor
The paper introduces the ATTM Grand Challenge with a CC-licensed instrumental subset of MTG-Jamendo, two tracks, and evaluation via FAD, CLAP, and a new Concept Coverage Score to support academic text-to-music research.
Heterogeneity-Aware Dataset Scheduling for Efficient Audio Large Language Model Training cs.SD · 2026-05-18 · unverdicted · none · ref 15 · internal anchor
GST uses gradient-based affinity metrics to form dataset groups and applies progressive scheduling, achieving 30-40% faster convergence than uniform mixture training on 14 AudioQA datasets while matching or exceeding performance.
A Survey of Large Audio Language Models: Generalization, Trustworthiness, and Outlook cs.SD · 2026-05-18 · unverdicted · none · ref 15 · internal anchor
A survey of Large Audio Language Models that establishes a taxonomy of trustworthiness vulnerabilities and proposes a Defense-in-Depth roadmap for audio intelligence.
Safety Geometry Collapse in Multimodal LLMs and Adaptive Drift Correction cs.AI · 2026-05-18 · unverdicted · none · ref 2 · internal anchor
Multimodal LLMs suffer Safety Geometry Collapse from modality-induced drift that reduces refusal separability; ReGap corrects drift at inference time using self-rectification signals to restore safety without retraining.
TinyMU: A Compact Audio-Language Model for Music Understanding cs.SD · 2026-04-17 · unverdicted · none · ref 7 · internal anchor
TinyMU is a 229M-parameter compact music understanding model that achieves 82% of state-of-the-art large audio-language model performance on the MuChoMusic benchmark while being 35 times smaller.
Qwen3.5-Omni Technical Report cs.CL · 2026-04-17 · unverdicted · none · ref 6 · internal anchor
Qwen3.5-Omni scales an omnimodal model to hundreds of billions of parameters with 256k context, introduces ARIA for stable speech synthesis, and reports SOTA performance on 215 audio-visual benchmarks while adding multilingual and audio-visual coding capabilities.
Towards Fine-grained Temporal Perception: Post-Training Large Audio-Language Models with Audio-Side Time Prompt cs.SD · 2026-04-15 · unverdicted · none · ref 11 · internal anchor
TimePro-RL interleaves timestamp embeddings in audio sequences and applies RL post-SFT to boost temporal alignment in LALMs, yielding gains on grounding, event detection, and dense captioning.
Interactive ASR: Towards Human-Like Interaction and Semantic Coherence Evaluation for Agentic Speech Recognition cs.CL · 2026-04-10 · unverdicted · none · ref 16 · internal anchor
The authors introduce LLM-based semantic judgment and an agentic interaction loop that improves semantic fidelity and enables iterative corrections in automatic speech recognition beyond traditional WER.
End-to-end Contrastive Language-Speech Pretraining Model For Long-form Spoken Question Answering cs.SD · 2025-11-12 · unverdicted · none · ref 7 · internal anchor
CLSR is an end-to-end contrastive language-speech retriever using an intermediate text-like conversion step to improve retrieval of relevant segments from long audio for spoken question answering.
Direct Simultaneous Translation Activation for Large Audio-Language Models cs.SD · 2025-09-19 · unverdicted · none · ref 14 · internal anchor
Augmenting standard offline training data with only 1% randomly truncated simultaneous examples activates real-time translation output in large audio-language models with no architecture or decoding changes.
Enhancing Speech Large Language Models through Reinforced Behavior Alignment cs.CL · 2025-08-25 · unverdicted · none · ref 13 · internal anchor
Reinforced Behavior Alignment (RBA) uses self-synthesized data from a teacher LLM and reinforcement learning to close the instruction-following gap in SpeechLMs, outperforming distillation and reaching SOTA on spoken QA and speech-to-text translation benchmarks.
Kimi-Audio Technical Report eess.AS · 2025-04-25 · unverdicted · none · ref 10 · internal anchor
Kimi-Audio is an open-source audio foundation model that achieves state-of-the-art results on speech recognition, audio understanding, question answering, and conversation after pre-training on more than 13 million hours of speech, sound, and music data.
Beyond the Cartesian Illusion: Testing Two-Stage Multi-Modal Theory of Mind under Perceptual Bottlenecks cs.AI · 2026-05-18 · unverdicted · none · ref 3 · internal anchor
MLLMs achieve only 42% accuracy on a new audio-visual task requiring second-order spatial ToM under perceptual limits, while a proposed sensory-bounded CoT outperforms egocentric and allocentric baselines.
In-Sync: Adaptation of Speech Aware Large Language Models for ASR with Word Level Timestamp Predictions eess.AS · 2026-04-14 · unverdicted · none · ref 25 · internal anchor
Lightweight training strategies allow speech-aware LLMs to output accurate word timestamps alongside ASR transcripts while also improving recognition quality across datasets.
Whisper-AuT: Domain-Adapted Audio Encoder for Efficient Audio-LLM Training cs.SD · 2026-04-12 · unverdicted · none · ref 3 · internal anchor
Whisper-AuT is a domain-adapted audio encoder obtained by fine-tuning Whisper-large-v3 on mixed speech, environmental, and music data, yielding gains of +23% on ESC-50, +5% on GTZAN, and +0.7% on Speech Commands.
Qwen2-Audio Technical Report eess.AS · 2024-07-15 · unverdicted · none · ref 6 · internal anchor
Qwen2-Audio is an open-source audio-language model that outperforms prior systems such as Gemini-1.5-pro on audio-centric instruction-following benchmarks after simplified prompt-based pre-training and expanded data.

Qwen-Audio: Advancing Universal Audio Understanding via Unified Large-Scale Audio-Language Models

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer