super hub Canonical reference

Qwen2-Audio Technical Report

Haojie Wei, Jin Xu, Qian Yang, Xipin Wei, Yunfei Chu, Zhifang Guo · 2024 · eess.AS · arXiv 2407.10759

Canonical reference. 76% of citing Pith papers cite this work as background.

139 Pith papers citing it

Background 76% of classified citations

open full Pith review browse 139 citing papers more from Haojie Wei arXiv PDF

abstract

We introduce the latest progress of Qwen-Audio, a large-scale audio-language model called Qwen2-Audio, which is capable of accepting various audio signal inputs and performing audio analysis or direct textual responses with regard to speech instructions. In contrast to complex hierarchical tags, we have simplified the pre-training process by utilizing natural language prompts for different data and tasks, and have further expanded the data volume. We have boosted the instruction-following capability of Qwen2-Audio and implemented two distinct audio interaction modes for voice chat and audio analysis. In the voice chat mode, users can freely engage in voice interactions with Qwen2-Audio without text input. In the audio analysis mode, users could provide audio and text instructions for analysis during the interaction. Note that we do not use any system prompts to switch between voice chat and audio analysis modes. Qwen2-Audio is capable of intelligently comprehending the content within audio and following voice commands to respond appropriately. For instance, in an audio segment that simultaneously contains sounds, multi-speaker conversations, and a voice command, Qwen2-Audio can directly understand the command and provide an interpretation and response to the audio. Additionally, DPO has optimized the model's performance in terms of factuality and adherence to desired behavior. According to the evaluation results from AIR-Bench, Qwen2-Audio outperformed previous SOTAs, such as Gemini-1.5-pro, in tests focused on audio-centric instruction-following capabilities. Qwen2-Audio is open-sourced with the aim of fostering the advancement of the multi-modal language community.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 19 baseline 5 method 1

citation-polarity summary

background 19 baseline 5 use method 1

claims ledger

abstract We introduce the latest progress of Qwen-Audio, a large-scale audio-language model called Qwen2-Audio, which is capable of accepting various audio signal inputs and performing audio analysis or direct textual responses with regard to speech instructions. In contrast to complex hierarchical tags, we have simplified the pre-training process by utilizing natural language prompts for different data and tasks, and have further expanded the data volume. We have boosted the instruction-following capability of Qwen2-Audio and implemented two distinct audio interaction modes for voice chat and audio an

authors

Haojie Wei Jin Xu Qian Yang Xipin Wei Yunfei Chu Zhifang Guo

co-cited works

representative citing papers

TraceAV-Bench: Benchmarking Multi-Hop Trajectory Reasoning over Long Audio-Visual Videos

cs.CV · 2026-05-08 · unverdicted · novelty 8.0

TraceAV-Bench is the first benchmark for multi-hop trajectory reasoning over long audio-visual videos, showing top models reach only 51-68% accuracy with substantial room for improvement.

ReasonAudio: A Benchmark for Evaluating Reasoning Beyond Matching in Text-Audio Retrieval

cs.AI · 2026-05-05 · unverdicted · novelty 8.0 · 2 refs

ReasonAudio benchmark reveals that state-of-the-art text-audio retrieval models struggle with reasoning tasks like negation and duration, and multimodal LLMs lose reasoning ability after contrastive fine-tuning.

HalluAudio: A Comprehensive Benchmark for Hallucination Detection in Large Audio-Language Models

cs.SD · 2026-04-21 · unverdicted · novelty 8.0

HalluAudio is the first large-scale benchmark spanning speech, environmental sound, and music that uses human-verified QA pairs, adversarial prompts, and mixed-audio tests to measure hallucinations in large audio-language models.

DialBGM: A Benchmark for Background Music Recommendation from Everyday Multi-Turn Dialogues

cs.AI · 2026-04-09 · unverdicted · novelty 8.0

DialBGM is a new benchmark dataset revealing that existing AI models fall far short of human performance when recommending fitting background music for open-domain conversations.

AVI-Bench: Toward Human-like Audio-Visual Intelligence of Omni-MLLMs

cs.CV · 2026-06-01 · unverdicted · novelty 7.0

AVI-Bench is a cognitively inspired benchmark that evaluates Omni-MLLMs on joint audio-visual tasks and reveals substantial limitations in current models.

SpeechEditBench: A Bilingual Multi-Attribute Benchmark for Instruction-Guided Speech Editing

eess.AS · 2026-06-01 · unverdicted · novelty 7.0

SpeechEditBench provides seven atomic editing tasks, compositional multi-operation instructions, and an anchor-based protocol yielding target success, preservation success, and joint success metrics; evaluations show no model excels across dimensions and compositional editing is especially difficult

PolySpeech-100: A Large-Scale Benchmark for Speech Understanding Across 100+ Languages and Dialects

cs.CL · 2026-05-31 · unverdicted · novelty 7.0

PolySpeech-100 is a new benchmark for native-level speech comprehension across 110 linguistic variants that evaluates 22 models and reports E2E advantages on dialects, robustness gaps on low-resource languages, and degradation from Chain-of-Thought prompting.

MusTBENCH: Benchmarking and Advancing Temporal Grounding in Music LLMs

cs.CL · 2026-05-28 · unverdicted · novelty 7.0

MusTBENCH evaluates temporal grounding in large audio-language models via five expert-validated tasks, and MusT improves performance through encoder adaptation, LLM adaptation, supervised fine-tuning, and RL optimization.

AVBench: Human-Aligned and Automated Evaluation Benchmark for Audio-Video Generative Models

cs.AI · 2026-05-23 · unverdicted · novelty 7.0

AVBench is a benchmark for human-centric AV generation evaluation featuring ten fine-grained dimensions and preference-learned evaluators that output continuous probabilistic scores from binary decisions.

DuplexSLA: A Full-Duplex Spoken Language Model with Synchronized Speech, Language, and Action

eess.AS · 2026-05-20 · unverdicted · novelty 7.0 · 2 refs

DuplexSLA introduces a three-channel full-duplex architecture that synchronizes continuous user audio, discrete assistant audio, and rate-limited textual actions inside a single backbone for native turn-taking and in-conversation tool use.

Codec-Robust Attacks on Audio LLMs

cs.SD · 2026-05-19 · unverdicted · novelty 7.0 · 2 refs

CodecAttack perturbs audio in codec latent space with multi-bitrate EoT to achieve 85.5% average ASR on Opus-compressed Audio LLMs versus under 26% for waveform baselines, with transfer to MP3 and AAC.

Beyond Content: A Comprehensive Speech Toxicity Dataset and Detection Framework Incorporating Paralinguistic Cues

cs.SD · 2026-05-15 · unverdicted · novelty 7.0

ToxiAlert-Bench dataset and dual-head neural network detect toxic speech by distinguishing textual versus paralinguistic sources, reporting 21.1% Macro-F1 and 13% accuracy gains over baselines.

SpurAudio: A Benchmark for Studying Shortcut Learning in Few-Shot Audio Classification

cs.CV · 2026-05-13 · unverdicted · novelty 7.0

SpurAudio benchmark shows state-of-the-art few-shot audio classifiers suffer large performance drops when background correlations are disrupted, even in large pretrained models.

NAACA: Training-Free NeuroAuditory Attentive Cognitive Architecture with Oscillatory Working Memory for Salience-Driven Attention Gating

cs.SD · 2026-05-13 · unverdicted · novelty 7.0

NAACA uses a neuro-inspired oscillatory working memory to gate attention in audio language models, raising AudioQwen's average precision from 53.5% to 70.6% on XD-Violence while cutting unnecessary calls.

How Should LLMs Listen While Speaking? A Study of User-Stream Routing in Full-Duplex Spoken Dialogue

cs.CL · 2026-05-11 · unverdicted · novelty 7.0

Channel fusion gives better semantic grounding and QA performance in full-duplex LLM dialogue but is vulnerable to context corruption during interruptions, while cross-attention routing is more robust at the cost of weaker integration.

MIST: Multimodal Interactive Speech-based Tool-calling Conversational Assistants for Smart Homes

cs.CL · 2026-05-07 · unverdicted · novelty 7.0

MIST is a new synthetic speech-based tool-calling dataset for IoT devices that exposes performance gaps between open- and closed-weight multimodal LLMs.

VITA-QinYu: Expressive Spoken Language Model for Role-Playing and Singing

cs.CL · 2026-05-07 · unverdicted · novelty 7.0

VITA-QinYu is the first expressive end-to-end spoken language model supporting role-playing and singing alongside conversation, trained on 15.8K hours of data and outperforming prior models on expressiveness and conversational benchmarks.

Sparse Tokens Suffice: Jailbreaking Audio Language Models via Token-Aware Gradient Optimization

cs.CR · 2026-05-06 · conditional · novelty 7.0

TAGO performs sparse jailbreak optimization on audio LMs by retaining only high-gradient-energy tokens, preserving near-full ASR at 25% retention across three models.

Indic-CodecFake meets SATYAM: Towards Detecting Neural Audio Codec Synthesized Speech Deepfakes in Indic Languages

eess.AS · 2026-04-21 · unverdicted · novelty 7.0

Introduces the Indic-CodecFake dataset for Indic codec deepfakes and SATYAM, a novel hyperbolic ALM that outperforms baselines through dual-stage semantic-prosodic fusion using Bhattacharya distance.

Hijacking Large Audio-Language Models via Context-Agnostic and Imperceptible Auditory Prompt Injection

cs.CR · 2026-04-16 · unverdicted · novelty 7.0

AudioHijack generates imperceptible adversarial audio via gradient estimation, attention supervision, and reverberation blending to hijack 13 LALMs with 79-96% success on unseen contexts and real commercial agents.

Character Beyond Speech: Leveraging Role-Playing Evaluation in Audio Large Language Models via Reinforcement Learning

cs.LG · 2026-04-15 · unverdicted · novelty 7.0

RoleJudge is a multidimensional evaluation framework for speech-character alignment in audio LLMs, backed by the RoleChat dataset and multi-stage RL training with standard alignment to reduce reward issues.

HumDial-EIBench: A Human-Recorded Multi-Turn Emotional Intelligence Benchmark for Audio Language Models

eess.AS · 2026-04-13 · unverdicted · novelty 7.0

HumDial-EIBench is a new benchmark using real human dialogues to evaluate audio language models on emotional intelligence tasks including multi-turn tracking, causal reasoning, empathy generation, and acoustic-semantic conflict resolution.

Ti-Audio: The First Multi-Dialectal End-to-End Speech LLM for Tibetan

cs.SD · 2026-04-13 · unverdicted · novelty 7.0

Ti-Audio is the first multi-dialectal end-to-end Speech-LLM for Tibetan that achieves state-of-the-art performance on ASR and speech translation benchmarks via a Dynamic Q-Former Adapter and cross-dialect cooperation.

Unified Multimodal Uncertain Inference

cs.CV · 2026-04-09 · unverdicted · novelty 7.0

Introduces UMUI task for fine-grained multimodal probabilistic inference and CLUE calibration method, where a 3B model matches larger baselines.

citing papers explorer

Showing 50 of 139 citing papers.

Rethinking Continual Learning for Speech and Audio: A Representation-Centric Taxonomy and Open Problems eess.AS · 2026-05-24 · unverdicted · none · ref 2 · internal anchor
Introduces a representation-geometry-based taxonomy for continual learning in speech and audio, identifies mismatches with current CL assumptions in foundation models, and lists open challenges.
Multimodal LLMs under Pairwise Modalities cs.CV · 2026-05-20 · unverdicted · none · ref 10 · internal anchor
A two-stage framework enables multimodal LLMs to learn shared latent representations from pairwise modality data and achieve cross-modal generation when incorporating new modalities.
Acoustic Interference: A New Paradigm Weaponizing Acoustic Latent Semantic for Universal Jailbreak against Large Audio Language Models cs.CR · 2026-05-18 · unverdicted · none · ref 10 · internal anchor
AIA generates universal interference audio infused with Acoustic Latent Semantics to bypass LALM safety alignment, achieving SOTA attack success rates on 10 models across five datasets.
Can Large Audio Language Models Ignore Multilingual Distractors? An Evaluation of Their Selective Auditory Attention Capabilities eess.AS · 2026-05-17 · unverdicted · none · ref 25 · internal anchor
Introduces the MUSA benchmark and evaluates LALMs showing that strong single-speaker performance fails to ensure robust selective attention under multilingual interference, with errors from source confusion and unresolved attribution after separation.
Mind the Pause: Disfluency-Aware Objective Tuning for Multilingual Speech Correction with LLMs cs.CL · 2026-05-12 · unverdicted · none · ref 2 · internal anchor
A sequence-tagger-guided LLM with contrastive objective corrects disfluencies in Hindi, Bengali, and Marathi ASR transcripts, outperforming removal-only baselines.
Towards Fine-Grained Multi-Dimensional Speech Understanding: Data Pipeline, Benchmark, and Model eess.AS · 2026-05-12 · unverdicted · none · ref 30 · internal anchor
A data pipeline, 14-dimension benchmark, and decoupled fine-tuning model are presented to advance fine-grained multi-dimensional speech understanding in LLMs.
EchoDistill:Alignment Noisy-to-Clean Self-Distillation for Robust Audio LLMs cs.CL · 2026-05-11 · unverdicted · none · ref 9 · internal anchor
EchoDistill applies noisy-to-clean self-distillation with GRPO to boost Audio LLM robustness, reporting 4.18% average GSR gains under strong noise.
ORACLE: Anticipating Scams from Partial Trajectories in Streaming App Usage cs.LG · 2026-05-09 · unverdicted · none · ref 7 · internal anchor
ORACLE is a new agentic framework using adaptive context consolidation and teacher-student distillation to detect emerging scam patterns from incomplete, long-horizon app usage streams across 12 scam types.
VocalParse: Towards Unified and Scalable Singing Voice Transcription with Large Audio Language Models cs.SD · 2026-05-06 · unverdicted · none · ref 3 · internal anchor
VocalParse applies interleaved and Chain-of-Thought prompting to a Large Audio Language Model to jointly transcribe lyrics, melody and word-note alignments, achieving state-of-the-art results on multiple singing datasets.
JASTIN: Aligning LLMs for Zero-Shot Audio and Speech Evaluation via Natural Language Instructions eess.AS · 2026-05-06 · unverdicted · none · ref 52 · internal anchor
JASTIN is an instruction-driven audio evaluation system that achieves state-of-the-art correlation with human ratings on speech, sound, music, and out-of-domain tasks without task-specific retraining.
When Audio-Language Models Fail to Leverage Multimodal Context for Dysarthric Speech Recognition cs.AI · 2026-05-04 · unverdicted · none · ref 13 · internal anchor
Current audio-language models fail to use clinical multimodal context for dysarthric speech recognition, but context-aware LoRA fine-tuning delivers large accuracy gains on the SAP dataset.
Mitigating Multimodal LLMs Hallucinations via Relevance Propagation at Inference Time cs.LG · 2026-05-03 · unverdicted · none · ref 8 · internal anchor
LIME reduces hallucinations in multimodal LLMs by using LRP to boost perceptual modality contributions through inference-time KV updates.
EmoMM: Benchmarking and Steering MLLM for Multimodal Emotion Recognition under Conflict and Missingness cs.CV · 2026-05-01 · unverdicted · none · ref 5 · internal anchor
EmoMM benchmark reveals Video Contribution Collapse in MLLMs for emotion recognition under modality conflict and missingness, mitigated by CHASE head-level attention steering.
All That Glitters Is Not Audio: Rethinking Text Priors and Audio Reliance in Audio-Language Evaluation cs.SD · 2026-04-27 · unverdicted · none · ref 9 · internal anchor
Audio-language models retain 60-72% of benchmark scores without audio, and most audio-dependent items can be solved from short fragments rather than full clips.
Beyond Acoustic Sparsity and Linguistic Bias: A Prompt-Free Paradigm for Mispronunciation Detection and Diagnosis eess.AS · 2026-04-24 · unverdicted · none · ref 38 · internal anchor
CROTTC-IF is a prompt-free MDD system with monotonic frame-level alignment and implicit knowledge transfer that reaches 71.77% F1 on L2-ARCTIC and 71.70% on Iqra'Eval2.
Hard to Be Heard: Phoneme-Level ASR Analysis of Phonologically Complex, Low-Resource Endangered Languages cs.CL · 2026-04-20 · conditional · none · ref 4 · internal anchor
Phoneme-level analysis of ASR on Archi and Rutul shows data scarcity explains recognition errors better than phonological complexity, with language-specific adaptations improving wav2vec2 performance.
VIBE: Voice-Induced open-ended Bias Evaluation for Large Audio-Language Models via Real-World Speech eess.AS · 2026-04-19 · unverdicted · none · ref 40 · internal anchor
VIBE evaluates generative biases in large audio-language models with real-world speech and open-ended tasks, showing that gender cues produce larger distributional shifts than accent cues across 11 tested models.
SpotSound: Enhancing Large Audio-Language Models with Fine-Grained Temporal Grounding cs.SD · 2026-04-14 · unverdicted · none · ref 6 · internal anchor
SpotSound adds a hallucination-suppressing objective and a needle-in-haystack benchmark to audio-language models, reaching state-of-the-art temporal grounding while keeping general task performance.
Why Your Tokenizer Fails in Information Fusion: A Timing-Aware Pre-Quantization Fusion for Video-Enhanced Audio Tokenization eess.AS · 2026-04-13 · unverdicted · none · ref 41 · internal anchor
A timing-aware pre-quantization fusion approach integrates visual cues into audio tokenizers along the temporal axis, maintaining reconstruction quality while outperforming audio-only and prior multimodal baselines on downstream tasks.
LaDA-Band: Language Diffusion Models for Vocal-to-Accompaniment Generation cs.SD · 2026-04-13 · unverdicted · none · ref 7 · internal anchor
LaDA-Band applies discrete masked diffusion with dual-track conditioning and progressive training to generate vocal-to-accompaniment tracks that improve acoustic authenticity, global coherence, and dynamic orchestration over prior baselines.
ASPIRin: Action Space Projection for Interactivity-Optimized Reinforcement Learning in Full-Duplex Speech Language Models cs.CL · 2026-04-11 · unverdicted · none · ref 46 · internal anchor
ASPIRin decouples speaking timing from token content via binary action space projection and applies GRPO with rule-based rewards to optimize interactivity in SLMs without semantic collapse or repetition.
GRM: Utility-Aware Jailbreak Attacks on Audio LLMs via Gradient-Ratio Masking cs.SD · 2026-04-10 · unverdicted · none · ref 5 · internal anchor
GRM ranks Mel bands by attack contribution versus utility sensitivity, perturbs a subset, and learns a universal perturbation to reach 88.46% average jailbreak success rate with improved attack-utility trade-off on four audio LLMs.
Noise-Aware In-Context Learning for Hallucination Mitigation in ALLMs cs.SD · 2026-04-10 · unverdicted · none · ref 26 · internal anchor
NAICL reduces hallucination rates in ALLMs from 26.53% to 16.98% via noise priors in context and introduces the Clotho-1K benchmark with four hallucination types.
Rethinking Entropy Allocation in LLM-based ASR: Understanding the Dynamics between Speech Encoders and LLMs eess.AS · 2026-04-09 · unverdicted · none · ref 4 · internal anchor
A multi-stage training method for LLM-based ASR uses new entropy allocation metrics to achieve competitive benchmark performance with 2.3B parameters while mitigating hallucinations via better encoder-LLM decoupling.
MM-tau-p$^2$: Persona-Adaptive Prompting for Robust Multi-Modal Agent Evaluation in Dual-Control Settings cs.ET · 2026-03-10 · unverdicted · none · ref 2 · internal anchor
MM-tau-p² is a new benchmark with 12 metrics that measures how well multi-modal agents adapt to user personas and maintain robustness in dual-control interactions.
TW-Sound580K: A Regional Audio-Text Dataset with Verification-Guided Curation for Localized Audio-Language Modeling cs.SD · 2026-03-05 · unverdicted · none · ref 44 · internal anchor
TW-Sound580K dataset plus Tai-LALM model with dynamic Dual-ASR arbitration lifts localized Taiwanese audio-language accuracy to 49.1% on the TAU benchmark.
MCAT: Scaling Many-to-Many Speech-to-Text Translation with MLLMs to 70 Languages cs.CL · 2025-12-01 · conditional · none · ref 7 · internal anchor
MCAT scales MLLMs to many-to-many speech translation across 70 languages via curriculum learning and a 30-token speech adapter, surpassing prior SOTA on FLEURS while improving speed.
HarmonicAttack: An Adaptive Cross-Domain Audio Watermark Removal cs.SD · 2025-11-26 · conditional · none · ref 2 · internal anchor
A black-box audio watermark removal attack trained on limited samples that generalizes across datasets and watermark schemes with high attack success rates.
A cross-species neural foundation model for end-to-end speech decoding cs.CL · 2025-11-21 · unverdicted · none · ref 1 · internal anchor
A cross-species pretrained neural encoder combined with end-to-end training and audio LLMs reduces word error rate in neural speech decoding from 24.69% to 10.22% while aligning attempted and imagined speech.
Benchmarking Gaslighting Attacks Against Speech Large Language Models cs.CL · 2025-09-24 · unverdicted · none · ref 5 · internal anchor
Gaslighting attacks using Anger, Cognitive Disruption, Sarcasm, Implicit, and Professional Negation strategies cause a 24.3% average accuracy drop in Speech LLMs while also triggering behavioral changes like apologies and refusals.
Qwen3-Omni Technical Report cs.CL · 2025-09-22 · unverdicted · none · ref 5 · internal anchor
Qwen3-Omni is a unified multimodal model that achieves open-source SOTA on 32 of 36 audio and audio-visual benchmarks and overall SOTA on 22 without degrading performance on text, image, or video relative to single-modal Qwen counterparts.
Training-Free Multimodal Large Language Model Orchestration cs.CL · 2025-08-06 · unverdicted · none · ref 52 · 2 links · internal anchor
LLM Orchestration integrates modality experts via an LLM controller, cross-modal memory, and interaction layer to enable multimodal input-output without gradient-based training.
Step-Audio 2 Technical Report cs.CL · 2025-07-22 · unverdicted · none · ref 13 · internal anchor
Step-Audio 2 integrates a latent audio encoder, reasoning-centric reinforcement learning, and discrete audio token generation into language modeling to deliver state-of-the-art performance on audio understanding and conversational benchmarks.
StressTest: Can YOUR Speech LM Handle the Stress? cs.CL · 2025-05-28 · conditional · none · ref 1 · internal anchor
Speech language models fail at reasoning about sentence stress but improve after fine-tuning on a new 17k-example synthetic dataset that varies stress to alter meaning.
Step-Audio: Unified Understanding and Generation in Intelligent Speech Interaction cs.CL · 2025-02-17 · unverdicted · none · ref 38 · internal anchor
Step-Audio introduces a 130B-parameter unified speech-text model with open-sourced components for understanding, generation, affordable voice cloning, and dynamic control, claiming SOTA human evaluation results on a new benchmark.
Beyond Binary Instrument QA: Probing Instrument Grounding in Music Audio-Language Models cs.SD · 2026-06-30 · unverdicted · none · ref 4 · internal anchor
Introduces an OpenMIC-derived multi-axis benchmark sequence showing that high binary instrument QA accuracy fails to predict robust grounding, with models showing position bias, confusable errors, and temporal bias.
LeVo 2: Stable and Melodious Song Generation via Hierarchical Representation Modeling and Progressive Post-Training cs.SD · 2026-06-29 · unverdicted · none · ref 53 · internal anchor
LeVo 2 presents a hierarchical LLM-Diffusion model with progressive post-training stages to generate full-length songs that balance semantic planning, track-specific acoustics, and musicality.
ALM2Vec: Learning Audio Embeddings for Universal Audio Retrieval with Large Audio-Language Models cs.SD · 2026-06-27 · unverdicted · none · ref 11 · internal anchor
ALM2Vec learns unified audio embeddings from large audio-language models for text-audio retrieval, instruction-aware retrieval, and other tasks across domains.
SALSA: Speech Aware LLM Adaptation via Learned Steering Activation Vectors cs.CL · 2026-05-30 · unverdicted · none · ref 55 · internal anchor
SALSA adapts speech-aware LLMs via supervised layer-wise steering vectors, reporting up to 46.8% relative gains over zero-shot on out-of-domain speech benchmarks.
A Unified and Reproducible Experimentation Framework for Speech Understanding eess.AS · 2026-05-29 · unverdicted · none · ref 41 · internal anchor
SURE is a new standardized framework for evaluating and training speech foundation models and Speech LLMs to improve comparability and reproducibility under realistic conditions.
Bandwidth-Efficient and Privacy-Preserving Edge-Cloud Many-to-Many Speech Translation cs.AI · 2026-05-27 · unverdicted · none · ref 2 · internal anchor
ESRT achieves SOTA many-to-many S2TT across 45 languages on FLEURS via edge-cloud split inference that compresses features 10x and a multi-task curriculum learning strategy for cross-lingual balance.
Zero-Shot Parkinson's Disease Detection from Speech: Comparing Large Audio and Language Models cs.SD · 2026-05-24 · unverdicted · none · ref 44 · internal anchor
Handcrafted acoustic features offer more stable zero-shot Parkinson's detection in low-resource languages like Bengali compared to raw audio inputs which vary by dataset.
Thinking-while-speaking: A Controlled, Interleaved Reasoning Method for Real-Time Speech Generation cs.CL · 2026-05-20 · unverdicted · none · ref 24 · internal anchor
InterRS enables real-time speech generation with interleaved reasoning via a controlled data pipeline, interleaved SFT, and RL using TA-Balance and Linguistic Quality rewards, yielding 13% gains on math and logic benchmarks.
Heterogeneity-Aware Dataset Scheduling for Efficient Audio Large Language Model Training cs.SD · 2026-05-18 · unverdicted · none · ref 16 · internal anchor
GST uses gradient-based affinity metrics to form dataset groups and applies progressive scheduling, achieving 30-40% faster convergence than uniform mixture training on 14 AudioQA datasets while matching or exceeding performance.
A Survey of Large Audio Language Models: Generalization, Trustworthiness, and Outlook cs.SD · 2026-05-18 · unverdicted · none · ref 18 · internal anchor
A survey of Large Audio Language Models that establishes a taxonomy of trustworthiness vulnerabilities and proposes a Defense-in-Depth roadmap for audio intelligence.
AgentSteerTTS: A Multi-Agent Closed-Loop Framework for Composite-Instruction Text-to-Speech cs.CV · 2026-05-14 · unverdicted · none · ref 113 · internal anchor
AgentSteerTTS proposes a multi-agent framework with adversarial disentanglement, dual-stream anchoring via acoustic prototypes, and fast-slow feedback to achieve intent-faithful expressive TTS for composite instructions.
Task-Aware Answer Preservation under Audio Compression for Large Audio Language Models eess.AS · 2026-05-07 · unverdicted · none · ref 3 · internal anchor
A statistical sign-off protocol for audio compressors ensures worst-case answer preservation across query families in LALMs.
Minimizing Modality Gap from the Input Side: Your Speech LLM Can Be a Prosody-Aware Text LLM cs.CL · 2026-05-07 · unverdicted · none · ref 62 · 2 links · internal anchor
TextPro-SLM reduces the speech-text modality gap by feeding an LLM backbone with synchronized text tokens and prosody embeddings from WhisperPro, achieving lowest gap scores at 3B/7B scales with roughly 1,000 hours of audio.
AUDITA: A New Dataset to Audit Humans vs. AI Skill at Audio QA cs.CL · 2026-04-23 · unverdicted · none · ref 3 · internal anchor
AUDITA is a challenging audio QA benchmark where humans score 32% accuracy on average while state-of-the-art models score below 9%, using IRT to reveal systematic model deficiencies.
Detecting Hallucinations in SpeechLLMs at Inference Time Using Attention Maps cs.CL · 2026-04-21 · unverdicted · none · ref 4 · internal anchor
Four attention metrics enable logistic regression classifiers that detect hallucinations in SpeechLLMs with up to +0.23 PR-AUC gains over baselines on ASR and translation tasks.

Qwen2-Audio Technical Report

hub tools

citation-role summary

citation-polarity summary

claims ledger

authors

co-cited works

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer