{"total":113,"items":[{"citing_arxiv_id":"2606.02739","ref_index":3,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"EntangleCodec: A Unified Discrete Audio Tokenizer via Semantic-Acoustic Entanglement","primary_cat":"cs.SD","submitted_at":"2026-06-01T18:05:18+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"EntangleCodec unifies semantic and acoustic audio tokenization via caption alignment and flow-matching decoding, reporting competitive reconstruction, +7.4% gains on MMAR understanding, and 0.6B-parameter ALMs surpassing 13B-parameter continuous baselines.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.01016","ref_index":20,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"PolySpeech-100: A Large-Scale Benchmark for Speech Understanding Across 100+ Languages and Dialects","primary_cat":"cs.CL","submitted_at":"2026-05-31T05:13:32+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"PolySpeech-100 is a new benchmark for native-level speech comprehension across 110 linguistic variants that evaluates 22 models and reports E2E advantages on dialects, robustness gaps on low-resource languages, and degradation from Chain-of-Thought prompting.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.00523","ref_index":71,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"ProactiveLLM: Learning Active Interaction for Streaming Large Language Models","primary_cat":"cs.CL","submitted_at":"2026-05-30T04:31:29+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"ProactiveLLM enables active interaction in streaming LLMs by learning semantic sufficiency cues 
from partial inputs through mask-based modeling and synchronized privileged self-distillation without external supervision.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.00460","ref_index":55,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"SALSA: Speech Aware LLM Adaptation via Learned Steering Activation Vectors","primary_cat":"cs.CL","submitted_at":"2026-05-30T00:54:53+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"SALSA adapts speech-aware LLMs via supervised layer-wise steering vectors, reporting up to 46.8% relative gains over zero-shot on out-of-domain speech benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.29300","ref_index":1,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"MusTBENCH: Benchmarking and Advancing Temporal Grounding in Music LLMs","primary_cat":"cs.CL","submitted_at":"2026-05-28T03:28:16+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"MusTBENCH evaluates temporal grounding in large audio-language models via five expert-validated tasks, and MusT improves performance through encoder adaptation, LLM adaptation, supervised fine-tuning, and RL optimization.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.28642","ref_index":2,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Bandwidth-Efficient and Privacy-Preserving Edge-Cloud Many-to-Many Speech Translation","primary_cat":"cs.AI","submitted_at":"2026-05-27T15:47:33+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"ESRT achieves SOTA many-to-many S2TT across 45 languages on FLEURS 
via edge-cloud split inference that compresses features 10x and a multi-task curriculum learning strategy for cross-lingual balance.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.28480","ref_index":6,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Audio-Mind: An Auditable Agentic Framework for Audio Understanding","primary_cat":"eess.AS","submitted_at":"2026-05-27T13:39:14+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"Audio-Mind introduces a conditional, auditable agentic framework for audio understanding that preserves frontend judgment and acquires bounded external evidence only when needed, reporting 80.4% on MMAR and 82.8% on MSU-Bench.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.21143","ref_index":62,"ref_count":2,"confidence":0.98,"is_internal_anchor":true,"paper_title":"CoarseSoundNet: Building a reliable model for ecological soundscape analysis","primary_cat":"cs.SD","submitted_at":"2026-05-20T13:18:11+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"The paper introduces CoarseSoundNet, a deep learning model for classifying biophony, geophony, and anthropophony in passive acoustic monitoring recordings, reporting performance gains from additional similar data, a silence class, and decision thresholds, plus a case study on acoustic index trends.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.21059","ref_index":10,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Multimodal LLMs under Pairwise 
Modalities","primary_cat":"cs.CV","submitted_at":"2026-05-20T11:44:01+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A two-stage framework enables multimodal LLMs to learn shared latent representations from pairwise modality data and achieve cross-modal generation when incorporating new modalities.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.20946","ref_index":24,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Thinking-while-speaking: A Controlled, Interleaved Reasoning Method for Real-Time Speech Generation","primary_cat":"cs.CL","submitted_at":"2026-05-20T09:32:35+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"InterRS enables real-time speech generation with interleaved reasoning via a controlled data pipeline, interleaved SFT, and RL using TA-Balance and Linguistic Quality rewards, yielding 13% gains on math and logic benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.20755","ref_index":18,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"DuplexSLA: A Full-Duplex Spoken Language Model with Synchronized Speech, Language, and Action","primary_cat":"eess.AS","submitted_at":"2026-05-20T05:54:08+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.20519","ref_index":48,"ref_count":2,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Codec-Robust Attacks on Audio 
LLMs","primary_cat":"cs.SD","submitted_at":"2026-05-19T21:39:52+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"CodecAttack perturbs audio in codec latent space with multi-bitrate EoT to achieve 85.5% average ASR on Opus-compressed Audio LLMs versus under 26% for waveform baselines, with transfer to MP3 and AAC.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.20414","ref_index":5,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"PlanRAG-Audio: Planning and Retrieval Augmented Generation for Long-form Audio Understanding","primary_cat":"eess.AS","submitted_at":"2026-05-19T19:10:30+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.19101","ref_index":16,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Heterogeneity-Aware Dataset Scheduling for Efficient Audio Large Language Model Training","primary_cat":"cs.SD","submitted_at":"2026-05-18T20:41:08+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"GST uses gradient-based affinity metrics to form dataset groups and applies progressive scheduling, achieving 30-40% faster convergence than uniform mixture training on 14 AudioQA datasets while matching or exceeding performance.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.20266","ref_index":18,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"A Survey of Large Audio Language Models: Generalization, Trustworthiness, and 
Outlook","primary_cat":"cs.SD","submitted_at":"2026-05-18T20:21:32+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"A survey of Large Audio Language Models that establishes a taxonomy of trustworthiness vulnerabilities and proposes a Defense-in-Depth roadmap for audio intelligence.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"underpinned by their architectural design and the transition from task-specific cascaded systems toward unified, end-to- end multimodal frameworks [17], [42]. Unlike traditional systems characterized by modular decoupling, contempo- rary architectures employ a sophisticated pipeline designed to map continuous, non-stationary auditory signals into structured semantic latent spaces [16], [18]. 2.1 Architectural Foundations The structural integrity of LALMs is established upon a composite information processing pipeline that facilitates the translation of raw acoustic signals into semantic rep- resentations. 
This architectural framework generally inte- grates three components consisting of an acoustic encoder, an alignment projector, and a LLM backbone."},{"citing_arxiv_id":"2605.18168","ref_index":10,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Acoustic Interference: A New Paradigm Weaponizing Acoustic Latent Semantic for Universal Jailbreak against Large Audio Language Models","primary_cat":"cs.CR","submitted_at":"2026-05-18T10:10:31+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"AIA generates universal interference audio infused with Acoustic Latent Semantics to bypass LALM safety alignment, achieving SOTA attack success rates on 10 models across five datasets.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.17225","ref_index":25,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Can Large Audio Language Models Ignore Multilingual Distractors? 
An Evaluation of Their Selective Auditory Attention Capabilities","primary_cat":"eess.AS","submitted_at":"2026-05-17T02:13:58+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Introduces the MUSA benchmark and evaluates LALMs showing that strong single-speaker performance fails to ensure robust selective attention under multilingual interference, with errors from source confusion and unresolved attribution after separation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.16681","ref_index":7,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"A Survey of Advancing Audio Super-Resolution and Bandwidth Extension from Discriminative to Generative Models","primary_cat":"eess.AS","submitted_at":"2026-05-15T22:34:52+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":2.0,"formal_verification":"none","one_line_summary":"A structured survey of audio bandwidth extension that organizes the transition from deterministic discriminative DNNs to generative approaches including GANs, diffusion models, and flow-based methods.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.15984","ref_index":14,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Beyond Content: A Comprehensive Speech Toxicity Dataset and Detection Framework Incorporating Paralinguistic Cues","primary_cat":"cs.SD","submitted_at":"2026-05-15T14:17:19+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"ToxiAlert-Bench dataset and dual-head neural network detect toxic speech by distinguishing textual versus paralinguistic sources, reporting 21.1% Macro-F1 and 13% accuracy gains over 
baselines.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.17583","ref_index":113,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"AgentSteerTTS: A Multi-Agent Closed-Loop Framework for Composite-Instruction Text-to-Speech","primary_cat":"cs.CV","submitted_at":"2026-05-14T13:31:23+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"AgentSteerTTS proposes a multi-agent framework with adversarial disentanglement, dual-stream anchoring via acoustic prototypes, and fast-slow feedback to achieve intent-faithful expressive TTS for composite instructions.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.13672","ref_index":11,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"SpurAudio: A Benchmark for Studying Shortcut Learning in Few-Shot Audio Classification","primary_cat":"cs.CV","submitted_at":"2026-05-13T15:32:57+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"SpurAudio benchmark shows state-of-the-art few-shot audio classifiers suffer large performance drops when background correlations are disrupted, even in large pretrained models.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.13651","ref_index":26,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"NAACA: Training-Free NeuroAuditory Attentive Cognitive Architecture with Oscillatory Working Memory for Salience-Driven Attention Gating","primary_cat":"cs.SD","submitted_at":"2026-05-13T15:09:47+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"NAACA uses a neuro-inspired oscillatory working memory to gate attention in 
audio language models, raising AudioQwen's average precision from 53.5% to 70.6% on XD-Violence while cutting unnecessary calls.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.12242","ref_index":2,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Mind the Pause: Disfluency-Aware Objective Tuning for Multilingual Speech Correction with LLMs","primary_cat":"cs.CL","submitted_at":"2026-05-12T15:11:36+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A sequence-tagger-guided LLM with contrastive objective corrects disfluencies in Hindi, Bengali, and Marathi ASR transcripts, outperforming removal-only baselines.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.12036","ref_index":30,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Towards Fine-Grained Multi-Dimensional Speech Understanding: Data Pipeline, Benchmark, and Model","primary_cat":"eess.AS","submitted_at":"2026-05-12T12:19:33+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A data pipeline, 14-dimension benchmark, and decoupled fine-tuning model are presented to advance fine-grained multi-dimensional speech understanding in LLMs.","context_count":1,"top_context_role":"baseline","top_context_polarity":"baseline","context_text":"environments, and adaptive evaluation protocols tailored to different task paradigms. 
1) Model Selection and Deployment Setup:We comprehensively evaluate our proposed FM-Speech against 11 advanced speech LLMs, comprising eight mainstream open-source models (Audio Flamingo 3 [8], Qwen3-Omni [7], Kimi-Audio [25], Step-Audio 2 [26], Omni- Captioner [27], Mimo-Audio [28], Qwen2.5-Omni [29], and Qwen2- Audio [30]) and three representative proprietary models (Gemini 2.5 Flash, Gemini 3 Flash, and Gemini 3.1 Pro [10]). To ensure inference fairness and reproducibility, all open-source models (including FM- Speech) are deployed locally. We strictly adhere to their official configuration guidelines, hosting the inference services on compute nodes equipped with 8 NVIDIA L20 GPUs."},{"citing_arxiv_id":"2605.10199","ref_index":6,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"How Should LLMs Listen While Speaking? A Study of User-Stream Routing in Full-Duplex Spoken Dialogue","primary_cat":"cs.CL","submitted_at":"2026-05-11T08:46:47+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Channel fusion gives better semantic grounding and QA performance in full-duplex LLM dialogue but is vulnerable to context corruption during interruptions, while cross-attention routing is more robust at the cost of weaker integration.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"a single interleaved sequence of utterances. This corresponds to a half-duplex communication pattern, in which only one party is effectively transmitting information at a time. Many recent LLM-based spoken dialogue systems follow this setup: the model consumes the user's complete speech utterance as input and then autoregressively generates a spoken response [6, 56, 10, 11, 25, 21, 43]. ∗Work done during an internship at SenseTime Research †Corresponding author 3https://light1726.github.io/duplex-demo/ Preprint. 
arXiv:2605.10199v1 [cs.CL] 11 May 2026 However, real-world spoken dialogue is often not strictly turn-based [40]. Speakers may interrupt one another or produce short backchannels while the other party is still speaking."},{"citing_arxiv_id":"2605.16363","ref_index":7,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"ORACLE: Anticipating Scams from Partial Trajectories in Streaming App Usage","primary_cat":"cs.LG","submitted_at":"2026-05-09T16:26:16+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"ORACLE is a new agentic framework using adaptive context consolidation and teacher-student distillation to detect emerging scam patterns from incomplete, long-horizon app usage streams across 12 scam types.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.07593","ref_index":45,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"TraceAV-Bench: Benchmarking Multi-Hop Trajectory Reasoning over Long Audio-Visual Videos","primary_cat":"cs.CV","submitted_at":"2026-05-08T11:06:43+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":8.0,"formal_verification":"none","one_line_summary":"TraceAV-Bench is the first benchmark for multi-hop trajectory reasoning over long audio-visual videos, showing top models reach only 51-68% accuracy with substantial room for improvement.","context_count":1,"top_context_role":"baseline","top_context_polarity":"baseline","context_text":"We benchmark models in four categories:1) Closed-source OmniLLMs: Gemini 2, Gemini 2.5 [ 17] and Gemini 3;2) Open-source OmniLLMs: Qwen3-Omni [ 54], Qwen2.5-Omni [55], OmniVinci [53], MiniCPM-o [84], HumanOmni [13], Video-SALMONN 2 [15], Baichuan-Omni-1.5 [16], VideoLLaMA2.1 [85], Ming-Flash-Omni [12], and Gemma 4;3) Single- modality MLLMs: video-only Qwen3-VL [ 1] and audio-only Qwen2-Audio [45];4) Visual-only 
ablations: Ming-Flash-Omni and Qwen3-Omni with the audio stream removed. Evaluation Protocol.For open-source models, we follow official inference configurations and sample as many frames as permitted by the context window to maximize performance. For closed-source models, we use the recommended sampling rate (1 frame per second). A response is correct only if"},{"citing_arxiv_id":"2605.06897","ref_index":53,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"MIST: Multimodal Interactive Speech-based Tool-calling Conversational Assistants for Smart Homes","primary_cat":"cs.CL","submitted_at":"2026-05-07T19:57:39+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"MIST is a new synthetic speech-based tool-calling dataset for IoT devices that exposes performance gaps between open- and closed-weight multimodal LLMs.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.06765","ref_index":56,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"VITA-QinYu: Expressive Spoken Language Model for Role-Playing and Singing","primary_cat":"cs.CL","submitted_at":"2026-05-07T17:59:56+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"VITA-QinYu is the first expressive end-to-end spoken language model supporting role-playing and singing alongside conversation, trained on 15.8K hours of data and outperforming prior models on expressiveness and conversational benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.06631","ref_index":3,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Task-Aware Answer Preservation under Audio Compression for Large Audio Language 
Models","primary_cat":"eess.AS","submitted_at":"2026-05-07T17:43:57+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"A statistical sign-off protocol for audio compressors ensures worst-case answer preservation across query families in LALMs.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.05927","ref_index":62,"ref_count":2,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Minimizing Modality Gap from the Input Side: Your Speech LLM Can Be a Prosody-Aware Text LLM","primary_cat":"cs.CL","submitted_at":"2026-05-07T09:32:05+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"TextPro-SLM reduces the speech-text modality gap by feeding an LLM backbone with synchronized text tokens and prosody embeddings from WhisperPro, achieving lowest gap scores at 3B/7B scales with roughly 1,000 hours of audio.","context_count":1,"top_context_role":"baseline","top_context_polarity":"baseline","context_text":"We compare our model with a diverse set of SLM baselines, including leading industrial systems and methods explicitly designed to reduce the modality gap. We calculate the modality gap as the performance difference between a speech model given spoken input and its backbone TLM given textual input. The compared model pairs are GLM-4-V oice [12] and GLM-4-9B [10], Qwen2-Audio [62] and Qwen-7B-Chat [63], DiV A [64] and Llama-3-8B [65], Qwen2.5-Omni [23] and Qwen2.5-7B-Instruct [66], Kimi-Audio [29] and Qwen2.5-7B-Instruct, and SALAD [11] with Qwen2.5-3B&7B-Base. As a reference point, we also include cascaded systems that combine Whisper-large-v3 with Qwen2.5-3B&7B-Instruct. 
4.2.2 Modality Gap Results Table 4: Reasoning-heavy V oxEval math performance↑"},{"citing_arxiv_id":"2605.04700","ref_index":9,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Sparse Tokens Suffice: Jailbreaking Audio Language Models via Token-Aware Gradient Optimization","primary_cat":"cs.CR","submitted_at":"2026-05-06T09:52:29+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.04613","ref_index":3,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"VocalParse: Towards Unified and Scalable Singing Voice Transcription with Large Audio Language Models","primary_cat":"cs.SD","submitted_at":"2026-05-06T08:03:31+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"VocalParse applies interleaved and Chain-of-Thought prompting to a Large Audio Language Model to jointly transcribe lyrics, melody and word-note alignments, achieving state-of-the-art results on multiple singing datasets.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"As a result, building a unified, scalable, and robust singing voice transcription system remains an open challenge. Large Audio Language Models (LALMs) offer a promising foundation for this challenge. Their strong audio-semantic modeling ability makes them attractive for jointly transcribing lyrics and melody within a single autoregressive framework [ 3, 19, 41]. However, existing singing datasets are far smaller than the data typically required to effectively adapt large audio-language models [ 26, 23], limiting their robustness and OOD generalization. To address these challenges, we present VocalParse, a unified and scalable singing voice transcription model built on top of a LALM. 
First, we introduce SingCrawl, a scalable web-based data pipeline"},{"citing_arxiv_id":"2605.04505","ref_index":52,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"JASTIN: Aligning LLMs for Zero-Shot Audio and Speech Evaluation via Natural Language Instructions","primary_cat":"eess.AS","submitted_at":"2026-05-06T05:18:42+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"JASTIN is an instruction-driven audio evaluation system that achieves state-of-the-art correlation with human ratings on speech, sound, music, and out-of-domain tasks without task-specific retraining.","context_count":1,"top_context_role":"baseline","top_context_polarity":"baseline","context_text":"categories of baselines: 1) Non-LLM metrics:We utilize AES model [ 13] with its CE, CU, PC, and PQ metrics, UTMOS [ 11], and NISQA [ 10] as our baselines. 2) General-purpose LLMs:We choose several MLLM models as baselines, including Gemini series ( Gemini-3- Pro, Gemini-2.5-Pro, and Gemini-2.5-Flash) [ 15], Qwen series (Qwen3-omni [28], Qwen2-audio [ 52]), and Nvidia's Audio Flamingo3 [53]. 7https://github.com/vivian556123/Jastin JOURNAL OF LATEX CLASS FILES, VOL. 18, NO. 9, SEPTEMBER 2020 6 TABLE I: Comparison between ourJASTINand baseline models on Speech-only Datasets. Model QualiSpeech SpeechEval Noise Dist. Cont. Listen. Nat. Ovrl. Ovrl. Int. Dist. Dyn. Emo. Art. Subj. 
Pearson Correlation (PCC↑)"},{"citing_arxiv_id":"2605.03361","ref_index":8,"ref_count":2,"confidence":0.98,"is_internal_anchor":true,"paper_title":"ReasonAudio: A Benchmark for Evaluating Reasoning Beyond Matching in Text-Audio Retrieval","primary_cat":"cs.AI","submitted_at":"2026-05-05T04:44:51+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":8.0,"formal_verification":"none","one_line_summary":"ReasonAudio benchmark reveals that state-of-the-art text-audio retrieval models struggle with reasoning tasks like negation and duration, and multimodal LLMs lose reasoning ability after contrastive fine-tuning.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.02782","ref_index":13,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"When Audio-Language Models Fail to Leverage Multimodal Context for Dysarthric Speech Recognition","primary_cat":"cs.AI","submitted_at":"2026-05-04T16:24:06+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Current audio-language models fail to use clinical multimodal context for dysarthric speech recognition, but context-aware LoRA fine-tuning delivers large accuracy gains on the SAP dataset.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.01766","ref_index":8,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Mitigating Multimodal LLMs Hallucinations via Relevance Propagation at Inference Time","primary_cat":"cs.LG","submitted_at":"2026-05-03T07:58:02+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"LIME reduces hallucinations in multimodal LLMs by using LRP to boost perceptual modality contributions through inference-time KV 
updates.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.01024","ref_index":5,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"EmoMM: Benchmarking and Steering MLLM for Multimodal Emotion Recognition under Conflict and Missingness","primary_cat":"cs.CV","submitted_at":"2026-05-01T18:35:49+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"EmoMM benchmark reveals Video Contribution Collapse in MLLMs for emotion recognition under modality conflict and missingness, mitigated by CHASE head-level attention steering.","context_count":1,"top_context_role":"method","top_context_polarity":"use_method","context_text":"ing samples where exactly one available modality disagrees with the ground truth. Then, for each modalitym∈ M, GTAR is defined as GTARm = P x∈Stc am x cx P x∈Stc amx .(4) Normalized Mean Attention Score (nMAS). Let Km denote the token set of modality m. For a decision query position q, we measure how much headhat layerℓattends to modalitymby nMASℓ,h m = 1 |Km| X k∈Km Aℓ,h q,k,(5) where Aℓ,h q,k is the (softmax-normalized) attention weight from the query token q to key token k in head h of layer ℓ (More details about metrics in Appendix C.1). 
4.2 Video Contribution Collapse (RQ1) To evaluate the multimodal integration ability of current MLLM under semantic conflict, we con- duct a differential analysis on EmoMM-Align and EmoMM-Conflict subsets."},{"citing_arxiv_id":"2604.25719","ref_index":10,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Step-Audio-R1.5 Technical Report","primary_cat":"eess.AS","submitted_at":"2026-04-28T14:44:30+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.24401","ref_index":9,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"All That Glitters Is Not Audio: Rethinking Text Priors and Audio Reliance in Audio-Language Evaluation","primary_cat":"cs.SD","submitted_at":"2026-04-27T12:25:18+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Audio-language models retain 60-72% of benchmark scores without audio, and most audio-dependent items can be solved from short fragments rather than full clips.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.23323","ref_index":1,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Robust Audio-Text Retrieval via Cross-Modal Attention and Hybrid Loss","primary_cat":"cs.CL","submitted_at":"2026-04-25T14:17:56+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"A cross-modal attention refinement module plus hybrid loss improves robustness of audio-text retrieval on noisy and long-form 
audio.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.22133","ref_index":38,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Beyond Acoustic Sparsity and Linguistic Bias: A Prompt-Free Paradigm for Mispronunciation Detection and Diagnosis","primary_cat":"eess.AS","submitted_at":"2026-04-24T00:38:00+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"CROTTC-IF is a prompt-free MDD system with monotonic frame-level alignment and implicit knowledge transfer that reaches 71.77% F1 on L2-ARCTIC and 71.70% on Iqra'Eval2.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.21766","ref_index":3,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"AUDITA: A New Dataset to Audit Humans vs. AI Skill at Audio QA","primary_cat":"cs.CL","submitted_at":"2026-04-23T15:22:36+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"AUDITA is a challenging audio QA benchmark where humans score 32% accuracy on average while state-of-the-art models score below 9%, using IRT to reveal systematic model deficiencies.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.00025","ref_index":5,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"MoDAl: Self-Supervised Neural Modality Discovery via Decorrelation for Speech 
Neuroprosthesis","primary_cat":"q-bio.NC","submitted_at":"2026-04-22T03:02:51+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.19949","ref_index":192,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Indic-CodecFake meets SATYAM: Towards Detecting Neural Audio Codec Synthesized Speech Deepfakes in Indic Languages","primary_cat":"eess.AS","submitted_at":"2026-04-21T19:54:54+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Introduces the Indic-CodecFake dataset for Indic codec deepfakes and SATYAM, a novel hyperbolic ALM that outperforms baselines through dual-stage semantic-prosodic fusion using Bhattacharya distance.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.19565","ref_index":4,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Detecting Hallucinations in SpeechLLMs at Inference Time Using Attention Maps","primary_cat":"cs.CL","submitted_at":"2026-04-21T15:18:10+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Four attention metrics enable logistic regression classifiers that detect hallucinations in SpeechLLMs with up to +0.23 PR-AUC gains over baselines on ASR and translation tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.19300","ref_index":4,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"HalluAudio: A Comprehensive Benchmark for Hallucination Detection in Large Audio-Language 
Models","primary_cat":"cs.SD","submitted_at":"2026-04-21T10:05:28+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":8.0,"formal_verification":"none","one_line_summary":"HalluAudio is the first large-scale benchmark spanning speech, environmental sound, and music that uses human-verified QA pairs, adversarial prompts, and mixed-audio tests to measure hallucinations in large audio-language models.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.18204","ref_index":4,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Hard to Be Heard: Phoneme-Level ASR Analysis of Phonologically Complex, Low-Resource Endangered Languages","primary_cat":"cs.CL","submitted_at":"2026-04-20T12:54:14+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Phoneme-level analysis of ASR on Archi and Rutul shows data scarcity explains recognition errors better than phonological complexity, with language-specific adaptations improving wav2vec2 performance.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.18187","ref_index":12,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Audio-DeepThinker: Progressive Reasoning-Aware Reinforcement Learning for High-Quality Chain-of-Thought Emergence in Audio Language Models","primary_cat":"cs.SD","submitted_at":"2026-04-20T12:43:00+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"A hybrid-reward progressive RL curriculum enables high-quality chain-of-thought to emerge in audio language models without prior supervised CoT training, yielding SOTA results on MMAR, MMAU, and MMSU benchmarks.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"struct curated 
Chain-of-Thought (CoT) annotations and fine-tune models to imitate them. While effective, these methods are fundamentally limited by the quality and diversity of human-authored reasoning, and cannot discover novel reasoning strategies beyond the training data. Reinforcement Learning (RL)-based approaches offer a more flexible alternative. Early methods such as R1-AQA [11], Omni-R1 [12], and AudioMCQ [13] apply GRPO [14] with accuracy and format rewards, demonstrating that RL can improve audio QA performance. More recent works have begun to incorporate reasoning-related signals. For example, Audio-Thinker [15] introduces adaptive rewards to guide when the model should reason, and CESAR [16] introduces a comprehensive suite to reward structured patterns and"},{"citing_arxiv_id":"2604.18159","ref_index":2,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"FreezeEmpath: Efficient Training for Empathetic Spoken Chatbots with Frozen LLMs","primary_cat":"cs.CL","submitted_at":"2026-04-20T12:22:35+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"FreezeEmpath achieves emotionally expressive speech output and strong performance on empathetic dialogue, speech emotion recognition, and spoken QA tasks by training with a frozen LLM on existing speech datasets.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}