{"total":22,"items":[{"citing_arxiv_id":"2606.01016","ref_index":83,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"PolySpeech-100: A Large-Scale Benchmark for Speech Understanding Across 100+ Languages and Dialects","primary_cat":"cs.CL","submitted_at":"2026-05-31T05:13:32+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"PolySpeech-100 is a new benchmark for native-level speech comprehension across 110 linguistic variants that evaluates 22 models and reports E2E advantages on dialects, robustness gaps on low-resource languages, and degradation from Chain-of-Thought prompting.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.27741","ref_index":25,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Escape the Language Prior: Mitigating Late-Stage Modality Collapse in Audio Reasoning via Modality-Aware Policy Optimization","primary_cat":"cs.CL","submitted_at":"2026-05-26T22:34:03+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"MAPO is a dual-branch RL framework using modality relevance masks from cross-modal differential entropy and auxiliary attention losses to reduce late-stage modality collapse in audio reasoning models and improve benchmark results.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.21008","ref_index":120,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"A Survey of Audio Reasoning in Multimodal Foundation Models","primary_cat":"eess.AS","submitted_at":"2026-05-20T10:44:56+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":2.0,"formal_verification":"none","one_line_summary":"A survey that provides a unified formulation of audio reasoning and reviews advances across 
Audio-to-Text, Audio-to-Speech, Audio-Visual, and Agentic paradigms while discussing challenges and future directions.","context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"as it requires understanding and executing semantic intent. Recent settings [113], [116], [117] further introduce content- preserving acoustic variations, such as emotion, speaking rate, or background noise. These aim not to test reasoning over acoustic cues, but to evaluate whether content-based reasoning remains stable under such perturbations. In contrast, acoustic-based reasoning [118]-[120] involves tasks where correct inference depends on information not reducible to text, including prosody, emotion, speaker traits, emphasis, hesitation, and other paralinguistic signals. Text transcripts alone are insufficient, and the model must use acoustic evidence. Thus, acoustic-based reasoning more di- rectly tests whether models leverage speech-specific infor-"},{"citing_arxiv_id":"2605.20755","ref_index":34,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"DuplexSLA: A Full-Duplex Spoken Language Model with Synchronized Speech, Language, and Action","primary_cat":"eess.AS","submitted_at":"2026-05-20T05:54:08+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.20266","ref_index":189,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"A Survey of Large Audio Language Models: Generalization, Trustworthiness, and Outlook","primary_cat":"cs.SD","submitted_at":"2026-05-18T20:21:32+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"A survey of Large Audio Language Models that establishes a taxonomy of trustworthiness vulnerabilities and proposes a 
Defense-in-Depth roadmap for audio intelligence.","context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"AudioJailbreak [168] May 2025 ASR ✗ ✗ ✗ ✗ ✗ ✗✓✗ ✗ AJailBench [169] May 2025 ASR, TS, PV , Relevance, Similarity✗ ✗ ✗ ✗ ✗ ✗✓✗ ✗ VocalAgent [188] May 2025 macro-F1, Accuracy, FPR, Refusal Rate, Goodness@0.1✓✗ ✗ ✗ ✗ ✗ ✗✓✗ AudioTrust [174] May 2025 GPT-4o Score, CM-WER, CCR, DSR, HRR, FAR, SES, Refusal Rate, Group Unfairness Score✗ ✗ ✗✓ ✓ ✓ ✓ ✓ ✓ MMSU [189] Jun 2025 Accuracy ✓ ✓✗ ✗ ✗ ✗ ✗ ✗ ✗ SOVA-Bench [190] Jun 2025 Accuracy, WER, GPTEval, LLM-as-a-Judge, UTMOSv2✓ ✓ ✓✗ ✗ ✗ ✗ ✗ ✗ WildSpeech-Bench [191] Jun 2025 LLM-as-a-Judge, UTMOS, Query-Aware Checklist✓ ✓ ✓✗ ✗ ✗ ✗✓✗ ContextASR-Bench [192] Jul 2025 WER, NE-WER, NE-FNR✓✗ ✗ ✗ ✗ ✗ ✗ ✗ ✗ C3 [193] Jul 2025 LLM-as-a-Judge, Human Evaluation Score✓ ✓ ✓✗ ✗ ✗ ✗✓✗"},{"citing_arxiv_id":"2605.12036","ref_index":23,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Towards Fine-Grained Multi-Dimensional Speech Understanding: Data Pipeline, Benchmark, and Model","primary_cat":"eess.AS","submitted_at":"2026-05-12T12:19:33+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A data pipeline, 14-dimension benchmark, and decoupled fine-tuning model are presented to advance fine-grained multi-dimensional speech understanding in LLMs.","context_count":1,"top_context_role":"baseline","top_context_polarity":"baseline","context_text":"instances surpass existing benchmarks, ensuring statistically robust evaluation. Furthermore, it introduces in-depth tasks such as micro- acoustic cue perception, and linguistic-paralinguistic integration, to enable a comprehensive, fine-grained evaluation of real-world speech across multiple dimensions. TABLE II COMPARISON OF SPEECH-RELATED BENCHMARKS. Abbr. 
AIR [4] MMAR [5] MMAU [6] MMSU [23] HPSU [24] Ours num 19k 1k 10k 5k 20k+ 24k+ GEN✓ ✗ ✓ ✓ ✓ ✓ AGE✓ ✗ ✓ ✓ ✓ ✓ ACC◦✓◦✓ ✓ ✓ PIT◦ ◦ ◦ ◦✗ ✓ SR✗ ✗◦ ◦✗ ✓ RHY✗◦ ◦ ◦✗ ✓ VT✗ ✗◦✗ ✗ ✓ EMO✓ ✓ ✓ ✓ ✓ ✓ TON◦ ◦✓ ✓◦✓ CI◦✓ ✓ ✓ ✓ ✓ BS✓ ✓◦✓ ✗ ✓ AE✓ ✓ ✓◦✗ ✓ PE✓◦ ◦✓◦✓ TPT✗ ✗ ✗ ✗ ✗ ✓ D. FM-Speech Leveraging the high-quality, fine-grained corpus generated by our data curation pipeline, we introduceFM-Speech, built upon the"},{"citing_arxiv_id":"2605.12034","ref_index":55,"ref_count":2,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Boosting Omni-Modal Language Models: Staged Post-Training with Visually Debiased Evaluation","primary_cat":"cs.MM","submitted_at":"2026-05-12T12:16:11+00:00","verdict":"UNVERDICTED","verdict_confidence":"MODERATE","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Visual debiasing of omni-modal benchmarks combined with staged post-training lets a 3B model match or exceed a 30B model without a stronger teacher.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"[53] Jeffrey Zhou, Tianjian Lu, Swaroop Mishra, Siddhartha Brahma, Sujoy Basu, Yi Luan, Denny Zhou, and Le Hou. Instruction- following evaluation for large language models, 2023. URLhttps://arxiv.org/abs/2311.07911. [54] Andy Zou, Zifan Wang, Nicholas Carlini, Milad Nasr, J. Zico Kolter, and Matt Fredrikson. Universal and transferable adversarial attacks on aligned language models, 2023. URLhttps://arxiv.org/abs/2307.15043. [55] Yiming Chen, Xianghu Yue, Chen Zhang, Xiaoxue Gao, Robby T. Tan, and Haizhou Li. V oicebench: Benchmarking llm-based voice assistants, 2024. URLhttps://arxiv.org/abs/2410.17196. [56] S Sakshi, Utkarsh Tyagi, Sonal Kumar, Ashish Seth, Ramaneswaran Selvakumar, Oriol Nieto, Ramani Duraiswami, Sreyan Ghosh, and Dinesh Manocha. 
Mmau: A massive multi-task audio understanding and reasoning benchmark, 2024."},{"citing_arxiv_id":"2605.07593","ref_index":61,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"TraceAV-Bench: Benchmarking Multi-Hop Trajectory Reasoning over Long Audio-Visual Videos","primary_cat":"cs.CV","submitted_at":"2026-05-08T11:06:43+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":8.0,"formal_verification":"none","one_line_summary":"TraceAV-Bench is the first benchmark for multi-hop trajectory reasoning over long audio-visual videos, showing top models reach only 51-68% accuracy with substantial room for improvement.","context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"[60] Bin Wang, Xunlong Zou, Geyu Lin, Shuo Sun, Zhuohan Liu, Wenyu Zhang, Zhengyuan Liu, AiTi Aw, and Nancy Chen. Audiobench: A universal benchmark for audio large language models. InProceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 4297-4316, 2025. 13 [61] Dingdong Wang, Junan Li, Jincenzi Wu, Dongchao Yang, Xueyuan Chen, Tianhua Zhang, and Helen Meng. Mmsu: A massive multi-task spoken language understanding and reasoning benchmark.arXiv preprint arXiv:2506.04779, 2025. [62] Yingzhi Wang, Pooneh Mousavi, Artem Ploujnikov, and Mirco Ravanelli. What are they doing? joint audio-speech co-reasoning. 
InICASSP 2025-2025 IEEE International Conference on"},{"citing_arxiv_id":"2605.06631","ref_index":24,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Task-Aware Answer Preservation under Audio Compression for Large Audio Language Models","primary_cat":"eess.AS","submitted_at":"2026-05-07T17:43:57+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"A statistical sign-off protocol for audio compressors ensures worst-case answer preservation across query families in LALMs.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.25719","ref_index":16,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Step-Audio-R1.5 Technical Report","primary_cat":"eess.AS","submitted_at":"2026-04-28T14:44:30+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.25591","ref_index":29,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Walking Through Uncertainty: An Empirical Study of Uncertainty Estimation for Audio-Aware Large Language Models","primary_cat":"eess.AS","submitted_at":"2026-04-28T12:56:22+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Semantic-level and verification-based uncertainty methods outperform token-level baselines for audio reasoning in ALLMs, but their relative performance on hallucination and unanswerable-question benchmarks is model- and task-dependent.","context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"We hope this study provides a foundation for future research on reliable, uncertainty-aware audio-language systems. 
Index Terms-uncertainty estimation, audio-aware LLMs I. INTRODUCTION Recent audio-aware large language models (ALLMs) [1]- [25] have rapidly advanced across a wide range of tasks, including audio understanding [26]-[32], spoken question answering [29], [33], [34], music reasoning , and general audio-language interaction [35]-[52]. By conditioning lan- guage generation on audio inputs, these models extend the capabilities of text-only LLMs to richer multimodal scenarios. However, strong performance does not necessarily imply relia- bility. In practice, ALLMs still frequently produce unsupported"},{"citing_arxiv_id":"2604.23717","ref_index":22,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"HeadRouter: Dynamic Head-Weight Routing for Task-Adaptive Audio Token Pruning in Large Audio Language Models","primary_cat":"cs.SD","submitted_at":"2026-04-26T14:00:31+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"HeadRouter prunes audio tokens more effectively by dynamically routing based on per-head importance for semantic versus acoustic tasks, exceeding baseline performance at 70% token retention on Qwen2.5-Omni models.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.20842","ref_index":27,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"SpeechParaling-Bench: A Comprehensive Benchmark for Paralinguistic-Aware Speech Generation","primary_cat":"cs.CL","submitted_at":"2026-04-22T17:59:58+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"SpeechParaling-Bench is a new evaluation framework for paralinguistic-aware speech generation that reveals major limitations in current large audio-language 
models.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.16659","ref_index":26,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Benign Fine-Tuning Breaks Safety Alignment in Audio LLMs","primary_cat":"cs.CR","submitted_at":"2026-04-17T19:28:07+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":8.0,"formal_verification":"none","one_line_summary":"Benign fine-tuning on audio data breaks safety alignment in Audio LLMs by raising jailbreak success rates up to 87%, with the dominant risk axis depending on model architecture and embedding proximity to harmful content.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.15804","ref_index":40,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Qwen3.5-Omni Technical Report","primary_cat":"cs.CL","submitted_at":"2026-04-17T08:05:46+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Qwen3.5-Omni scales an omnimodal model to hundreds of billions of parameters with 256k context, introduces ARIA for stable speech synthesis, and reports SOTA performance on 215 audio-visual benchmarks while adding multilingual and audio-visual coding capabilities.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.14548","ref_index":1,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"VoxSafeBench: Not Just What Is Said, but Who, How, and Where","primary_cat":"cs.SD","submitted_at":"2026-04-16T02:24:59+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":8.0,"formal_verification":"none","one_line_summary":"VoxSafeBench reveals that speech language models recognize social norms from text but fail to apply them when acoustic cues like speaker or scene determine the 
appropriate response.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"environmental cues so that benchmark failures are less likely to be driven by weak or noisy audio evidence. All synthesized audio is filtered with Whisper-large-v3 [30], discarding samples with WER > 5%. Full construction details, including the prompt-audio pool composition and quality control pipeline, are provided in Appendix D. Evaluation Models and JudgingWe curate a model set with demonstrated strong audio understand- ing. Guided by MMSU [1] and MMAU-Pro [2], two benchmarks that emphasize paralinguistic and sound-mixture reasoning, we select open models (Qwen3-Omni [31], Mimo-Audio [32], Kimi-Audio [33]) and closed models (Gemini-3-Pro, Gemini-3-Flash, GPT-4o-Audio [34]). For Qwen3-Omni and Mimo-Audio, we additionally evaluate their thinking variants, as reasoning modes can exhibit noticeably different alignment behaviors."},{"citing_arxiv_id":"2604.12527","ref_index":47,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Audio-Cogito: Towards Deep Audio Reasoning in Large Audio Language Models","primary_cat":"eess.AS","submitted_at":"2026-04-14T10:00:39+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Audio-Cogito is an open-source LALM using Cogito-pipe data curation and self-distillation to achieve leading open-source performance on audio reasoning benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.08209","ref_index":36,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"OmniJigsaw: Enhancing Omni-Modal Reasoning via Modality-Orchestrated Reordering","primary_cat":"cs.CV","submitted_at":"2026-04-09T13:09:40+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"OmniJigsaw 
is a self-supervised proxy task that reconstructs shuffled audio-visual clips via joint integration, sample-level selection, and clip-level masking strategies, yielding gains on 15 video, audio, and reasoning benchmarks.","context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"14 70.70 ↑2.20 OmniJigsaw (JMI) 58.33 ↑1.72 74.50 ↑0.10 69.80 ↓0.36 69.10 ↑0.60 OmniJigsaw (SMS) 58.46 ↑1.85 75.80 ↑1.40 70.48 ↑0.32 69.50 ↑1.00 OmniJigsaw (CMM) 58.59 ↑1.98 76.30 ↑1.90 70.70 ↑0.54 71.00 ↑2.50 Audio ReasoningTo evaluate audio understanding improvements facilitated by our OmniJigsaw, we employ four representative bench- marks: MMSU [36] for fine-grained perception, MMAU-test-mini [28] and MMAR [21] for hierarchical reasoning, and MMAU-Pro [13] for versatile auditory comprehension. As shown in Table 2, OmniJigsaw yields consistent improvements; no- tably, CMM outperforms AudioJig- saw despite the latter's exclusive audio attention, validating its effi- cacy in excavating mutually bene-"},{"citing_arxiv_id":"2603.17837","ref_index":34,"ref_count":2,"confidence":0.98,"is_internal_anchor":true,"paper_title":"The Silent Thought: Modeling Internal Cognition in Full-Duplex Spoken Dialogue Models via Latent Reasoning","primary_cat":"eess.AS","submitted_at":"2026-03-18T15:30:29+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2509.19858","ref_index":17,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Benchmarking Gaslighting Attacks Against Speech Large Language Models","primary_cat":"cs.CL","submitted_at":"2025-09-24T07:57:10+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Gaslighting attacks using Anger, Cognitive Disruption, Sarcasm, Implicit, and Professional 
Negation strategies cause a 24.3% average accuracy drop in Speech LLMs while also triggering behavioral changes like apologies and refusals.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2509.17765","ref_index":27,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Qwen3-Omni Technical Report","primary_cat":"cs.CL","submitted_at":"2025-09-22T13:26:24+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Qwen3-Omni is a unified multimodal model that achieves open-source SOTA on 32 of 36 audio and audio-visual benchmarks and overall SOTA on 22 without degrading performance on text, image, or video relative to single-modal Qwen counterparts.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2507.08128","ref_index":108,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Audio Flamingo 3: Advancing Audio Intelligence with Fully Open Large Audio Language Models","primary_cat":"cs.SD","submitted_at":"2025-07-10T19:40:21+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Audio Flamingo 3 introduces an open large audio-language model achieving new state-of-the-art results on over 20 audio understanding and reasoning benchmarks using a unified encoder and curriculum training on open data.","context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"0Audio Flamingo 3 1.86 TEDLIUM (en) Phi-4-mm WER↓ 2.9Audio Flamingo 3 3.5 GigaSpeech (en) Phi-4-mm WER↓ 9.78Audio Flamingo 3 10.27 Common Voice 15 (en) Phi-4-mm WER↓ 7.61Audio Flamingo 3 7.4 VoxPopuli (en) Phi-4-mm WER↓ 5.91Audio Flamingo 3 5.55 focused audio QA (MMAU [ 101] (v05.15.25), MuchoMusic (perceptual version) [ 120, 110], MMAR [81], MMSU [108], CompA-R-test [42], Audio 
Entailment [29]), multimodal hallucination detection (CMM [72]), audio captioning (Clotho-v2 [32], AudioCaps [60]), ASR (Librispeech (clean and other) [92], SPGISpeech [90], TEDLIUM [100, 49], GigaSpeech (Large) [14], Common Voice 15 [5] and Voxpopuli [107]) and long audio captioning and QA (LongAudioBench - which we 8"}],"limit":50,"offset":0}