hub Baseline reference

MMSU: A Massive Multi-task Spoken Language Understanding and Reasoning Benchmark

Dingdong Wang, Junan Li, Jincenzi Wu, Dongchao Yang, Xueyuan Chen, Tianhua Zhang · 2025 · cs.CL · arXiv 2506.04779

Baseline reference. 78% of citing Pith papers use this work as a benchmark or comparison.

31 Pith papers citing it

Baseline 78% of classified citations

open full Pith review browse 31 citing papers arXiv PDF

abstract

Speech inherently contains rich acoustic information that extends far beyond the textual language. In real-world spoken language understanding, effective interpretation often requires integrating semantic meaning (e.g., content), paralinguistic features (e.g., emotions, speed, pitch) and phonological characteristics (e.g., prosody, intonation, rhythm), which are embedded in speech. While recent multimodal Speech Large Language Models (SpeechLLMs) have demonstrated remarkable capabilities in processing audio information, their ability to perform fine-grained perception and complex reasoning in natural speech remains largely unexplored. To address this gap, we introduce MMSU, a comprehensive benchmark designed specifically for understanding and reasoning in spoken language. MMSU comprises 5,000 meticulously curated audio-question-answer triplets across 47 distinct tasks. To ground our benchmark in linguistic theory, we systematically incorporate a wide range of linguistic phenomena, including phonetics, prosody, rhetoric, syntactics, semantics, and paralinguistics. Through a rigorous evaluation of 14 advanced SpeechLLMs, we identify substantial room for improvement in existing models, highlighting meaningful directions for future optimization. MMSU establishes a new standard for comprehensive assessment of spoken language understanding, providing valuable insights for developing more sophisticated human-AI speech interaction systems. MMSU benchmark is available at https://huggingface.co/datasets/ddwang2000/MMSU. Evaluation Code is available at https://github.com/dingdongwang/MMSU.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

dataset 6 background 2 baseline 1

citation-polarity summary

use dataset 6 background 2 baseline 1

representative citing papers

TraceAV-Bench: Benchmarking Multi-Hop Trajectory Reasoning over Long Audio-Visual Videos

cs.CV · 2026-05-08 · unverdicted · novelty 8.0

TraceAV-Bench is the first benchmark for multi-hop trajectory reasoning over long audio-visual videos, showing top models reach only 51-68% accuracy with substantial room for improvement.

Benign Fine-Tuning Breaks Safety Alignment in Audio LLMs

cs.CR · 2026-04-17 · conditional · novelty 8.0

Benign fine-tuning on audio data breaks safety alignment in Audio LLMs by raising jailbreak success rates up to 87%, with the dominant risk axis depending on model architecture and embedding proximity to harmful content.

VoxSafeBench: Not Just What Is Said, but Who, How, and Where

cs.SD · 2026-04-16 · unverdicted · novelty 8.0

VoxSafeBench reveals that speech language models recognize social norms from text but fail to apply them when acoustic cues like speaker or scene determine the appropriate response.

SpeechEditBench: A Bilingual Multi-Attribute Benchmark for Instruction-Guided Speech Editing

eess.AS · 2026-06-01 · unverdicted · novelty 7.0

SpeechEditBench provides seven atomic editing tasks, compositional multi-operation instructions, and an anchor-based protocol yielding target success, preservation success, and joint success metrics; evaluations show no model excels across dimensions and compositional editing is especially difficult

PolySpeech-100: A Large-Scale Benchmark for Speech Understanding Across 100+ Languages and Dialects

cs.CL · 2026-05-31 · unverdicted · novelty 7.0

PolySpeech-100 is a new benchmark for native-level speech comprehension across 110 linguistic variants that evaluates 22 models and reports E2E advantages on dialects, robustness gaps on low-resource languages, and degradation from Chain-of-Thought prompting.

DuplexSLA: A Full-Duplex Spoken Language Model with Synchronized Speech, Language, and Action

eess.AS · 2026-05-20 · unverdicted · novelty 7.0 · 2 refs

DuplexSLA introduces a three-channel full-duplex architecture that synchronizes continuous user audio, discrete assistant audio, and rate-limited textual actions inside a single backbone for native turn-taking and in-conversation tool use.

Boosting Omni-Modal Language Models: Staged Post-Training with Visually Debiased Evaluation

cs.MM · 2026-05-12 · unverdicted · novelty 7.0 · 2 refs

Visual debiasing of omni-modal benchmarks combined with staged post-training lets a 3B model match or exceed a 30B model without a stronger teacher.

Walking Through Uncertainty: An Empirical Study of Uncertainty Estimation for Audio-Aware Large Language Models

eess.AS · 2026-04-28 · unverdicted · novelty 7.0

Semantic-level and verification-based uncertainty methods outperform token-level baselines for audio reasoning in ALLMs, but their relative performance on hallucination and unanswerable-question benchmarks is model- and task-dependent.

SpeechParaling-Bench: A Comprehensive Benchmark for Paralinguistic-Aware Speech Generation

cs.CL · 2026-04-22 · unverdicted · novelty 7.0

SpeechParaling-Bench is a new evaluation framework for paralinguistic-aware speech generation that reveals major limitations in current large audio-language models.

Audio Flamingo 3: Advancing Audio Intelligence with Fully Open Large Audio Language Models

cs.SD · 2025-07-10 · unverdicted · novelty 7.0

Audio Flamingo 3 introduces an open large audio-language model achieving new state-of-the-art results on over 20 audio understanding and reasoning benchmarks using a unified encoder and curriculum training on open data.

Is Text All You Need? Text as a Universal Information Bottleneck for Speech LLMs

cs.CL · 2026-06-08 · unverdicted · novelty 6.0

C-Gate represents speech frames as convex combinations of LLM token embeddings to enforce manifold compatibility, delivering up to 48.7% relative WER reduction on LibriSpeech while preserving emotion recognition accuracy.

Continuous Audio Thinking for Large Audio Language Models

cs.CL · 2026-06-05 · unverdicted · novelty 6.0

CoAT adds a continuous latent thinking space to LALMs via expert distillation to retain acoustic information, yielding gains on audio reasoning, understanding, music, emotion, and transcription benchmarks across three models.

Audio Interaction Model

cs.SD · 2026-06-03 · unverdicted · novelty 6.0

Audio-Interaction unifies offline and online audio tasks into one streaming model via the SoundFlow framework and a new 2.6M-item streaming corpus, enabling real-time instruction following and proactive responses.

Entity Binding Failures in Speech LLM Reasoning: Diagnosis and Chain-of-Thought Intervention

cs.CL · 2026-06-03 · unverdicted · novelty 6.0

Speech LLMs fail logical entity-tracking tasks due to binding failures in continuous speech features; Entity-Aware Chain-of-Thought prompting recovers most of the gap even under misrecognition.

Towards Fine-Grained Multi-Dimensional Speech Understanding: Data Pipeline, Benchmark, and Model

eess.AS · 2026-05-12 · unverdicted · novelty 6.0

A data pipeline, 14-dimension benchmark, and decoupled fine-tuning model are presented to advance fine-grained multi-dimensional speech understanding in LLMs.

HeadRouter: Dynamic Head-Weight Routing for Task-Adaptive Audio Token Pruning in Large Audio Language Models

cs.SD · 2026-04-26 · unverdicted · novelty 6.0

HeadRouter prunes audio tokens more effectively by dynamically routing based on per-head importance for semantic versus acoustic tasks, exceeding baseline performance at 70% token retention on Qwen2.5-Omni models.

Benchmarking Gaslighting Attacks Against Speech Large Language Models

cs.CL · 2025-09-24 · unverdicted · novelty 6.0

Gaslighting attacks using Anger, Cognitive Disruption, Sarcasm, Implicit, and Professional Negation strategies cause a 24.3% average accuracy drop in Speech LLMs while also triggering behavioral changes like apologies and refusals.

Qwen3-Omni Technical Report

cs.CL · 2025-09-22 · unverdicted · novelty 6.0

Qwen3-Omni is a unified multimodal model that achieves open-source SOTA on 32 of 36 audio and audio-visual benchmarks and overall SOTA on 22 without degrading performance on text, image, or video relative to single-modal Qwen counterparts.

From Failure Taxonomy to Intervention: A Diagnostic Methodology for Industry-Scale AVLM in Video and Live-Streaming Platform Moderation

cs.LG · 2026-06-29 · unverdicted · novelty 5.0

Presents a failure taxonomy and intervention-mapping methodology for industry-scale AVLM development in content moderation, applied to a global video platform system.

A Unified and Reproducible Experimentation Framework for Speech Understanding

eess.AS · 2026-05-29 · unverdicted · novelty 5.0

SURE is a new standardized framework for evaluating and training speech foundation models and Speech LLMs to improve comparability and reproducibility under realistic conditions.

Escape the Language Prior: Mitigating Late-Stage Modality Collapse in Audio Reasoning via Modality-Aware Policy Optimization

cs.CL · 2026-05-26 · unverdicted · novelty 5.0

MAPO is a dual-branch RL framework using modality relevance masks from cross-modal differential entropy and auxiliary attention losses to reduce late-stage modality collapse in audio reasoning models and improve benchmark results.

A Survey of Large Audio Language Models: Generalization, Trustworthiness, and Outlook

cs.SD · 2026-05-18 · unverdicted · novelty 5.0

A survey of Large Audio Language Models that establishes a taxonomy of trustworthiness vulnerabilities and proposes a Defense-in-Depth roadmap for audio intelligence.

Task-Aware Answer Preservation under Audio Compression for Large Audio Language Models

eess.AS · 2026-05-07 · unverdicted · novelty 5.0

A statistical sign-off protocol for audio compressors ensures worst-case answer preservation across query families in LALMs.

Qwen3.5-Omni Technical Report

cs.CL · 2026-04-17 · unverdicted · novelty 5.0

Qwen3.5-Omni scales an omnimodal model to hundreds of billions of parameters with 256k context, introduces ARIA for stable speech synthesis, and reports SOTA performance on 215 audio-visual benchmarks while adding multilingual and audio-visual coding capabilities.

citing papers explorer

Showing 31 of 31 citing papers.

TraceAV-Bench: Benchmarking Multi-Hop Trajectory Reasoning over Long Audio-Visual Videos cs.CV · 2026-05-08 · unverdicted · none · ref 61 · internal anchor
TraceAV-Bench is the first benchmark for multi-hop trajectory reasoning over long audio-visual videos, showing top models reach only 51-68% accuracy with substantial room for improvement.
Benign Fine-Tuning Breaks Safety Alignment in Audio LLMs cs.CR · 2026-04-17 · conditional · none · ref 26 · internal anchor
Benign fine-tuning on audio data breaks safety alignment in Audio LLMs by raising jailbreak success rates up to 87%, with the dominant risk axis depending on model architecture and embedding proximity to harmful content.
VoxSafeBench: Not Just What Is Said, but Who, How, and Where cs.SD · 2026-04-16 · unverdicted · none · ref 1 · internal anchor
VoxSafeBench reveals that speech language models recognize social norms from text but fail to apply them when acoustic cues like speaker or scene determine the appropriate response.
SpeechEditBench: A Bilingual Multi-Attribute Benchmark for Instruction-Guided Speech Editing eess.AS · 2026-06-01 · unverdicted · none · ref 13 · internal anchor
SpeechEditBench provides seven atomic editing tasks, compositional multi-operation instructions, and an anchor-based protocol yielding target success, preservation success, and joint success metrics; evaluations show no model excels across dimensions and compositional editing is especially difficult
PolySpeech-100: A Large-Scale Benchmark for Speech Understanding Across 100+ Languages and Dialects cs.CL · 2026-05-31 · unverdicted · none · ref 83 · internal anchor
PolySpeech-100 is a new benchmark for native-level speech comprehension across 110 linguistic variants that evaluates 22 models and reports E2E advantages on dialects, robustness gaps on low-resource languages, and degradation from Chain-of-Thought prompting.
DuplexSLA: A Full-Duplex Spoken Language Model with Synchronized Speech, Language, and Action eess.AS · 2026-05-20 · unverdicted · none · ref 34 · 2 links · internal anchor
DuplexSLA introduces a three-channel full-duplex architecture that synchronizes continuous user audio, discrete assistant audio, and rate-limited textual actions inside a single backbone for native turn-taking and in-conversation tool use.
Boosting Omni-Modal Language Models: Staged Post-Training with Visually Debiased Evaluation cs.MM · 2026-05-12 · unverdicted · none · ref 55 · 2 links · internal anchor
Visual debiasing of omni-modal benchmarks combined with staged post-training lets a 3B model match or exceed a 30B model without a stronger teacher.
Walking Through Uncertainty: An Empirical Study of Uncertainty Estimation for Audio-Aware Large Language Models eess.AS · 2026-04-28 · unverdicted · none · ref 29 · internal anchor
Semantic-level and verification-based uncertainty methods outperform token-level baselines for audio reasoning in ALLMs, but their relative performance on hallucination and unanswerable-question benchmarks is model- and task-dependent.
SpeechParaling-Bench: A Comprehensive Benchmark for Paralinguistic-Aware Speech Generation cs.CL · 2026-04-22 · unverdicted · none · ref 27 · internal anchor
SpeechParaling-Bench is a new evaluation framework for paralinguistic-aware speech generation that reveals major limitations in current large audio-language models.
Audio Flamingo 3: Advancing Audio Intelligence with Fully Open Large Audio Language Models cs.SD · 2025-07-10 · unverdicted · none · ref 108 · internal anchor
Audio Flamingo 3 introduces an open large audio-language model achieving new state-of-the-art results on over 20 audio understanding and reasoning benchmarks using a unified encoder and curriculum training on open data.
Is Text All You Need? Text as a Universal Information Bottleneck for Speech LLMs cs.CL · 2026-06-08 · unverdicted · none · ref 16 · internal anchor
C-Gate represents speech frames as convex combinations of LLM token embeddings to enforce manifold compatibility, delivering up to 48.7% relative WER reduction on LibriSpeech while preserving emotion recognition accuracy.
Continuous Audio Thinking for Large Audio Language Models cs.CL · 2026-06-05 · unverdicted · none · ref 51 · internal anchor
CoAT adds a continuous latent thinking space to LALMs via expert distillation to retain acoustic information, yielding gains on audio reasoning, understanding, music, emotion, and transcription benchmarks across three models.
Audio Interaction Model cs.SD · 2026-06-03 · unverdicted · none · ref 31 · internal anchor
Audio-Interaction unifies offline and online audio tasks into one streaming model via the SoundFlow framework and a new 2.6M-item streaming corpus, enabling real-time instruction following and proactive responses.
Entity Binding Failures in Speech LLM Reasoning: Diagnosis and Chain-of-Thought Intervention cs.CL · 2026-06-03 · unverdicted · none · ref 43 · internal anchor
Speech LLMs fail logical entity-tracking tasks due to binding failures in continuous speech features; Entity-Aware Chain-of-Thought prompting recovers most of the gap even under misrecognition.
Towards Fine-Grained Multi-Dimensional Speech Understanding: Data Pipeline, Benchmark, and Model eess.AS · 2026-05-12 · unverdicted · none · ref 23 · internal anchor
A data pipeline, 14-dimension benchmark, and decoupled fine-tuning model are presented to advance fine-grained multi-dimensional speech understanding in LLMs.
HeadRouter: Dynamic Head-Weight Routing for Task-Adaptive Audio Token Pruning in Large Audio Language Models cs.SD · 2026-04-26 · unverdicted · none · ref 22 · internal anchor
HeadRouter prunes audio tokens more effectively by dynamically routing based on per-head importance for semantic versus acoustic tasks, exceeding baseline performance at 70% token retention on Qwen2.5-Omni models.
Benchmarking Gaslighting Attacks Against Speech Large Language Models cs.CL · 2025-09-24 · unverdicted · none · ref 17 · internal anchor
Gaslighting attacks using Anger, Cognitive Disruption, Sarcasm, Implicit, and Professional Negation strategies cause a 24.3% average accuracy drop in Speech LLMs while also triggering behavioral changes like apologies and refusals.
Qwen3-Omni Technical Report cs.CL · 2025-09-22 · unverdicted · none · ref 27 · internal anchor
Qwen3-Omni is a unified multimodal model that achieves open-source SOTA on 32 of 36 audio and audio-visual benchmarks and overall SOTA on 22 without degrading performance on text, image, or video relative to single-modal Qwen counterparts.
From Failure Taxonomy to Intervention: A Diagnostic Methodology for Industry-Scale AVLM in Video and Live-Streaming Platform Moderation cs.LG · 2026-06-29 · unverdicted · none · ref 31 · internal anchor
Presents a failure taxonomy and intervention-mapping methodology for industry-scale AVLM development in content moderation, applied to a global video platform system.
A Unified and Reproducible Experimentation Framework for Speech Understanding eess.AS · 2026-05-29 · unverdicted · none · ref 40 · internal anchor
SURE is a new standardized framework for evaluating and training speech foundation models and Speech LLMs to improve comparability and reproducibility under realistic conditions.
Escape the Language Prior: Mitigating Late-Stage Modality Collapse in Audio Reasoning via Modality-Aware Policy Optimization cs.CL · 2026-05-26 · unverdicted · none · ref 25 · internal anchor
MAPO is a dual-branch RL framework using modality relevance masks from cross-modal differential entropy and auxiliary attention losses to reduce late-stage modality collapse in audio reasoning models and improve benchmark results.
A Survey of Large Audio Language Models: Generalization, Trustworthiness, and Outlook cs.SD · 2026-05-18 · unverdicted · none · ref 189 · internal anchor
A survey of Large Audio Language Models that establishes a taxonomy of trustworthiness vulnerabilities and proposes a Defense-in-Depth roadmap for audio intelligence.
Task-Aware Answer Preservation under Audio Compression for Large Audio Language Models eess.AS · 2026-05-07 · unverdicted · none · ref 24 · internal anchor
A statistical sign-off protocol for audio compressors ensures worst-case answer preservation across query families in LALMs.
Qwen3.5-Omni Technical Report cs.CL · 2026-04-17 · unverdicted · none · ref 40 · internal anchor
Qwen3.5-Omni scales an omnimodal model to hundreds of billions of parameters with 256k context, introduces ARIA for stable speech synthesis, and reports SOTA performance on 215 audio-visual benchmarks while adding multilingual and audio-visual coding capabilities.
Audio-Cogito: Towards Deep Audio Reasoning in Large Audio Language Models eess.AS · 2026-04-14 · unverdicted · none · ref 47 · internal anchor
Audio-Cogito is an open-source LALM using Cogito-pipe data curation and self-distillation to achieve leading open-source performance on audio reasoning benchmarks.
OmniJigsaw: Enhancing Omni-Modal Reasoning via Modality-Orchestrated Reordering cs.CV · 2026-04-09 · unverdicted · none · ref 36 · internal anchor
OmniJigsaw is a self-supervised proxy task that reconstructs shuffled audio-visual clips via joint integration, sample-level selection, and clip-level masking strategies, yielding gains on 15 video, audio, and reasoning benchmarks.
Multilingual Long-Form Speech Instruction Following: KIT's Submission to IWSLT 2026 cs.CL · 2026-06-03 · unverdicted · none · ref 24 · internal anchor
KIT's IWSLT submission uses segment concatenation, LLM label generation and cross-lingual translation to create >1M long-form training instances and shows that likelihood re-ranking harms semantic tasks unless combined with Minimum Bayes Risk decoding.
MOSS-Audio Technical Report cs.SD · 2026-06-01 · unverdicted · none · ref 1 · internal anchor
MOSS-Audio is an audio-language model using a 12.5 Hz encoder, DeepStack cross-layer injection, time markers, and an event-preserving annotation pipeline for unified audio understanding.
Step-Audio-R1.5 Technical Report eess.AS · 2026-04-28 · unverdicted · none · ref 16 · internal anchor
Step-Audio-R1.5 applies RLHF to audio reasoning models to escape the verifiable reward trap of RLVR, preserving analytical ability while restoring prosodic naturalness and immersion in long dialogues.
A Survey of Audio Reasoning in Multimodal Foundation Models eess.AS · 2026-05-20 · unverdicted · none · ref 120 · internal anchor
A survey that provides a unified formulation of audio reasoning and reviews advances across Audio-to-Text, Audio-to-Speech, Audio-Visual, and Agentic paradigms while discussing challenges and future directions.
The Silent Thought: Modeling Internal Cognition in Full-Duplex Spoken Dialogue Models via Latent Reasoning eess.AS · 2026-03-18 · unreviewed · ref 34 · 2 links · internal anchor

MMSU: A Massive Multi-task Spoken Language Understanding and Reasoning Benchmark

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer