MMSU: A Massive Multi-task Spoken Language Understanding and Reasoning Benchmark
Baseline reference. 78% of citing Pith papers use this work as a benchmark or comparison.
abstract
Speech inherently contains rich acoustic information that extends far beyond the textual language. In real-world spoken language understanding, effective interpretation often requires integrating semantic meaning (e.g., content), paralinguistic features (e.g., emotions, speed, pitch) and phonological characteristics (e.g., prosody, intonation, rhythm), which are embedded in speech. While recent multimodal Speech Large Language Models (SpeechLLMs) have demonstrated remarkable capabilities in processing audio information, their ability to perform fine-grained perception and complex reasoning in natural speech remains largely unexplored. To address this gap, we introduce MMSU, a comprehensive benchmark designed specifically for understanding and reasoning in spoken language. MMSU comprises 5,000 meticulously curated audio-question-answer triplets across 47 distinct tasks. To ground our benchmark in linguistic theory, we systematically incorporate a wide range of linguistic phenomena, including phonetics, prosody, rhetoric, syntactics, semantics, and paralinguistics. Through a rigorous evaluation of 14 advanced SpeechLLMs, we identify substantial room for improvement in existing models, highlighting meaningful directions for future optimization. MMSU establishes a new standard for comprehensive assessment of spoken language understanding, providing valuable insights for developing more sophisticated human-AI speech interaction systems. MMSU benchmark is available at https://huggingface.co/datasets/ddwang2000/MMSU. Evaluation Code is available at https://github.com/dingdongwang/MMSU.
citing papers explorer
- TraceAV-Bench: Benchmarking Multi-Hop Trajectory Reasoning over Long Audio-Visual Videos
TraceAV-Bench is the first benchmark for multi-hop trajectory reasoning over long audio-visual videos, showing top models reach only 51-68% accuracy with substantial room for improvement.
- Benign Fine-Tuning Breaks Safety Alignment in Audio LLMs
Benign fine-tuning on audio data breaks safety alignment in Audio LLMs by raising jailbreak success rates up to 87%, with the dominant risk axis depending on model architecture and embedding proximity to harmful content.
- VoxSafeBench: Not Just What Is Said, but Who, How, and Where
VoxSafeBench reveals that speech language models recognize social norms from text but fail to apply them when acoustic cues like speaker or scene determine the appropriate response.
- SpeechEditBench: A Bilingual Multi-Attribute Benchmark for Instruction-Guided Speech Editing
SpeechEditBench provides seven atomic editing tasks, compositional multi-operation instructions, and an anchor-based protocol yielding target success, preservation success, and joint success metrics; evaluations show that no model excels across all dimensions and that compositional editing is especially difficult.
- PolySpeech-100: A Large-Scale Benchmark for Speech Understanding Across 100+ Languages and Dialects
PolySpeech-100 is a new benchmark for native-level speech comprehension across 110 linguistic variants that evaluates 22 models and reports E2E advantages on dialects, robustness gaps on low-resource languages, and degradation from Chain-of-Thought prompting.
- DuplexSLA: A Full-Duplex Spoken Language Model with Synchronized Speech, Language, and Action
DuplexSLA introduces a three-channel full-duplex architecture that synchronizes continuous user audio, discrete assistant audio, and rate-limited textual actions inside a single backbone for native turn-taking and in-conversation tool use.
- Boosting Omni-Modal Language Models: Staged Post-Training with Visually Debiased Evaluation
Visual debiasing of omni-modal benchmarks combined with staged post-training lets a 3B model match or exceed a 30B model without a stronger teacher.
- Walking Through Uncertainty: An Empirical Study of Uncertainty Estimation for Audio-Aware Large Language Models
Semantic-level and verification-based uncertainty methods outperform token-level baselines for audio reasoning in ALLMs, but their relative performance on hallucination and unanswerable-question benchmarks is model- and task-dependent.
- SpeechParaling-Bench: A Comprehensive Benchmark for Paralinguistic-Aware Speech Generation
SpeechParaling-Bench is a new evaluation framework for paralinguistic-aware speech generation that reveals major limitations in current large audio-language models.
- Audio Flamingo 3: Advancing Audio Intelligence with Fully Open Large Audio Language Models
Audio Flamingo 3 introduces an open large audio-language model achieving new state-of-the-art results on over 20 audio understanding and reasoning benchmarks using a unified encoder and curriculum training on open data.
- Is Text All You Need? Text as a Universal Information Bottleneck for Speech LLMs
C-Gate represents speech frames as convex combinations of LLM token embeddings to enforce manifold compatibility, delivering up to 48.7% relative WER reduction on LibriSpeech while preserving emotion recognition accuracy.
- Continuous Audio Thinking for Large Audio Language Models
CoAT adds a continuous latent thinking space to LALMs via expert distillation to retain acoustic information, yielding gains on audio reasoning, understanding, music, emotion, and transcription benchmarks across three models.
- Audio Interaction Model
Audio-Interaction unifies offline and online audio tasks into one streaming model via the SoundFlow framework and a new 2.6M-item streaming corpus, enabling real-time instruction following and proactive responses.
- Entity Binding Failures in Speech LLM Reasoning: Diagnosis and Chain-of-Thought Intervention
Speech LLMs fail logical entity-tracking tasks due to binding failures in continuous speech features; Entity-Aware Chain-of-Thought prompting recovers most of the gap even under misrecognition.
- Towards Fine-Grained Multi-Dimensional Speech Understanding: Data Pipeline, Benchmark, and Model
A data pipeline, 14-dimension benchmark, and decoupled fine-tuning model are presented to advance fine-grained multi-dimensional speech understanding in LLMs.
- HeadRouter: Dynamic Head-Weight Routing for Task-Adaptive Audio Token Pruning in Large Audio Language Models
HeadRouter prunes audio tokens more effectively by dynamically routing based on per-head importance for semantic versus acoustic tasks, exceeding baseline performance at 70% token retention on Qwen2.5-Omni models.
- Benchmarking Gaslighting Attacks Against Speech Large Language Models
Gaslighting attacks using Anger, Cognitive Disruption, Sarcasm, Implicit, and Professional Negation strategies cause a 24.3% average accuracy drop in Speech LLMs while also triggering behavioral changes like apologies and refusals.
- Qwen3-Omni Technical Report
Qwen3-Omni is a unified multimodal model that achieves open-source SOTA on 32 of 36 audio and audio-visual benchmarks and overall SOTA on 22 without degrading performance on text, image, or video relative to single-modal Qwen counterparts.
- From Failure Taxonomy to Intervention: A Diagnostic Methodology for Industry-Scale AVLM in Video and Live-Streaming Platform Moderation
Presents a failure taxonomy and intervention-mapping methodology for industry-scale AVLM development in content moderation, applied to a global video platform system.
- A Unified and Reproducible Experimentation Framework for Speech Understanding
SURE is a new standardized framework for evaluating and training speech foundation models and Speech LLMs to improve comparability and reproducibility under realistic conditions.
- Escape the Language Prior: Mitigating Late-Stage Modality Collapse in Audio Reasoning via Modality-Aware Policy Optimization
MAPO is a dual-branch RL framework using modality relevance masks from cross-modal differential entropy and auxiliary attention losses to reduce late-stage modality collapse in audio reasoning models and improve benchmark results.
- A Survey of Large Audio Language Models: Generalization, Trustworthiness, and Outlook
A survey of Large Audio Language Models that establishes a taxonomy of trustworthiness vulnerabilities and proposes a Defense-in-Depth roadmap for audio intelligence.
- Task-Aware Answer Preservation under Audio Compression for Large Audio Language Models
A statistical sign-off protocol for audio compressors ensures worst-case answer preservation across query families in LALMs.
- Qwen3.5-Omni Technical Report
Qwen3.5-Omni scales an omnimodal model to hundreds of billions of parameters with 256k context, introduces ARIA for stable speech synthesis, and reports SOTA performance on 215 audio-visual benchmarks while adding multilingual and audio-visual coding capabilities.
- Audio-Cogito: Towards Deep Audio Reasoning in Large Audio Language Models
Audio-Cogito is an open-source LALM using Cogito-pipe data curation and self-distillation to achieve leading open-source performance on audio reasoning benchmarks.
- OmniJigsaw: Enhancing Omni-Modal Reasoning via Modality-Orchestrated Reordering
OmniJigsaw is a self-supervised proxy task that reconstructs shuffled audio-visual clips via joint integration, sample-level selection, and clip-level masking strategies, yielding gains on 15 video, audio, and reasoning benchmarks.
- Multilingual Long-Form Speech Instruction Following: KIT's Submission to IWSLT 2026
KIT's IWSLT submission uses segment concatenation, LLM label generation and cross-lingual translation to create >1M long-form training instances and shows that likelihood re-ranking harms semantic tasks unless combined with Minimum Bayes Risk decoding.
- MOSS-Audio Technical Report
MOSS-Audio is an audio-language model using a 12.5 Hz encoder, DeepStack cross-layer injection, time markers, and an event-preserving annotation pipeline for unified audio understanding.
- Step-Audio-R1.5 Technical Report
Step-Audio-R1.5 applies RLHF to audio reasoning models to escape the verifiable reward trap of RLVR, preserving analytical ability while restoring prosodic naturalness and immersion in long dialogues.
- A Survey of Audio Reasoning in Multimodal Foundation Models
A survey that provides a unified formulation of audio reasoning and reviews advances across Audio-to-Text, Audio-to-Speech, Audio-Visual, and Agentic paradigms while discussing challenges and future directions.
- The Silent Thought: Modeling Internal Cognition in Full-Duplex Spoken Dialogue Models via Latent Reasoning