ViMU is the first benchmark for evaluating video models on metaphorical and subtextual understanding using hint-free questions grounded in multimodal evidence.
hub Tool reference
Mmar: A challenging benchmark for deep reasoning in speech, audio, music, and their mix
Tool reference. 80% of classified Pith citations use this work as a method, library, or software dependency, not as a substantive claim.
hub tools
citation-role summary
citation-polarity summary
verdicts
UNVERDICTED 14representative citing papers
Semantic-level and verification-based uncertainty methods outperform token-level baselines for audio reasoning in ALLMs, but their relative performance on hallucination and unanswerable-question benchmarks is model- and task-dependent.
AQUA-Bench evaluates audio QA models on three unanswerability scenarios: missing correct answers, mismatched choice sets, and questions irrelevant to the audio.
A single DiT-based diffusion model unifies video-to-audio, text-to-audio, and joint video-text-to-audio generation, supported by a new 470k-pair dataset and three-stage progressive training that resolves task competition.
MECAT is a multi-expert benchmark for audio AI offering fine-grained captions and QA pairs generated via expert models and LLM reasoning, paired with the DATE metric that combines semantic similarity and cross-sample discriminability to favor detailed outputs.
A data pipeline, 14-dimension benchmark, and decoupled fine-tuning model are presented to advance fine-grained multi-dimensional speech understanding in LLMs.
A survey of Large Audio Language Models that establishes a taxonomy of trustworthiness vulnerabilities and proposes a Defense-in-Depth roadmap for audio intelligence.
A statistical sign-off protocol for audio compressors ensures worst-case answer preservation across query families in LALMs.
Qwen3.5-Omni scales an omnimodal model to hundreds of billions of parameters with 256k context, introduces ARIA for stable speech synthesis, and reports SOTA performance on 215 audio-visual benchmarks while adding multilingual and audio-visual coding capabilities.
Audio-Cogito is an open-source LALM using Cogito-pipe data curation and self-distillation to achieve leading open-source performance on audio reasoning benchmarks.
OmniJigsaw is a self-supervised proxy task that reconstructs shuffled audio-visual clips via joint integration, sample-level selection, and clip-level masking strategies, yielding gains on 15 video, audio, and reasoning benchmarks.
Irrelevant audio including silence reduces accuracy and increases volatility in text reasoning for large audio-language models, with effects worsening at longer durations, higher amplitudes, and higher temperatures.
OmniFysics is an omni-modal network using a dynamic physical data engine and evolutive tuning to improve performance on multimodal benchmarks and physics-oriented tasks.
A survey that provides a unified formulation of audio reasoning and reviews advances across Audio-to-Text, Audio-to-Speech, Audio-Visual, and Agentic paradigms while discussing challenges and future directions.
citing papers explorer
-
ViMU: Benchmarking Video Metaphorical Understanding
ViMU is the first benchmark for evaluating video models on metaphorical and subtextual understanding using hint-free questions grounded in multimodal evidence.
-
Walking Through Uncertainty: An Empirical Study of Uncertainty Estimation for Audio-Aware Large Language Models
Semantic-level and verification-based uncertainty methods outperform token-level baselines for audio reasoning in ALLMs, but their relative performance on hallucination and unanswerable-question benchmarks is model- and task-dependent.
-
AQUA-Bench: Beyond Finding Answers to Knowing When There Are None in Audio Question Answering
AQUA-Bench evaluates audio QA models on three unanswerability scenarios: missing correct answers, mismatched choice sets, and questions irrelevant to the audio.
-
Omni2Sound: Towards Unified Video-Text-to-Audio Generation
A single DiT-based diffusion model unifies video-to-audio, text-to-audio, and joint video-text-to-audio generation, supported by a new 470k-pair dataset and three-stage progressive training that resolves task competition.
-
MECAT: A Multi-Experts Constructed Benchmark for Fine-Grained Audio Understanding Tasks
MECAT is a multi-expert benchmark for audio AI offering fine-grained captions and QA pairs generated via expert models and LLM reasoning, paired with the DATE metric that combines semantic similarity and cross-sample discriminability to favor detailed outputs.
-
Towards Fine-Grained Multi-Dimensional Speech Understanding: Data Pipeline, Benchmark, and Model
A data pipeline, 14-dimension benchmark, and decoupled fine-tuning model are presented to advance fine-grained multi-dimensional speech understanding in LLMs.
-
A Survey of Large Audio Language Models: Generalization, Trustworthiness, and Outlook
A survey of Large Audio Language Models that establishes a taxonomy of trustworthiness vulnerabilities and proposes a Defense-in-Depth roadmap for audio intelligence.
-
Task-Aware Answer Preservation under Audio Compression for Large Audio Language Models
A statistical sign-off protocol for audio compressors ensures worst-case answer preservation across query families in LALMs.
-
Qwen3.5-Omni Technical Report
Qwen3.5-Omni scales an omnimodal model to hundreds of billions of parameters with 256k context, introduces ARIA for stable speech synthesis, and reports SOTA performance on 215 audio-visual benchmarks while adding multilingual and audio-visual coding capabilities.
-
Audio-Cogito: Towards Deep Audio Reasoning in Large Audio Language Models
Audio-Cogito is an open-source LALM using Cogito-pipe data curation and self-distillation to achieve leading open-source performance on audio reasoning benchmarks.
-
OmniJigsaw: Enhancing Omni-Modal Reasoning via Modality-Orchestrated Reordering
OmniJigsaw is a self-supervised proxy task that reconstructs shuffled audio-visual clips via joint integration, sample-level selection, and clip-level masking strategies, yielding gains on 15 video, audio, and reasoning benchmarks.
-
When Silence Matters: The Impact of Irrelevant Audio on Text Reasoning in Large Audio-Language Models
Irrelevant audio including silence reduces accuracy and increases volatility in text reasoning for large audio-language models, with effects worsening at longer durations, higher amplitudes, and higher temperatures.
-
OmniFysics: Towards Physical Intelligence Evolution via Omni-Modal Signal Processing and Network Optimization
OmniFysics is an omni-modal network using a dynamic physical data engine and evolutive tuning to improve performance on multimodal benchmarks and physics-oriented tasks.
-
A Survey of Audio Reasoning in Multimodal Foundation Models
A survey that provides a unified formulation of audio reasoning and reviews advances across Audio-to-Text, Audio-to-Speech, Audio-Visual, and Agentic paradigms while discussing challenges and future directions.