arxiv: 2509.17765 · v1 · submitted 2025-09-22 · 💻 cs.CL · cs.AI· cs.CV· eess.AS

Recognition: 2 theorem links

· Lean Theorem

Qwen3-Omni Technical Report

Jin Xu , Zhifang Guo , Hangrui Hu , Yunfei Chu , Xiong Wang , Jinzheng He , Yuxuan Wang , Xian Shi

show 30 more authors

Ting He Xinfa Zhu Yuanjun Lv Yongqi Wang Dake Guo He Wang Linhan Ma Pei Zhang Xinyu Zhang Hongkun Hao Zishan Guo Baosong Yang Bin Zhang Ziyang Ma Xipin Wei Shuai Bai Keqin Chen Xuejing Liu Peng Wang Mingkun Yang Dayiheng Liu Xingzhang Ren Bo Zheng Rui Men Fan Zhou Bowen Yu Jianxin Yang Le Yu Jingren Zhou Junyang Lin

Authors on Pith no claims yet

Pith reviewed 2026-05-11 00:15 UTC · model grok-4.3

classification 💻 cs.CL cs.AIcs.CVeess.AS

keywords multimodal AIQwen3-Omniaudio-visual modelspeech synthesismixture of expertsstreaming generationmultilingual support

0 comments

The pith

Qwen3-Omni maintains state-of-the-art performance on text, image, audio, and video tasks in a single model.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Qwen3-Omni as a unified multimodal model that achieves top performance across multiple input types without the usual trade-offs. It uses a Thinker-Talker mixture-of-experts setup to handle both understanding and generation for text, images, audio, and video. On audio tasks it leads on most benchmarks while also supporting real-time speech output with low latency. This suggests it is possible to build one model that does not sacrifice accuracy when adding more capabilities.

Core claim

Qwen3-Omni maintains state-of-the-art performance across text, image, audio, and video without any degradation relative to single-modal counterparts. It adopts a Thinker-Talker MoE architecture that unifies perception and generation, yielding fluent text and natural real-time speech. Across 36 audio and audio-visual benchmarks, it achieves open-source SOTA on 32 and overall SOTA on 22, outperforming closed-source models like Gemini-2.5-Pro.

What carries the argument

The Thinker-Talker MoE architecture, which separates thinking and talking components to unify multimodal perception and generation, combined with multi-codebook discrete speech codecs for low-latency streaming synthesis.

If this is right

Matches performance of same-sized single-modal Qwen models on all modalities.
Excels on audio tasks, leading 32 out of 36 benchmarks.
Supports text in 119 languages, speech understanding in 19, and generation in 10.
Enables theoretical end-to-end first-packet latency of 234 ms for streaming speech.
Provides a fine-tuned Captioner variant for detailed audio descriptions with low hallucination.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

This approach could allow future models to integrate even more modalities without performance loss.
The low-latency streaming method might extend to other generative tasks beyond speech.
Releasing the Captioner model could accelerate development of better audio analysis tools.
The Thinking model variant demonstrates explicit reasoning over any input modality.

Load-bearing premise

The selected 36 audio and audio-visual benchmarks represent real-world multimodal performance without bias from benchmark choice or evaluation setup.

What would settle it

Results on a new, independently designed set of multimodal benchmarks where Qwen3-Omni shows clear degradation compared to specialized single-modal models.

read the original abstract

We present Qwen3-Omni, a single multimodal model that, for the first time, maintains state-of-the-art performance across text, image, audio, and video without any degradation relative to single-modal counterparts. Qwen3-Omni matches the performance of same-sized single-modal models within the Qwen series and excels particularly on audio tasks. Across 36 audio and audio-visual benchmarks, Qwen3-Omni achieves open-source SOTA on 32 benchmarks and overall SOTA on 22, outperforming strong closed-source models such as Gemini-2.5-Pro, Seed-ASR, and GPT-4o-Transcribe. Qwen3-Omni adopts a Thinker-Talker MoE architecture that unifies perception and generation across text, images, audio, and video, yielding fluent text and natural real-time speech. It supports text interaction in 119 languages, speech understanding in 19 languages, and speech generation in 10 languages. To reduce first-packet latency in streaming synthesis, Talker autoregressively predicts discrete speech codecs using a multi-codebook scheme. Leveraging the representational capacity of these codebooks, we replace computationally intensive block-wise diffusion with a lightweight causal ConvNet, enabling streaming from the first codec frame. In cold-start settings, Qwen3-Omni achieves a theoretical end-to-end first-packet latency of 234 ms. To further strengthen multimodal reasoning, we introduce a Thinking model that explicitly reasons over inputs from any modality. Since the research community currently lacks a general-purpose audio captioning model, we fine-tuned Qwen3-Omni-30B-A3B to obtain Qwen3-Omni-30B-A3B-Captioner, which produces detailed, low-hallucination captions for arbitrary audio inputs. Qwen3-Omni-30B-A3B, Qwen3-Omni-30B-A3B-Thinking, and Qwen3-Omni-30B-A3B-Captioner are publicly released under the Apache 2.0 license.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Qwen3-Omni delivers a practical unified multimodal model with real engineering wins on streaming audio, but the no-degradation claim across modalities rests on unshown evaluation details.

read the letter

The main thing here is that Qwen3-Omni actually ships a single model that handles text, image, audio, and video while claiming to match the single-modal Qwen baselines on the non-audio parts. They introduce a Thinker-Talker MoE split plus a causal ConvNet that replaces diffusion for speech synthesis, which cuts first-packet latency to a theoretical 234 ms. That part looks like a concrete improvement over prior streaming approaches in the series. They also release a separate captioner variant and a Thinking model for explicit reasoning, and they open-source the 30B-A3B versions under Apache 2.0 with support for 119 languages on text and solid numbers on 19/10 for speech understanding and generation. The 36 audio benchmarks where they hit open-source SOTA on 32 and overall SOTA on 22 against Gemini-2.5-Pro and GPT-4o-Transcribe is the headline result they want people to notice. On the positive side, the architecture choices are described clearly enough to be reproducible in principle, and the multilingual scope plus the low-hallucination captioner fill a real gap. The soft spot is exactly what the stress test flagged: the abstract and reported numbers give no prompt templates, decoding settings, data versions, or confirmation that the single-modal baselines were run under identical conditions. Without those tables and ablations in the full text, the central no-degradation claim is hard to assess. The paper is aimed at people building real-time multimodal systems or extending Qwen-style models, and it is worth a serious referee because the open weights and the specific latency fix make it usable even if the benchmark comparisons need tightening. I would send it to review with a request for the full evaluation protocol and any internal ablations they ran.

Referee Report

2 major / 2 minor

Summary. The paper introduces Qwen3-Omni, a unified multimodal model using a Thinker-Talker MoE architecture for perception and generation across text, image, audio, and video. It claims to match same-sized single-modal Qwen models with no degradation on text/image/video tasks while achieving open-source SOTA on 32 of 36 audio/audio-visual benchmarks and overall SOTA on 22, outperforming closed-source systems such as Gemini-2.5-Pro, Seed-ASR, and GPT-4o-Transcribe. Additional contributions include multi-language support (119 text, 19 speech understanding, 10 speech generation), a multi-codebook streaming synthesis method yielding 234 ms theoretical first-packet latency via causal ConvNet, a Thinking model for multimodal reasoning, and a fine-tuned audio captioner variant; the 30B-A3B, Thinking, and Captioner models are released under Apache 2.0.

Significance. If the no-degradation and SOTA claims are substantiated by controlled, reproducible evaluations, the work would represent a meaningful advance in unified multimodal systems by showing that a single model can avoid typical cross-modal trade-offs while adding practical streaming and captioning capabilities. The open release and focus on audio excellence would facilitate community follow-up and applications in multilingual settings.

major comments (2)

[Abstract] Abstract: the central claim that Qwen3-Omni 'maintains state-of-the-art performance across text, image, audio, and video without any degradation relative to single-modal counterparts' and 'matches the performance of same-sized single-modal models within the Qwen series' is load-bearing, yet no quantitative tables, error bars, ablation results, or protocol details (prompt templates, decoding, data versions) are referenced to support direct head-to-head comparisons under identical conditions.
[Abstract] Abstract (audio benchmarks paragraph): the assertion of open-source SOTA on 32/36 and overall SOTA on 22 benchmarks versus Gemini-2.5-Pro, Seed-ASR, and GPT-4o-Transcribe rests on unstated evaluation equivalence; without disclosed re-evaluation of baselines under the same setup or exclusion rules, the cross-model superiority cannot be verified and directly affects the 'excels particularly on audio tasks' contribution.

minor comments (2)

[Abstract] The abstract lists language support counts (119/19/10) but does not indicate whether these are supported in all modalities or only specific ones; a clarifying sentence or table would improve precision.
[Abstract] The multi-codebook streaming mechanism and replacement of block-wise diffusion by causal ConvNet are described at high level; a short diagram or pseudocode would aid reproducibility of the 234 ms latency claim.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful review and for recognizing the potential impact of Qwen3-Omni. We address the two major comments on the abstract below, providing point-by-point clarifications drawn from the full manuscript and committing to targeted revisions that improve transparency without altering the reported results.

read point-by-point responses

Referee: [Abstract] Abstract: the central claim that Qwen3-Omni 'maintains state-of-the-art performance across text, image, audio, and video without any degradation relative to single-modal counterparts' and 'matches the performance of same-sized single-modal models within the Qwen series' is load-bearing, yet no quantitative tables, error bars, ablation results, or protocol details (prompt templates, decoding, data versions) are referenced to support direct head-to-head comparisons under identical conditions.

Authors: We appreciate this observation. The manuscript contains the requested quantitative support in Sections 4 and 5. Section 4 presents head-to-head comparisons on text, image, and video benchmarks (Tables 1–4) against the corresponding single-modal Qwen2.5 and Qwen2 models of matching size, with per-task scores, standard deviations where multiple seeds were run, and explicit statements that no degradation occurs. Section 5 extends this to audio and audio-visual tasks (Tables 5–8). Ablations on the Thinker-Talker MoE routing, modality-specific adapters, and codebook usage appear in Section 6. Full protocol details—including prompt templates, decoding parameters (temperature, top-p), data versions, and benchmark splits—are provided in Section 3.3 and the appendix. To address the referee’s concern directly, we will revise the abstract to include explicit cross-references (e.g., “as shown in Tables 2 and 5 and detailed in Section 3.3”). This change makes the load-bearing claim traceable while preserving the abstract’s brevity. revision: yes
Referee: [Abstract] Abstract (audio benchmarks paragraph): the assertion of open-source SOTA on 32/36 and overall SOTA on 22 benchmarks versus Gemini-2.5-Pro, Seed-ASR, and GPT-4o-Transcribe rests on unstated evaluation equivalence; without disclosed re-evaluation of baselines under the same setup or exclusion rules, the cross-model superiority cannot be verified and directly affects the 'excels particularly on audio tasks' contribution.

Authors: We agree that evaluation equivalence must be stated clearly. The 32/36 open-source SOTA and 22 overall SOTA counts are derived from the standardized benchmark suite described in Section 5. For open-source models we report our own runs under identical prompts and decoding settings; for closed-source systems (Gemini-2.5-Pro, GPT-4o-Transcribe, Seed-ASR) we used the latest publicly released API versions with the exact same benchmark inputs and post-processing rules as our model. Any exclusions (e.g., language-specific subsets or modality mismatches) are enumerated in the appendix table that accompanies each benchmark. We will add a concise clarifying clause to the abstract (“evaluated under consistent protocols; see Section 5 and Appendix B”) and expand the evaluation paragraph in Section 5 to list the precise API versions, prompt templates, and exclusion criteria used for each baseline. These revisions will allow independent verification of the audio-task superiority claim. revision: partial

Circularity Check

0 steps flagged

No circularity; performance claims rest on external benchmark comparisons

full rationale

The paper presents Qwen3-Omni as a multimodal model whose central claims are empirical: it matches single-modal Qwen baselines and achieves SOTA on 32 of 36 audio benchmarks while outperforming closed-source models. These results are reported as direct evaluations rather than derived quantities. No equations, fitted parameters renamed as predictions, or self-referential definitions appear in the architecture description (Thinker-Talker MoE) or latency techniques. The fine-tuning for the Captioner variant is an explicit post-training step, not a circular derivation. Any self-citations (if present in the full text) are not load-bearing for the performance assertions, which rely on external benchmarks. The derivation chain is therefore self-contained against independent test sets.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axioms · 0 invented entities

Abstract provides no explicit free parameters, axioms, or invented entities beyond standard large-model training assumptions; the design implicitly assumes that multimodal data can be unified via MoE without trade-offs and that discrete speech codecs plus causal ConvNet suffice for low-latency generation.

free parameters (2)

MoE expert count and routing parameters
Chosen to allocate capacity across text, image, audio, and video modalities during training.
Multi-codebook speech codec configuration
Selected to enable autoregressive prediction and low first-packet latency.

axioms (1)

domain assumption A single model can match specialized single-modal performance across modalities when using appropriate architecture and training.
Central premise underlying the no-degradation claim.

pith-pipeline@v0.9.0 · 5825 in / 1466 out tokens · 79190 ms · 2026-05-11T00:15:39.055048+00:00 · methodology

discussion (0)

Forward citations

Cited by 60 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Omni-DeepSearch: A Benchmark for Audio-Driven Omni-Modal Deep Search
cs.SD 2026-05 unverdicted novelty 8.0

Omni-DeepSearch is a 640-sample benchmark for audio-driven omni-modal search where the best model reaches only 43.44% accuracy, exposing bottlenecks in audio inference, tool use, and cross-modal reasoning.
TraceAV-Bench: Benchmarking Multi-Hop Trajectory Reasoning over Long Audio-Visual Videos
cs.CV 2026-05 unverdicted novelty 8.0

TraceAV-Bench is the first benchmark for multi-hop trajectory reasoning over long audio-visual videos, showing top models reach only 51-68% accuracy with substantial room for improvement.
RefereeBench: Are Video MLLMs Ready to be Multi-Sport Referees
cs.CV 2026-04 unverdicted novelty 8.0

RefereeBench shows that even the strongest video MLLMs reach only around 60% accuracy on multi-sport refereeing tasks and struggle with rule application and temporal grounding.
VoxSafeBench: Not Just What Is Said, but Who, How, and Where
cs.SD 2026-04 unverdicted novelty 8.0

VoxSafeBench reveals that speech language models recognize social norms from text but fail to apply them when acoustic cues like speaker or scene determine the appropriate response.
Boosting Omni-Modal Language Models: Staged Post-Training with Visually Debiased Evaluation
cs.MM 2026-05 unverdicted novelty 7.0

Visual debiasing of omni-modal benchmarks combined with staged post-training lets a 3B model match or exceed a 30B model without a stronger teacher.
FLARE: Full-Modality Long-Video Audiovisual Retrieval Benchmark with User-Simulated Queries
cs.MM 2026-05 unverdicted novelty 7.0

FLARE is a new benchmark with 399 long videos, 87k multimodal clips, and 275k user-style queries for testing audiovisual retrieval under caption and query regimes.
Omni-Persona: Systematic Benchmarking and Improving Omnimodal Personalization
cs.CV 2026-05 unverdicted novelty 7.0

Omni-Persona benchmark with 18 tasks shows open-source models have audio-visual grounding gaps, RLVR narrows them but leads to conservative outputs, and scale or recall alone fail as diagnostics.
MIST: Multimodal Interactive Speech-based Tool-calling Conversational Assistants for Smart Homes
cs.CL 2026-05 unverdicted novelty 7.0

MIST is a new synthetic speech-based tool-calling dataset for IoT devices that exposes performance gaps between open- and closed-weight multimodal LLMs.
Fusion in Your Way: Aligning Image Fusion with Heterogeneous Demands via Direct Preference Optimization
cs.CV 2026-05 unverdicted novelty 7.0

DPOFusion uses direct preference optimization on property-aligned and preference-controllable latent diffusion models to produce adaptive infrared-visible image fusions aligned with heterogeneous human and machine vis...
Minimizing Modality Gap from the Input Side: Your Speech LLM Can Be a Prosody-Aware Text LLM
cs.CL 2026-05 unverdicted novelty 7.0

TextPro-SLM minimizes the speech-text modality gap from the input side via a prosody-aware unified encoder, delivering the lowest gap and strong performance at 3B/7B scales with only ~1000 hours of audio.
Sparse Tokens Suffice: Jailbreaking Audio Language Models via Token-Aware Gradient Optimization
cs.CR 2026-05 conditional novelty 7.0

Sparse selection of high-gradient-energy audio tokens suffices for effective jailbreaking of audio language models with minimal drop in attack success rate.
EmoTrans: A Benchmark for Understanding, Reasoning, and Predicting Emotion Transitions in Multimodal LLMs
cs.CV 2026-04 unverdicted novelty 7.0

EmoTrans is a new video benchmark with four progressive tasks that measures how well current multimodal LLMs handle dynamic emotion transitions rather than static recognition.
StoryTR: Narrative-Centric Video Temporal Retrieval with Theory of Mind Reasoning
cs.AI 2026-04 unverdicted novelty 7.0

StoryTR is a new benchmark and agentic data pipeline that adds explicit Theory of Mind reasoning chains to train smaller video retrieval models, yielding a 15% relative IoU gain over larger baselines on narrative content.
Listening with Time: Precise Temporal Awareness for Long-Form Audio Understanding
eess.AS 2026-04 unverdicted novelty 7.0

LAT-Audio introduces a global-to-local reasoning approach with TWA-CoT that outperforms prior models on temporal tasks for audio up to 30 minutes.
SpeechParaling-Bench: A Comprehensive Benchmark for Paralinguistic-Aware Speech Generation
cs.CL 2026-04 unverdicted novelty 7.0

SpeechParaling-Bench is a new evaluation framework for paralinguistic-aware speech generation that reveals major limitations in current large audio-language models.
ATIR: Towards Audio-Text Interleaved Contextual Retrieval
cs.SD 2026-04 unverdicted novelty 7.0

Defines ATIR task and benchmark for mixed audio-text queries; MLLM model with token compression shows substantial gains over strong baselines.
From Reactive to Proactive: Assessing the Proactivity of Voice Agents via ProVoice-Bench
cs.AI 2026-04 unverdicted novelty 7.0

ProVoice-Bench is the first framework to evaluate proactive voice agents, revealing that state-of-the-art multimodal LLMs struggle with over-triggering and context-aware reasoning.
Chain of Modality: From Static Fusion to Dynamic Orchestration in Omni-MLLMs
cs.CV 2026-04 unverdicted novelty 7.0

Chain of Modality dynamically orchestrates multimodal input topologies and bifurcates cognitive execution to overcome static fusion biases in Omni-MLLMs.
HumDial-EIBench: A Human-Recorded Multi-Turn Emotional Intelligence Benchmark for Audio Language Models
eess.AS 2026-04 unverdicted novelty 7.0

HumDial-EIBench is a new benchmark using real human dialogues to evaluate audio language models on emotional intelligence tasks including multi-turn tracking, causal reasoning, empathy generation, and acoustic-semanti...
Script-a-Video: Deep Structured Audio-visual Captions via Factorized Streams and Relational Grounding
cs.CV 2026-04 unverdicted novelty 7.0

MTSS replaces monolithic video captions with factorized streams and relational grounding, yielding reported gains in understanding benchmarks and generation consistency.
OmniScript: Towards Audio-Visual Script Generation for Long-Form Cinematic Video
cs.CV 2026-04 unverdicted novelty 7.0

OmniScript is a new 8B omni-modal model that turns long cinematic videos into scene-by-scene scripts and matches top proprietary models on temporal localization and semantic accuracy.
VidAudio-Bench: Benchmarking V2A and VT2A Generation across Four Audio Categories
cs.SD 2026-04 unverdicted novelty 7.0

VidAudio-Bench benchmarks V2A and VT2A models across four audio categories, revealing poor speech/singing performance and a tension between visual alignment and text instruction following.
SiMing-Bench: Evaluating Procedural Correctness from Continuous Interactions in Clinical Skill Videos
cs.CV 2026-04 unverdicted novelty 7.0

SiMing-Bench shows current MLLMs have weak agreement with physicians on procedural correctness in clinical videos, with intermediate step judgments remaining poor even when overall scores look acceptable.
CapTalk: Unified Voice Design for Single-Utterance and Dialogue Speech Generation
cs.SD 2026-04 unverdicted novelty 7.0

CapTalk unifies single-utterance and dialogue voice design via utterance- and speaker-level captions plus a hierarchical variational module for stable timbre with adaptive expression.
Speaker-Reasoner: Scaling Interaction Turns and Reasoning Patterns for Timestamped Speaker-Attributed ASR
eess.AS 2026-04 unverdicted novelty 7.0

Speaker-Reasoner is an end-to-end speech LLM that iteratively analyzes audio structure, predicts temporal boundaries, and jointly models speaker identity, gender, timestamps, and transcription using a speaker-aware ca...
KoALa-Bench: Evaluating Large Audio Language Models on Korean Speech Understanding and Faithfulness
cs.CL 2026-03 unverdicted novelty 7.0

KoALa-Bench is a new public benchmark with six tasks that tests Korean speech recognition, translation, question answering, instruction following, and faithfulness in large audio language models.
DecepGPT: Schema-Driven Deception Detection with Multicultural Datasets and Robust Multimodal Learning
cs.CV 2026-03 unverdicted novelty 7.0

A new 1695-sample multicultural dataset plus two modules for stable multimodal fusion and modality consistency yield state-of-the-art deception detection with cross-cultural transfer.
TiCo: Time-Controllable Spoken Dialogue Model
cs.CL 2026-03 unverdicted novelty 7.0

TiCo enables spoken dialogue models to follow explicit time constraints in generated responses using Spoken Time Markers and reinforcement learning with verifiable rewards, cutting duration error by 2.7x over its backbone.
OmniTrace: A Unified Framework for Generation-Time Attribution in Omni-Modal LLMs
cs.CL 2026-03 unverdicted novelty 7.0

OmniTrace converts token-level signals into span-level cross-modal attributions for open-ended generation in omni-modal LLMs via generation-time tracing.
SpeakerLLM: A Speaker-Specialized Audio-LLM for Speaker Understanding and Verification Reasoning
cs.SD 2026-05 unverdicted novelty 6.0

SpeakerLLM unifies speaker profiling, recording-condition understanding, and structured verification reasoning in an audio-LLM via a hierarchical tokenizer and decision traces.
Towards Fine-Grained Multi-Dimensional Speech Understanding: Data Pipeline, Benchmark, and Model
eess.AS 2026-05 unverdicted novelty 6.0

A data pipeline, 14-dimension benchmark, and decoupled fine-tuning model are presented to advance fine-grained multi-dimensional speech understanding in LLMs.
Boosting Omni-Modal Language Models: Staged Post-Training with Visually Debiased Evaluation
cs.MM 2026-05 unverdicted novelty 6.0

Staged post-training with self-distillation lets a 3B omni-modal model match or slightly exceed a 30B model on a visually debiased benchmark.
Meow-Omni 1: A Multimodal Large Language Model for Feline Ethology
cs.CL 2026-05 unverdicted novelty 6.0

Meow-Omni 1 is a quad-modal MLLM that fuses video, audio, physiological time-series, and text to achieve 71.16% accuracy on feline intent recognition in the new MeowBench benchmark.
NICE FACT: Diagnosing and Calibrating VLMs in Quantitative Reasoning for Kinematic Physics
cs.CV 2026-05 unverdicted novelty 6.0

VLMs fail to identify visual preconditions or apply physical laws in kinematic physics tasks, as shown by new FACT diagnostics and NICE calibration methods evaluated on six state-of-the-art models.
Head Similarity: Modeling Structured Whole-Head Appearance Beyond Face Recognition
cs.CV 2026-05 unverdicted novelty 6.0

Head Similarity extends identity recognition to structured whole-head similarity by capturing intra-identity appearance variations via hierarchical supervision on a weakly-labeled video benchmark.
VocalParse: Towards Unified and Scalable Singing Voice Transcription with Large Audio Language Models
cs.SD 2026-05 unverdicted novelty 6.0

VocalParse applies interleaved and Chain-of-Thought prompting to a Large Audio Language Model to jointly transcribe lyrics, melody and word-note alignments, achieving state-of-the-art results on multiple singing datasets.
JASTIN: Aligning LLMs for Zero-Shot Audio and Speech Evaluation via Natural Language Instructions
eess.AS 2026-05 unverdicted novelty 6.0

JASTIN is an instruction-driven audio evaluation system that achieves state-of-the-art correlation with human ratings on speech, sound, music, and out-of-domain tasks without task-specific retraining.
When Audio-Language Models Fail to Leverage Multimodal Context for Dysarthric Speech Recognition
cs.AI 2026-05 unverdicted novelty 6.0

Current audio-language models fail to use clinical multimodal context for dysarthric speech recognition, but context-aware LoRA fine-tuning delivers large accuracy gains on the SAP dataset.
CTM-AI: A Blueprint for General AI Inspired by a Model of Consciousness
q-bio.NC 2026-04 unverdicted novelty 6.0

CTM-AI combines a formal consciousness model with foundation models to report state-of-the-art results on sarcasm detection, humor, and agentic tool-use benchmarks.
MiniCPM-o 4.5: Towards Real-Time Full-Duplex Omni-Modal Interaction
cs.CL 2026-04 unverdicted novelty 6.0

MiniCPM-o 4.5 uses the Omni-Flow streaming framework to deliver real-time full-duplex omni-modal interaction with proactive behavior in a 9B model that approaches Gemini 2.5 Flash performance.
All That Glitters Is Not Audio: Rethinking Text Priors and Audio Reliance in Audio-Language Evaluation
cs.SD 2026-04 unverdicted novelty 6.0

Audio-language models retain 60-72% of benchmark scores without audio, and most audio-dependent items can be solved from short fragments rather than full clips.
HeadRouter: Dynamic Head-Weight Routing for Task-Adaptive Audio Token Pruning in Large Audio Language Models
cs.SD 2026-04 unverdicted novelty 6.0

HeadRouter prunes audio tokens more effectively by dynamically routing based on per-head importance for semantic versus acoustic tasks, exceeding baseline performance at 70% token retention on Qwen2.5-Omni models.
TTS-PRISM: A Perceptual Reasoning and Interpretable Speech Model for Fine-Grained Diagnosis
cs.CL 2026-04 unverdicted novelty 6.0

TTS-PRISM defines a 12-dimensional perceptual schema, builds a targeted diagnostic dataset via adversarial synthesis and expert labels, and tunes an end-to-end model that outperforms generalist LLMs in human alignment...
OmniHuman: A Large-scale Dataset and Benchmark for Human-Centric Video Generation
cs.CV 2026-04 unverdicted novelty 6.0

OmniHuman is a new large-scale multi-scene dataset with video-, frame-, and individual-level annotations for human-centric video generation, accompanied by the OHBench benchmark that adds metrics aligned with human pe...
Beyond Text-Dominance: Understanding Modality Preference of Omni-modal Large Language Models
cs.AI 2026-04 unverdicted novelty 6.0

Omni-modal LLMs exhibit visual preference that emerges in mid-to-late layers, enabling hallucination detection without task-specific training.
AVRT: Audio-Visual Reasoning Transfer through Single-Modality Teachers
cs.CV 2026-04 unverdicted novelty 6.0

AVRT transfers reasoning to audio-visual models by distilling traces from single-modality teachers via LLM merger followed by SFT cold-start and RL, achieving SOTA on OmniBench, DailyOmni, and MMAR with 3B/7B models.
Audio2Tool: Speak, Call, Act -- A Dataset for Benchmarking Speech Tool Use
cs.SD 2026-04 unverdicted novelty 6.0

Audio2Tool is a new benchmark dataset that shows speech models perform well on simple commands but degrade sharply on compositional tasks and realistic acoustic noise.
Rethinking Entropy Allocation in LLM-based ASR: Understanding the Dynamics between Speech Encoders and LLMs
eess.AS 2026-04 unverdicted novelty 6.0

A multi-stage training method for LLM-based ASR uses new entropy allocation metrics to achieve competitive benchmark performance with 2.3B parameters while mitigating hallucinations via better encoder-LLM decoupling.
LPM 1.0: Video-based Character Performance Model
cs.CV 2026-04 unverdicted novelty 6.0

LPM 1.0 generates infinite-length, identity-stable, real-time audio-visual conversational performances for single characters using a distilled causal diffusion transformer and a new benchmark.
Generating Synthetic Doctor-Patient Conversations for Long-form Audio Summarization
cs.SD 2026-04 conditional novelty 6.0

A three-stage synthetic data pipeline generates 8800 doctor-patient conversations totaling 1.3k hours of audio and LLM-produced SOAP notes, with evaluation showing cascaded transcription-then-summarization models outp...
Video-MME-v2: Towards the Next Stage in Benchmarks for Comprehensive Video Understanding
cs.CV 2026-04 unverdicted novelty 6.0

Video-MME-v2 is a new benchmark that applies progressive visual-to-reasoning levels and non-linear group scoring to expose gaps in video MLLM capabilities.
OmniRefine: Alignment-Aware Cooperative Compression for Efficient Omnimodal Large Language Models
cs.AI 2026-05 unverdicted novelty 5.0

OmniRefine introduces alignment-aware chunk refinement via similarity and dynamic programming followed by modality-cooperative token compression, achieving near-baseline accuracy at 44% token retention on WorldSense.
AllocMV: Optimal Resource Allocation for Music Video Generation via Structured Persistent State
cs.CV 2026-05 unverdicted novelty 5.0

AllocMV uses a global planner to build a structured persistent state then solves a Multiple-Choice Knapsack Problem to allocate High-Gen, Mid-Gen, and Reuse compute branches, achieving an optimal Cost-Quality Ratio un...
EmoS: A High-Fidelity Multimodal Benchmark for Fine-grained Streaming Emotional Understanding
cs.CL 2026-05 unverdicted novelty 5.0

EmoS is a new high-fidelity benchmark for fine-grained streaming emotional understanding that produces measurable gains when used to fine-tune multimodal large language models.
AudioFace: Language-Assisted Speech-Driven Facial Animation with Multimodal Language Models
cs.CV 2026-05 unverdicted novelty 5.0

AudioFace improves speech-driven facial animation by guiding blendshape prediction with linguistic and articulatory information extracted via multimodal language models.
PRIMED: Adaptive Modality Suppression for Referring Audio-Visual Segmentation via Biased Competition
cs.CV 2026-05 unverdicted novelty 5.0

PRIMED improves referring audio-visual segmentation by using a modality prior decoder and competition-aware fusion to adaptively suppress irrelevant modalities.
Task-Aware Answer Preservation under Audio Compression for Large Audio Language Models
eess.AS 2026-05 unverdicted novelty 5.0

A statistical sign-off protocol for audio compressors ensures worst-case answer preservation across query families in LALMs.
Minimizing Modality Gap from the Input Side: Your Speech LLM Can Be a Prosody-Aware Text LLM
cs.CL 2026-05 unverdicted novelty 5.0

TextPro-SLM reduces the speech-text modality gap by feeding an LLM backbone with synchronized text tokens and prosody embeddings from WhisperPro, achieving lowest gap scores at 3B/7B scales with roughly 1,000 hours of audio.
Full-Duplex Interaction in Spoken Dialogue Systems: A Comprehensive Study from the ICASSP 2026 HumDial Challenge
eess.AS 2026-04 unverdicted novelty 5.0

A new HumDial-FDBench benchmark and real human-recorded dual-channel dataset are released to assess full-duplex dialogue systems on interruptions and conversational flow.
Sema: Semantic Transport for Real-Time Multimodal Agents
cs.MM 2026-04 unverdicted novelty 5.0

Sema reduces uplink bandwidth by 64x for audio and 130-210x for screenshots while keeping multimodal agent task accuracy within 0.7 percentage points of raw baselines in WAN simulations.

Reference graph

Works this paper leans on

37 extracted references · 37 canonical work pages · cited by 70 Pith papers · 22 internal anchors

[1]

Anastassiou, J

URL https://artofproblemsolving.com/wiki/index.php/A IME_Problems_and_Solutions. Philip Anastassiou, Jiawei Chen, Jitong Chen, Yuanzhe Chen, Zhuo Chen, Ziyi Chen, Jian Cong, Lelai Deng, Chuang Ding, Lu Gao, et al. Seed-tts: A family of high-quality versatile speech generation models.arXiv preprint arXiv:2406.02430,

work page arXiv
[2]

URL https://www-cdn.anthropic.com/de8ba9b01c9ab7cbabf5c33b80b7bbc618857627/Model_Card_Claude_ 3.pdf. Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, Binyuan Hui, Luo Ji, Mei Li, Junyang Lin, Runji Lin, Dayiheng Liu, Gao Liu, Chengqiang Lu, Keming Lu, Jianxin Ma, Rui Men, Xingzhang Ren, Xuancheng...

work page internal anchor Pith review Pith/arXiv arXiv
[3]

Are We on the Right Way for Evaluating Large Vision-Language Models?

Lin Chen, Jinsong Li, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Zehui Chen, Haodong Duan, Jiaqi Wang, Yu Qiao, Dahua Lin, et al. Are we on the right way for evaluating large vision-language models? arXiv:2403.20330, 2024a. Yiming Chen, Xianghu Yue, Chen Zhang, Xiaoxue Gao, Robby T Tan, and Haizhou Li. Voicebench: Benchmarking llm-based voice assistants.arXiv p...

work page internal anchor Pith review arXiv
[4]

Qwen-Audio: Advancing Universal Audio Understanding via Unified Large-Scale Audio-Language Models

21 Yunfei Chu, Jin Xu, Xiaohuan Zhou, Qian Yang, Shiliang Zhang, Zhijie Yan, Chang Zhou, and Jingren Zhou. Qwen-Audio: Advancing universal audio understanding via unified large-scale audio-language models.CoRR, abs/2311.07919,

work page internal anchor Pith review arXiv
[5]

Qwen2-Audio Technical Report

Yunfei Chu, Jin Xu, Qian Yang, Haojie Wei, Xipin Wei, Zhifang Guo, Yichong Leng, Yuanjun Lv, Jinzheng He, Junyang Lin, et al. Qwen2-audio technical report.arXiv preprint arXiv:2407.10759,

work page internal anchor Pith review arXiv
[6]

Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities.arXiv preprint arXiv:2507.06261,

work page internal anchor Pith review Pith/arXiv arXiv
[7]

CosyVoice 2: Scalable Streaming Speech Synthesis with Large Language Models

Zhihao Du, Yuxuan Wang, Qian Chen, Xian Shi, Xiang Lv, Tianyu Zhao, Zhifu Gao, Yexin Yang, Changfeng Gao, Hui Wang, et al. Cosyvoice 2: Scalable streaming speech synthesis with large language models.arXiv preprint arXiv:2412.10117,

work page internal anchor Pith review arXiv
[8]

Cosyvoice 3: Towards in-the-wild speech gen- eration via scaling-up and post-training.arXiv preprint arXiv:2505.17589, 2025

Zhihao Du, Changfeng Gao, Yuxuan Wang, Fan Yu, Tianyu Zhao, Hao Wang, Xiang Lv, Hui Wang, Chongjia Ni, Xian Shi, Keyu An, Guanrou Yang, Yabin Li, Yanni Chen, Zhifu Gao, Qian Chen, Yue Gu, Mengzhe Chen, Yafeng Chen, Shiliang Zhang, Wen Wang, and Jieping Ye. Cosyvoice 3: Towards in-the-wild speech generation via scaling-up and post-training.CoRR, abs/2505.17589,

work page arXiv
[9]

Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, Anirudh Goyal, Anthony Hartshorn, Aobo Yang, Archi Mitra, Archie Sravankumar, Artem Korenev, Arthur Hinsvark, Arun Rao, Aston Zhang, Aurélien Rodriguez, Austen Gregerson, Ava Spataru, Baptiste Rozière, Bethany...

work page internal anchor Pith review Pith/arXiv arXiv
[10]

Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis

Chaoyou Fu, Yuhan Dai, Yondong Luo, Lei Li, Shuhuai Ren, Renrui Zhang, Zihan Wang, Chenyu Zhou, Yunhang Shen, Mengdan Zhang, et al. Video-mme: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis.arXiv:2405.21075,

work page internal anchor Pith review arXiv
[11]

Are we done with mmlu? CoRR, abs/2406.04127,

Aryo Pradipta Gema, Joshua Ong Jun Leang, Giwon Hong, Alessio Devoto, Alberto Carlo Maria Mancino, Rohit Saxena, Xuanli He, Yu Zhao, Xiaotang Du, Mohammad Reza Ghasemi Madani, et al. Are we done with mmlu?CoRR, abs/2406.04127,

work page arXiv
[12]

Audio Flamingo 3: Advancing Audio Intelligence with Fully Open Large Audio Language Models

URL https://storage.googleapis.com/deepmind-media/gemini/gemi ni_v1_5_report.pdf. Arushi Goel, Sreyan Ghosh, Jaehyeon Kim, Sonal Kumar, Zhifeng Kong, Sang-gil Lee, Chao-Han Huck Yang, Ramani Duraiswami, Dinesh Manocha, Rafael Valle, et al. Audio flamingo 3: Advancing audio intelligence with fully open large audio language models.arXiv preprint arXiv:2507.08128,

work page internal anchor Pith review arXiv
[13]

Hallusionbench: An advanced diagnostic suite for entangled language hallucination and visual illusion in large vision-language models

Tianrui Guan, Fuxiao Liu, Xiyang Wu, Ruiqi Xian, Zongxia Li, Xiaoyu Liu, Xijun Wang, Lichang Chen, Furong Huang, Yaser Yacoob, Dinesh Manocha, and Tianyi Zhou. Hallusionbench: An advanced diagnostic suite for entangled language hallucination and visual illusion in large vision-language models. InIEEE/CVF Conference on Computer Vision and Pattern Recogniti...

work page 2024
[14]

arXiv preprint arXiv:2410.15553 , year=

22 Yun He, Di Jin, Chaoqi Wang, Chloe Bi, Karishma Mandyam, Hejia Zhang, Chen Zhu, Ning Li, Tengyu Xu, Hongjiang Lv, Shruti Bhosale, Chenguang Zhu, Karthik Abinav Sankararaman, Eryk Helenowski, Melanie Kambadur, Aditya Tayade, Hao Ma, Han Fang, and Sinong Wang. Multi-if: Benchmarking llms on multi-turn and multilingual instructions following.CoRR, abs/2410.15553,

work page arXiv
[15]

arXiv preprint arXiv:2410.15553 , year=

doi: 10.48550 /ARXIV.2410.15553. URLhttps://doi.org/10.48550/arXiv.2410.15553. Jack Hong, Shilin Yan, Jiayin Cai, Xiaolong Jiang, Yao Hu, and Weidi Xie. Worldsense: Evaluating real-world omnimodal understanding for multimodal llms.CoRR, abs/2502.04326,

work page doi:10.48550/arxiv.2410.15553
[16]

BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models

Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models.arXiv:2301.12597,

work page internal anchor Pith review arXiv
[17]

Zebralogic: On the scaling limits of llms for logical reasoning

Bill Yuchen Lin, Ronan Le Bras, Kyle Richardson, Ashish Sabharwal, Radha Poovendran, Peter Clark, and Yejin Choi. ZebraLogic: On the scaling limits of LLMs for logical reasoning.CoRR, abs/2502.01100,

work page arXiv
[18]

Visual Instruction Tuning

Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning.arXiv:2304.08485,

work page internal anchor Pith review arXiv
[19]

Chartqa: A benchmark for question answering about charts with visual and logical reasoning, 2022

Ahmed Masry, Do Xuan Long, Jia Qing Tan, Shafiq Joty, and Enamul Hoque. Chartqa: A benchmark for question answering about charts with visual and logical reasoning.arXiv:2203.10244,

work page arXiv
[20]

URL https://github.com/openai/openai-python/blob/e389823ba013a24b4c3 2ce38fa0bd87e6bccae94/chatml.md. OpenAI. GPT4 technical report.CoRR, abs/2303.08774,

work page internal anchor Pith review Pith/arXiv arXiv
[21]

Roni Paiss, Ariel Ephrat, Omer Tov, Shiran Zada, Inbar Mosseri, Michal Irani, and Tali Dekel

URLhttps://eqbench.com/creative_writing.html. Roni Paiss, Ariel Ephrat, Omer Tov, Shiran Zada, Inbar Mosseri, Michal Irani, and Tali Dekel. Teaching CLIP to count to ten. InIEEE/CVF International Conference on Computer Vision, ICCV 2023, Paris, France, October 1-6, 2023, pp. 3147–3157. IEEE,

work page 2023
[22]

David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, and Samuel R. Bowman. GPQA: A graduate-level Google-proof Q&A benchmark. CoRR, abs/2311.12022,

work page internal anchor Pith review arXiv
[23]

MMAU: A Massive Multi-Task Audio Understanding and Reasoning Benchmark

URLhttps://arxiv.org/abs/2410.19168. Jianlin Su, Murtadha H. M. Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. Roformer: Enhanced Transformer with rotary position embedding.Neurocomputing, 568:127063,

work page internal anchor Pith review arXiv
[24]

video-salmonn 2: Caption-enhanced audio-visual large language models.arXiv preprint arXiv:2506.15220, 2025

Changli Tang, Yixuan Li, Yudong Yang, Jimin Zhuang, Guangzhi Sun, Wei Li, Zejun Ma, and Chao Zhang. video-salmonn 2: Captioning-enhanced audio-visual large language models.CoRR, abs/2506.15220,

work page arXiv
[25]

Llama 2: Open Foundation and Fine-Tuned Chat Models

23 Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models.arXiv:2307.09288,

work page internal anchor Pith review Pith/arXiv arXiv
[26]

SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features

Michael Tschannen, Alexey Gritsenko, Xiao Wang, Muhammad Ferjad Naeem, Ibrahim Alabdulmohsin, Nikhil Parthasarathy, Talfan Evans, Lucas Beyer, Ye Xia, Basil Mustafa, Olivier Hénaff, Jeremiah Harmsen, Andreas Steiner, and Xiaohua Zhai. Siglip 2: Multilingual vision-language encoders with improved semantic understanding, localization, and dense features.htt...

work page internal anchor Pith review Pith/arXiv arXiv
[27]

Towards understanding chain-of-thought prompting: An empirical study of what matters

Dingdong Wang, Jincenzi Wu, Junan Li, Dongchao Yang, Xueyuan Chen, Tianhua Zhang, and Helen Meng. MMSU: A massive multi-task spoken language understanding and reasoning benchmark.CoRR, abs/2506.04779, 2025a. doi: 10.48550/ARXIV.2506.04779. URL https://doi.org/10.48550/arXiv.250 6.04779. Ke Wang, Junting Pan, Weikang Shi, Zimu Lu, Mingjie Zhan, and Hongshe...

work page doi:10.48550/arxiv.2506.04779
[28]

Qwen2 Technical Report

An Yang, Baosong Yang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Zhou, Chengpeng Li, Chengyuan Li, Dayiheng Liu, Fei Huang, et al. Qwen2 technical report.arXiv:2407.10671,

work page internal anchor Pith review Pith/arXiv arXiv
[29]

Qwen3 Technical Report

An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report.arXiv preprint arXiv:2505.09388, 2025a. Qize Yang, Shimin Yao, Weixuan chen, Shenghao Fu, Detao Bai, Jiaxing Zhao, Boyuan Sun, Bowen Yin, Xihan Wei, and Jingren Zhou. Humanomniv2: From understanding to omni-m...

work page internal anchor Pith review Pith/arXiv arXiv
[30]

MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI

24 Xiang Yue, Yuansheng Ni, Kai Zhang, Tianyu Zheng, Ruoqi Liu, Ge Zhang, Samuel Stevens, Dongfu Jiang, Weiming Ren, Yuxuan Sun, et al. Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for expert agi.arXiv:2311.16502,

work page internal anchor Pith review arXiv
[31]

MMMU-Pro: A More Robust Multi-discipline Multimodal Understanding Benchmark

Xiang Yue, Tianyu Zheng, Yuansheng Ni, Yubo Wang, Kai Zhang, Shengbang Tong, Yuxuan Sun, Ming Yin, Botao Yu, Ge Zhang, et al. Mmmu-pro: A more robust multi-discipline multimodal understanding benchmark.arXiv preprint arXiv:2409.02813,

work page internal anchor Pith review arXiv
[32]

Does the enriched caption accurately align with the musical content of the audio (e.g., instrumenta- tion, vocals, tempo, mood, and production cues)?

Yongyi Zang, Sean O’Brien, Taylor Berg-Kirkpatrick, Julian McAuley, and Zachary Novack. Are you really listening? boosting perceptual awareness in music-qa benchmarks.arXiv preprint arXiv:2504.00369,

work page arXiv
[33]

15 Xin Zhang, Dong Zhang, Shimin Li, Yaqian Zhou, and Xipeng Qiu

Bowen Zhang, Congchao Guo, Geng Yang, Hang Yu, Haozhe Zhang, Heidi Lei, Jialong Mai, Junjie Yan, Kaiyue Yang, Mingqi Yang, Peikai Huang, Ruiyang Jin, Sitan Jiang, Weihua Cheng, Yawei Li, Yichen Xiao, Yiying Zhou, Yongmao Zhang, Yuan Lu, and Yucen He. Minimax-speech: Intrinsic zero-shot text-to-speech with a learnable speaker encoder.CoRR, abs/2505.07916,

work page arXiv
[34]

Group Sequence Policy Optimization

Chujie Zheng, Shixuan Liu, Mingze Li, Xiong-Hui Chen, Bowen Yu, Chang Gao, Kai Dang, Yuqiong Liu, Rui Men, An Yang, et al. Group sequence policy optimization.arXiv preprint arXiv:2507.18071,

work page internal anchor Pith review Pith/arXiv arXiv
[35]

Instruction-Following Evaluation for Large Language Models

Jeffrey Zhou, Tianjian Lu, Swaroop Mishra, Siddhartha Brahma, Sujoy Basu, Yi Luan, Denny Zhou, and Le Hou. Instruction-following evaluation for large language models.CoRR, abs/2311.07911,

work page internal anchor Pith review Pith/arXiv arXiv
[36]

Daily-omni: Towards audio-visual reasoning with temporal alignment across modalities, 2025

Junjie Zhou, Yan Shu, Bo Zhao, Boya Wu, Zhengyang Liang, Shitao Xiao, Minghao Qin, Xi Yang, Yongping Xiong, Bo Zhang, Tiejun Huang, and Zheng Liu. MLVU: benchmarking multi-task long video understanding. InIEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2025, Nashville, TN, USA, June 11-15, 2025, pp. 13691–13701. Computer Vision Founda...

work page arXiv 2025
[37]

28 Yan-Martin Tamm and Anna Aljanaki A LIST OF MRS PAPERS USING CONTENT INFORMATION The complete list of papers used in section 2 can be found in Table 14

Haina Zhu, Yizhi Zhou, Hangting Chen, Jianwei Yu, Ziyang Ma, Rongzhi Gu, Yi Luo, Wei Tan, and Xie Chen. Muq: Self-supervised music representation learning with mel residual vector quantization. arXiv preprint arXiv:2501.01108,

work page arXiv