Recognition: 2 theorem links
· Lean TheoremQwen3-Omni Technical Report
Pith reviewed 2026-05-11 00:15 UTC · model grok-4.3
The pith
Qwen3-Omni maintains state-of-the-art performance on text, image, audio, and video tasks in a single model.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Qwen3-Omni maintains state-of-the-art performance across text, image, audio, and video without any degradation relative to single-modal counterparts. It adopts a Thinker-Talker MoE architecture that unifies perception and generation, yielding fluent text and natural real-time speech. Across 36 audio and audio-visual benchmarks, it achieves open-source SOTA on 32 and overall SOTA on 22, outperforming closed-source models like Gemini-2.5-Pro.
What carries the argument
The Thinker-Talker MoE architecture, which separates thinking and talking components to unify multimodal perception and generation, combined with multi-codebook discrete speech codecs for low-latency streaming synthesis.
If this is right
- Matches performance of same-sized single-modal Qwen models on all modalities.
- Excels on audio tasks, leading 32 out of 36 benchmarks.
- Supports text in 119 languages, speech understanding in 19, and generation in 10.
- Enables theoretical end-to-end first-packet latency of 234 ms for streaming speech.
- Provides a fine-tuned Captioner variant for detailed audio descriptions with low hallucination.
Where Pith is reading between the lines
- This approach could allow future models to integrate even more modalities without performance loss.
- The low-latency streaming method might extend to other generative tasks beyond speech.
- Releasing the Captioner model could accelerate development of better audio analysis tools.
- The Thinking model variant demonstrates explicit reasoning over any input modality.
Load-bearing premise
The selected 36 audio and audio-visual benchmarks represent real-world multimodal performance without bias from benchmark choice or evaluation setup.
What would settle it
Results on a new, independently designed set of multimodal benchmarks where Qwen3-Omni shows clear degradation compared to specialized single-modal models.
read the original abstract
We present Qwen3-Omni, a single multimodal model that, for the first time, maintains state-of-the-art performance across text, image, audio, and video without any degradation relative to single-modal counterparts. Qwen3-Omni matches the performance of same-sized single-modal models within the Qwen series and excels particularly on audio tasks. Across 36 audio and audio-visual benchmarks, Qwen3-Omni achieves open-source SOTA on 32 benchmarks and overall SOTA on 22, outperforming strong closed-source models such as Gemini-2.5-Pro, Seed-ASR, and GPT-4o-Transcribe. Qwen3-Omni adopts a Thinker-Talker MoE architecture that unifies perception and generation across text, images, audio, and video, yielding fluent text and natural real-time speech. It supports text interaction in 119 languages, speech understanding in 19 languages, and speech generation in 10 languages. To reduce first-packet latency in streaming synthesis, Talker autoregressively predicts discrete speech codecs using a multi-codebook scheme. Leveraging the representational capacity of these codebooks, we replace computationally intensive block-wise diffusion with a lightweight causal ConvNet, enabling streaming from the first codec frame. In cold-start settings, Qwen3-Omni achieves a theoretical end-to-end first-packet latency of 234 ms. To further strengthen multimodal reasoning, we introduce a Thinking model that explicitly reasons over inputs from any modality. Since the research community currently lacks a general-purpose audio captioning model, we fine-tuned Qwen3-Omni-30B-A3B to obtain Qwen3-Omni-30B-A3B-Captioner, which produces detailed, low-hallucination captions for arbitrary audio inputs. Qwen3-Omni-30B-A3B, Qwen3-Omni-30B-A3B-Thinking, and Qwen3-Omni-30B-A3B-Captioner are publicly released under the Apache 2.0 license.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Qwen3-Omni, a unified multimodal model using a Thinker-Talker MoE architecture for perception and generation across text, image, audio, and video. It claims to match same-sized single-modal Qwen models with no degradation on text/image/video tasks while achieving open-source SOTA on 32 of 36 audio/audio-visual benchmarks and overall SOTA on 22, outperforming closed-source systems such as Gemini-2.5-Pro, Seed-ASR, and GPT-4o-Transcribe. Additional contributions include multi-language support (119 text, 19 speech understanding, 10 speech generation), a multi-codebook streaming synthesis method yielding 234 ms theoretical first-packet latency via causal ConvNet, a Thinking model for multimodal reasoning, and a fine-tuned audio captioner variant; the 30B-A3B, Thinking, and Captioner models are released under Apache 2.0.
Significance. If the no-degradation and SOTA claims are substantiated by controlled, reproducible evaluations, the work would represent a meaningful advance in unified multimodal systems by showing that a single model can avoid typical cross-modal trade-offs while adding practical streaming and captioning capabilities. The open release and focus on audio excellence would facilitate community follow-up and applications in multilingual settings.
major comments (2)
- [Abstract] Abstract: the central claim that Qwen3-Omni 'maintains state-of-the-art performance across text, image, audio, and video without any degradation relative to single-modal counterparts' and 'matches the performance of same-sized single-modal models within the Qwen series' is load-bearing, yet no quantitative tables, error bars, ablation results, or protocol details (prompt templates, decoding, data versions) are referenced to support direct head-to-head comparisons under identical conditions.
- [Abstract] Abstract (audio benchmarks paragraph): the assertion of open-source SOTA on 32/36 and overall SOTA on 22 benchmarks versus Gemini-2.5-Pro, Seed-ASR, and GPT-4o-Transcribe rests on unstated evaluation equivalence; without disclosed re-evaluation of baselines under the same setup or exclusion rules, the cross-model superiority cannot be verified and directly affects the 'excels particularly on audio tasks' contribution.
minor comments (2)
- [Abstract] The abstract lists language support counts (119/19/10) but does not indicate whether these are supported in all modalities or only specific ones; a clarifying sentence or table would improve precision.
- [Abstract] The multi-codebook streaming mechanism and replacement of block-wise diffusion by causal ConvNet are described at high level; a short diagram or pseudocode would aid reproducibility of the 234 ms latency claim.
Simulated Author's Rebuttal
We thank the referee for the thoughtful review and for recognizing the potential impact of Qwen3-Omni. We address the two major comments on the abstract below, providing point-by-point clarifications drawn from the full manuscript and committing to targeted revisions that improve transparency without altering the reported results.
read point-by-point responses
-
Referee: [Abstract] Abstract: the central claim that Qwen3-Omni 'maintains state-of-the-art performance across text, image, audio, and video without any degradation relative to single-modal counterparts' and 'matches the performance of same-sized single-modal models within the Qwen series' is load-bearing, yet no quantitative tables, error bars, ablation results, or protocol details (prompt templates, decoding, data versions) are referenced to support direct head-to-head comparisons under identical conditions.
Authors: We appreciate this observation. The manuscript contains the requested quantitative support in Sections 4 and 5. Section 4 presents head-to-head comparisons on text, image, and video benchmarks (Tables 1–4) against the corresponding single-modal Qwen2.5 and Qwen2 models of matching size, with per-task scores, standard deviations where multiple seeds were run, and explicit statements that no degradation occurs. Section 5 extends this to audio and audio-visual tasks (Tables 5–8). Ablations on the Thinker-Talker MoE routing, modality-specific adapters, and codebook usage appear in Section 6. Full protocol details—including prompt templates, decoding parameters (temperature, top-p), data versions, and benchmark splits—are provided in Section 3.3 and the appendix. To address the referee’s concern directly, we will revise the abstract to include explicit cross-references (e.g., “as shown in Tables 2 and 5 and detailed in Section 3.3”). This change makes the load-bearing claim traceable while preserving the abstract’s brevity. revision: yes
-
Referee: [Abstract] Abstract (audio benchmarks paragraph): the assertion of open-source SOTA on 32/36 and overall SOTA on 22 benchmarks versus Gemini-2.5-Pro, Seed-ASR, and GPT-4o-Transcribe rests on unstated evaluation equivalence; without disclosed re-evaluation of baselines under the same setup or exclusion rules, the cross-model superiority cannot be verified and directly affects the 'excels particularly on audio tasks' contribution.
Authors: We agree that evaluation equivalence must be stated clearly. The 32/36 open-source SOTA and 22 overall SOTA counts are derived from the standardized benchmark suite described in Section 5. For open-source models we report our own runs under identical prompts and decoding settings; for closed-source systems (Gemini-2.5-Pro, GPT-4o-Transcribe, Seed-ASR) we used the latest publicly released API versions with the exact same benchmark inputs and post-processing rules as our model. Any exclusions (e.g., language-specific subsets or modality mismatches) are enumerated in the appendix table that accompanies each benchmark. We will add a concise clarifying clause to the abstract (“evaluated under consistent protocols; see Section 5 and Appendix B”) and expand the evaluation paragraph in Section 5 to list the precise API versions, prompt templates, and exclusion criteria used for each baseline. These revisions will allow independent verification of the audio-task superiority claim. revision: partial
Circularity Check
No circularity; performance claims rest on external benchmark comparisons
full rationale
The paper presents Qwen3-Omni as a multimodal model whose central claims are empirical: it matches single-modal Qwen baselines and achieves SOTA on 32 of 36 audio benchmarks while outperforming closed-source models. These results are reported as direct evaluations rather than derived quantities. No equations, fitted parameters renamed as predictions, or self-referential definitions appear in the architecture description (Thinker-Talker MoE) or latency techniques. The fine-tuning for the Captioner variant is an explicit post-training step, not a circular derivation. Any self-citations (if present in the full text) are not load-bearing for the performance assertions, which rely on external benchmarks. The derivation chain is therefore self-contained against independent test sets.
Axiom & Free-Parameter Ledger
free parameters (2)
- MoE expert count and routing parameters
- Multi-codebook speech codec configuration
axioms (1)
- domain assumption A single model can match specialized single-modal performance across modalities when using appropriate architecture and training.
Forward citations
Cited by 60 Pith papers
-
Omni-DeepSearch: A Benchmark for Audio-Driven Omni-Modal Deep Search
Omni-DeepSearch is a 640-sample benchmark for audio-driven omni-modal search where the best model reaches only 43.44% accuracy, exposing bottlenecks in audio inference, tool use, and cross-modal reasoning.
-
TraceAV-Bench: Benchmarking Multi-Hop Trajectory Reasoning over Long Audio-Visual Videos
TraceAV-Bench is the first benchmark for multi-hop trajectory reasoning over long audio-visual videos, showing top models reach only 51-68% accuracy with substantial room for improvement.
-
RefereeBench: Are Video MLLMs Ready to be Multi-Sport Referees
RefereeBench shows that even the strongest video MLLMs reach only around 60% accuracy on multi-sport refereeing tasks and struggle with rule application and temporal grounding.
-
VoxSafeBench: Not Just What Is Said, but Who, How, and Where
VoxSafeBench reveals that speech language models recognize social norms from text but fail to apply them when acoustic cues like speaker or scene determine the appropriate response.
-
Boosting Omni-Modal Language Models: Staged Post-Training with Visually Debiased Evaluation
Visual debiasing of omni-modal benchmarks combined with staged post-training lets a 3B model match or exceed a 30B model without a stronger teacher.
-
FLARE: Full-Modality Long-Video Audiovisual Retrieval Benchmark with User-Simulated Queries
FLARE is a new benchmark with 399 long videos, 87k multimodal clips, and 275k user-style queries for testing audiovisual retrieval under caption and query regimes.
-
Omni-Persona: Systematic Benchmarking and Improving Omnimodal Personalization
Omni-Persona benchmark with 18 tasks shows open-source models have audio-visual grounding gaps, RLVR narrows them but leads to conservative outputs, and scale or recall alone fail as diagnostics.
-
MIST: Multimodal Interactive Speech-based Tool-calling Conversational Assistants for Smart Homes
MIST is a new synthetic speech-based tool-calling dataset for IoT devices that exposes performance gaps between open- and closed-weight multimodal LLMs.
-
Fusion in Your Way: Aligning Image Fusion with Heterogeneous Demands via Direct Preference Optimization
DPOFusion uses direct preference optimization on property-aligned and preference-controllable latent diffusion models to produce adaptive infrared-visible image fusions aligned with heterogeneous human and machine vis...
-
Minimizing Modality Gap from the Input Side: Your Speech LLM Can Be a Prosody-Aware Text LLM
TextPro-SLM minimizes the speech-text modality gap from the input side via a prosody-aware unified encoder, delivering the lowest gap and strong performance at 3B/7B scales with only ~1000 hours of audio.
-
Sparse Tokens Suffice: Jailbreaking Audio Language Models via Token-Aware Gradient Optimization
Sparse selection of high-gradient-energy audio tokens suffices for effective jailbreaking of audio language models with minimal drop in attack success rate.
-
EmoTrans: A Benchmark for Understanding, Reasoning, and Predicting Emotion Transitions in Multimodal LLMs
EmoTrans is a new video benchmark with four progressive tasks that measures how well current multimodal LLMs handle dynamic emotion transitions rather than static recognition.
-
StoryTR: Narrative-Centric Video Temporal Retrieval with Theory of Mind Reasoning
StoryTR is a new benchmark and agentic data pipeline that adds explicit Theory of Mind reasoning chains to train smaller video retrieval models, yielding a 15% relative IoU gain over larger baselines on narrative content.
-
Listening with Time: Precise Temporal Awareness for Long-Form Audio Understanding
LAT-Audio introduces a global-to-local reasoning approach with TWA-CoT that outperforms prior models on temporal tasks for audio up to 30 minutes.
-
SpeechParaling-Bench: A Comprehensive Benchmark for Paralinguistic-Aware Speech Generation
SpeechParaling-Bench is a new evaluation framework for paralinguistic-aware speech generation that reveals major limitations in current large audio-language models.
-
ATIR: Towards Audio-Text Interleaved Contextual Retrieval
Defines ATIR task and benchmark for mixed audio-text queries; MLLM model with token compression shows substantial gains over strong baselines.
-
From Reactive to Proactive: Assessing the Proactivity of Voice Agents via ProVoice-Bench
ProVoice-Bench is the first framework to evaluate proactive voice agents, revealing that state-of-the-art multimodal LLMs struggle with over-triggering and context-aware reasoning.
-
Chain of Modality: From Static Fusion to Dynamic Orchestration in Omni-MLLMs
Chain of Modality dynamically orchestrates multimodal input topologies and bifurcates cognitive execution to overcome static fusion biases in Omni-MLLMs.
-
HumDial-EIBench: A Human-Recorded Multi-Turn Emotional Intelligence Benchmark for Audio Language Models
HumDial-EIBench is a new benchmark using real human dialogues to evaluate audio language models on emotional intelligence tasks including multi-turn tracking, causal reasoning, empathy generation, and acoustic-semanti...
-
Script-a-Video: Deep Structured Audio-visual Captions via Factorized Streams and Relational Grounding
MTSS replaces monolithic video captions with factorized streams and relational grounding, yielding reported gains in understanding benchmarks and generation consistency.
-
OmniScript: Towards Audio-Visual Script Generation for Long-Form Cinematic Video
OmniScript is a new 8B omni-modal model that turns long cinematic videos into scene-by-scene scripts and matches top proprietary models on temporal localization and semantic accuracy.
-
VidAudio-Bench: Benchmarking V2A and VT2A Generation across Four Audio Categories
VidAudio-Bench benchmarks V2A and VT2A models across four audio categories, revealing poor speech/singing performance and a tension between visual alignment and text instruction following.
-
SiMing-Bench: Evaluating Procedural Correctness from Continuous Interactions in Clinical Skill Videos
SiMing-Bench shows current MLLMs have weak agreement with physicians on procedural correctness in clinical videos, with intermediate step judgments remaining poor even when overall scores look acceptable.
-
CapTalk: Unified Voice Design for Single-Utterance and Dialogue Speech Generation
CapTalk unifies single-utterance and dialogue voice design via utterance- and speaker-level captions plus a hierarchical variational module for stable timbre with adaptive expression.
-
Speaker-Reasoner: Scaling Interaction Turns and Reasoning Patterns for Timestamped Speaker-Attributed ASR
Speaker-Reasoner is an end-to-end speech LLM that iteratively analyzes audio structure, predicts temporal boundaries, and jointly models speaker identity, gender, timestamps, and transcription using a speaker-aware ca...
-
KoALa-Bench: Evaluating Large Audio Language Models on Korean Speech Understanding and Faithfulness
KoALa-Bench is a new public benchmark with six tasks that tests Korean speech recognition, translation, question answering, instruction following, and faithfulness in large audio language models.
-
DecepGPT: Schema-Driven Deception Detection with Multicultural Datasets and Robust Multimodal Learning
A new 1695-sample multicultural dataset plus two modules for stable multimodal fusion and modality consistency yield state-of-the-art deception detection with cross-cultural transfer.
-
TiCo: Time-Controllable Spoken Dialogue Model
TiCo enables spoken dialogue models to follow explicit time constraints in generated responses using Spoken Time Markers and reinforcement learning with verifiable rewards, cutting duration error by 2.7x over its backbone.
-
OmniTrace: A Unified Framework for Generation-Time Attribution in Omni-Modal LLMs
OmniTrace converts token-level signals into span-level cross-modal attributions for open-ended generation in omni-modal LLMs via generation-time tracing.
-
SpeakerLLM: A Speaker-Specialized Audio-LLM for Speaker Understanding and Verification Reasoning
SpeakerLLM unifies speaker profiling, recording-condition understanding, and structured verification reasoning in an audio-LLM via a hierarchical tokenizer and decision traces.
-
Towards Fine-Grained Multi-Dimensional Speech Understanding: Data Pipeline, Benchmark, and Model
A data pipeline, 14-dimension benchmark, and decoupled fine-tuning model are presented to advance fine-grained multi-dimensional speech understanding in LLMs.
-
Boosting Omni-Modal Language Models: Staged Post-Training with Visually Debiased Evaluation
Staged post-training with self-distillation lets a 3B omni-modal model match or slightly exceed a 30B model on a visually debiased benchmark.
-
Meow-Omni 1: A Multimodal Large Language Model for Feline Ethology
Meow-Omni 1 is a quad-modal MLLM that fuses video, audio, physiological time-series, and text to achieve 71.16% accuracy on feline intent recognition in the new MeowBench benchmark.
-
NICE FACT: Diagnosing and Calibrating VLMs in Quantitative Reasoning for Kinematic Physics
VLMs fail to identify visual preconditions or apply physical laws in kinematic physics tasks, as shown by new FACT diagnostics and NICE calibration methods evaluated on six state-of-the-art models.
-
Head Similarity: Modeling Structured Whole-Head Appearance Beyond Face Recognition
Head Similarity extends identity recognition to structured whole-head similarity by capturing intra-identity appearance variations via hierarchical supervision on a weakly-labeled video benchmark.
-
VocalParse: Towards Unified and Scalable Singing Voice Transcription with Large Audio Language Models
VocalParse applies interleaved and Chain-of-Thought prompting to a Large Audio Language Model to jointly transcribe lyrics, melody and word-note alignments, achieving state-of-the-art results on multiple singing datasets.
-
JASTIN: Aligning LLMs for Zero-Shot Audio and Speech Evaluation via Natural Language Instructions
JASTIN is an instruction-driven audio evaluation system that achieves state-of-the-art correlation with human ratings on speech, sound, music, and out-of-domain tasks without task-specific retraining.
-
When Audio-Language Models Fail to Leverage Multimodal Context for Dysarthric Speech Recognition
Current audio-language models fail to use clinical multimodal context for dysarthric speech recognition, but context-aware LoRA fine-tuning delivers large accuracy gains on the SAP dataset.
-
CTM-AI: A Blueprint for General AI Inspired by a Model of Consciousness
CTM-AI combines a formal consciousness model with foundation models to report state-of-the-art results on sarcasm detection, humor, and agentic tool-use benchmarks.
-
MiniCPM-o 4.5: Towards Real-Time Full-Duplex Omni-Modal Interaction
MiniCPM-o 4.5 uses the Omni-Flow streaming framework to deliver real-time full-duplex omni-modal interaction with proactive behavior in a 9B model that approaches Gemini 2.5 Flash performance.
-
All That Glitters Is Not Audio: Rethinking Text Priors and Audio Reliance in Audio-Language Evaluation
Audio-language models retain 60-72% of benchmark scores without audio, and most audio-dependent items can be solved from short fragments rather than full clips.
-
HeadRouter: Dynamic Head-Weight Routing for Task-Adaptive Audio Token Pruning in Large Audio Language Models
HeadRouter prunes audio tokens more effectively by dynamically routing based on per-head importance for semantic versus acoustic tasks, exceeding baseline performance at 70% token retention on Qwen2.5-Omni models.
-
TTS-PRISM: A Perceptual Reasoning and Interpretable Speech Model for Fine-Grained Diagnosis
TTS-PRISM defines a 12-dimensional perceptual schema, builds a targeted diagnostic dataset via adversarial synthesis and expert labels, and tunes an end-to-end model that outperforms generalist LLMs in human alignment...
-
OmniHuman: A Large-scale Dataset and Benchmark for Human-Centric Video Generation
OmniHuman is a new large-scale multi-scene dataset with video-, frame-, and individual-level annotations for human-centric video generation, accompanied by the OHBench benchmark that adds metrics aligned with human pe...
-
Beyond Text-Dominance: Understanding Modality Preference of Omni-modal Large Language Models
Omni-modal LLMs exhibit visual preference that emerges in mid-to-late layers, enabling hallucination detection without task-specific training.
-
AVRT: Audio-Visual Reasoning Transfer through Single-Modality Teachers
AVRT transfers reasoning to audio-visual models by distilling traces from single-modality teachers via LLM merger followed by SFT cold-start and RL, achieving SOTA on OmniBench, DailyOmni, and MMAR with 3B/7B models.
-
Audio2Tool: Speak, Call, Act -- A Dataset for Benchmarking Speech Tool Use
Audio2Tool is a new benchmark dataset that shows speech models perform well on simple commands but degrade sharply on compositional tasks and realistic acoustic noise.
-
Rethinking Entropy Allocation in LLM-based ASR: Understanding the Dynamics between Speech Encoders and LLMs
A multi-stage training method for LLM-based ASR uses new entropy allocation metrics to achieve competitive benchmark performance with 2.3B parameters while mitigating hallucinations via better encoder-LLM decoupling.
-
LPM 1.0: Video-based Character Performance Model
LPM 1.0 generates infinite-length, identity-stable, real-time audio-visual conversational performances for single characters using a distilled causal diffusion transformer and a new benchmark.
-
Generating Synthetic Doctor-Patient Conversations for Long-form Audio Summarization
A three-stage synthetic data pipeline generates 8800 doctor-patient conversations totaling 1.3k hours of audio and LLM-produced SOAP notes, with evaluation showing cascaded transcription-then-summarization models outp...
-
Video-MME-v2: Towards the Next Stage in Benchmarks for Comprehensive Video Understanding
Video-MME-v2 is a new benchmark that applies progressive visual-to-reasoning levels and non-linear group scoring to expose gaps in video MLLM capabilities.
-
OmniRefine: Alignment-Aware Cooperative Compression for Efficient Omnimodal Large Language Models
OmniRefine introduces alignment-aware chunk refinement via similarity and dynamic programming followed by modality-cooperative token compression, achieving near-baseline accuracy at 44% token retention on WorldSense.
-
AllocMV: Optimal Resource Allocation for Music Video Generation via Structured Persistent State
AllocMV uses a global planner to build a structured persistent state then solves a Multiple-Choice Knapsack Problem to allocate High-Gen, Mid-Gen, and Reuse compute branches, achieving an optimal Cost-Quality Ratio un...
-
EmoS: A High-Fidelity Multimodal Benchmark for Fine-grained Streaming Emotional Understanding
EmoS is a new high-fidelity benchmark for fine-grained streaming emotional understanding that produces measurable gains when used to fine-tune multimodal large language models.
-
AudioFace: Language-Assisted Speech-Driven Facial Animation with Multimodal Language Models
AudioFace improves speech-driven facial animation by guiding blendshape prediction with linguistic and articulatory information extracted via multimodal language models.
-
PRIMED: Adaptive Modality Suppression for Referring Audio-Visual Segmentation via Biased Competition
PRIMED improves referring audio-visual segmentation by using a modality prior decoder and competition-aware fusion to adaptively suppress irrelevant modalities.
-
Task-Aware Answer Preservation under Audio Compression for Large Audio Language Models
A statistical sign-off protocol for audio compressors ensures worst-case answer preservation across query families in LALMs.
-
Minimizing Modality Gap from the Input Side: Your Speech LLM Can Be a Prosody-Aware Text LLM
TextPro-SLM reduces the speech-text modality gap by feeding an LLM backbone with synchronized text tokens and prosody embeddings from WhisperPro, achieving lowest gap scores at 3B/7B scales with roughly 1,000 hours of audio.
-
Full-Duplex Interaction in Spoken Dialogue Systems: A Comprehensive Study from the ICASSP 2026 HumDial Challenge
A new HumDial-FDBench benchmark and real human-recorded dual-channel dataset are released to assess full-duplex dialogue systems on interruptions and conversational flow.
-
Sema: Semantic Transport for Real-Time Multimodal Agents
Sema reduces uplink bandwidth by 64x for audio and 130-210x for screenshots while keeping multimodal agent task accuracy within 0.7 percentage points of raw baselines in WAN simulations.
Reference graph
Works this paper leans on
-
[1]
URL https://artofproblemsolving.com/wiki/index.php/A IME_Problems_and_Solutions. Philip Anastassiou, Jiawei Chen, Jitong Chen, Yuanzhe Chen, Zhuo Chen, Ziyi Chen, Jian Cong, Lelai Deng, Chuang Ding, Lu Gao, et al. Seed-tts: A family of high-quality versatile speech generation models.arXiv preprint arXiv:2406.02430,
-
[2]
URL https://www-cdn.anthropic.com/de8ba9b01c9ab7cbabf5c33b80b7bbc618857627/Model_Card_Claude_ 3.pdf. Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, Binyuan Hui, Luo Ji, Mei Li, Junyang Lin, Runji Lin, Dayiheng Liu, Gao Liu, Chengqiang Lu, Keming Lu, Jianxin Ma, Rui Men, Xingzhang Ren, Xuancheng...
work page internal anchor Pith review Pith/arXiv arXiv
-
[3]
Are We on the Right Way for Evaluating Large Vision-Language Models?
Lin Chen, Jinsong Li, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Zehui Chen, Haodong Duan, Jiaqi Wang, Yu Qiao, Dahua Lin, et al. Are we on the right way for evaluating large vision-language models? arXiv:2403.20330, 2024a. Yiming Chen, Xianghu Yue, Chen Zhang, Xiaoxue Gao, Robby T Tan, and Haizhou Li. Voicebench: Benchmarking llm-based voice assistants.arXiv p...
work page internal anchor Pith review arXiv
-
[4]
Qwen-Audio: Advancing Universal Audio Understanding via Unified Large-Scale Audio-Language Models
21 Yunfei Chu, Jin Xu, Xiaohuan Zhou, Qian Yang, Shiliang Zhang, Zhijie Yan, Chang Zhou, and Jingren Zhou. Qwen-Audio: Advancing universal audio understanding via unified large-scale audio-language models.CoRR, abs/2311.07919,
work page internal anchor Pith review arXiv
-
[5]
Yunfei Chu, Jin Xu, Qian Yang, Haojie Wei, Xipin Wei, Zhifang Guo, Yichong Leng, Yuanjun Lv, Jinzheng He, Junyang Lin, et al. Qwen2-audio technical report.arXiv preprint arXiv:2407.10759,
work page internal anchor Pith review arXiv
-
[6]
Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities.arXiv preprint arXiv:2507.06261,
work page internal anchor Pith review Pith/arXiv arXiv
-
[7]
CosyVoice 2: Scalable Streaming Speech Synthesis with Large Language Models
Zhihao Du, Yuxuan Wang, Qian Chen, Xian Shi, Xiang Lv, Tianyu Zhao, Zhifu Gao, Yexin Yang, Changfeng Gao, Hui Wang, et al. Cosyvoice 2: Scalable streaming speech synthesis with large language models.arXiv preprint arXiv:2412.10117,
work page internal anchor Pith review arXiv
-
[8]
Zhihao Du, Changfeng Gao, Yuxuan Wang, Fan Yu, Tianyu Zhao, Hao Wang, Xiang Lv, Hui Wang, Chongjia Ni, Xian Shi, Keyu An, Guanrou Yang, Yabin Li, Yanni Chen, Zhifu Gao, Qian Chen, Yue Gu, Mengzhe Chen, Yafeng Chen, Shiliang Zhang, Wen Wang, and Jieping Ye. Cosyvoice 3: Towards in-the-wild speech generation via scaling-up and post-training.CoRR, abs/2505.17589,
-
[9]
Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, Anirudh Goyal, Anthony Hartshorn, Aobo Yang, Archi Mitra, Archie Sravankumar, Artem Korenev, Arthur Hinsvark, Arun Rao, Aston Zhang, Aurélien Rodriguez, Austen Gregerson, Ava Spataru, Baptiste Rozière, Bethany...
work page internal anchor Pith review Pith/arXiv arXiv
-
[10]
Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis
Chaoyou Fu, Yuhan Dai, Yondong Luo, Lei Li, Shuhuai Ren, Renrui Zhang, Zihan Wang, Chenyu Zhou, Yunhang Shen, Mengdan Zhang, et al. Video-mme: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis.arXiv:2405.21075,
work page internal anchor Pith review arXiv
-
[11]
Are we done with mmlu? CoRR, abs/2406.04127,
Aryo Pradipta Gema, Joshua Ong Jun Leang, Giwon Hong, Alessio Devoto, Alberto Carlo Maria Mancino, Rohit Saxena, Xuanli He, Yu Zhao, Xiaotang Du, Mohammad Reza Ghasemi Madani, et al. Are we done with mmlu?CoRR, abs/2406.04127,
-
[12]
Audio Flamingo 3: Advancing Audio Intelligence with Fully Open Large Audio Language Models
URL https://storage.googleapis.com/deepmind-media/gemini/gemi ni_v1_5_report.pdf. Arushi Goel, Sreyan Ghosh, Jaehyeon Kim, Sonal Kumar, Zhifeng Kong, Sang-gil Lee, Chao-Han Huck Yang, Ramani Duraiswami, Dinesh Manocha, Rafael Valle, et al. Audio flamingo 3: Advancing audio intelligence with fully open large audio language models.arXiv preprint arXiv:2507.08128,
work page internal anchor Pith review arXiv
-
[13]
Tianrui Guan, Fuxiao Liu, Xiyang Wu, Ruiqi Xian, Zongxia Li, Xiaoyu Liu, Xijun Wang, Lichang Chen, Furong Huang, Yaser Yacoob, Dinesh Manocha, and Tianyi Zhou. Hallusionbench: An advanced diagnostic suite for entangled language hallucination and visual illusion in large vision-language models. InIEEE/CVF Conference on Computer Vision and Pattern Recogniti...
work page 2024
-
[14]
arXiv preprint arXiv:2410.15553 , year=
22 Yun He, Di Jin, Chaoqi Wang, Chloe Bi, Karishma Mandyam, Hejia Zhang, Chen Zhu, Ning Li, Tengyu Xu, Hongjiang Lv, Shruti Bhosale, Chenguang Zhu, Karthik Abinav Sankararaman, Eryk Helenowski, Melanie Kambadur, Aditya Tayade, Hao Ma, Han Fang, and Sinong Wang. Multi-if: Benchmarking llms on multi-turn and multilingual instructions following.CoRR, abs/2410.15553,
-
[15]
arXiv preprint arXiv:2410.15553 , year=
doi: 10.48550 /ARXIV.2410.15553. URLhttps://doi.org/10.48550/arXiv.2410.15553. Jack Hong, Shilin Yan, Jiayin Cai, Xiaolong Jiang, Yao Hu, and Weidi Xie. Worldsense: Evaluating real-world omnimodal understanding for multimodal llms.CoRR, abs/2502.04326,
-
[16]
Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models.arXiv:2301.12597,
work page internal anchor Pith review arXiv
-
[17]
Zebralogic: On the scaling limits of llms for logical reasoning
Bill Yuchen Lin, Ronan Le Bras, Kyle Richardson, Ashish Sabharwal, Radha Poovendran, Peter Clark, and Yejin Choi. ZebraLogic: On the scaling limits of LLMs for logical reasoning.CoRR, abs/2502.01100,
-
[18]
Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning.arXiv:2304.08485,
work page internal anchor Pith review arXiv
-
[19]
Chartqa: A benchmark for question answering about charts with visual and logical reasoning, 2022
Ahmed Masry, Do Xuan Long, Jia Qing Tan, Shafiq Joty, and Enamul Hoque. Chartqa: A benchmark for question answering about charts with visual and logical reasoning.arXiv:2203.10244,
-
[20]
URL https://github.com/openai/openai-python/blob/e389823ba013a24b4c3 2ce38fa0bd87e6bccae94/chatml.md. OpenAI. GPT4 technical report.CoRR, abs/2303.08774,
work page internal anchor Pith review Pith/arXiv arXiv
-
[21]
Roni Paiss, Ariel Ephrat, Omer Tov, Shiran Zada, Inbar Mosseri, Michal Irani, and Tali Dekel
URLhttps://eqbench.com/creative_writing.html. Roni Paiss, Ariel Ephrat, Omer Tov, Shiran Zada, Inbar Mosseri, Michal Irani, and Tali Dekel. Teaching CLIP to count to ten. InIEEE/CVF International Conference on Computer Vision, ICCV 2023, Paris, France, October 1-6, 2023, pp. 3147–3157. IEEE,
work page 2023
-
[22]
David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, and Samuel R. Bowman. GPQA: A graduate-level Google-proof Q&A benchmark. CoRR, abs/2311.12022,
work page internal anchor Pith review arXiv
-
[23]
MMAU: A Massive Multi-Task Audio Understanding and Reasoning Benchmark
URLhttps://arxiv.org/abs/2410.19168. Jianlin Su, Murtadha H. M. Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. Roformer: Enhanced Transformer with rotary position embedding.Neurocomputing, 568:127063,
work page internal anchor Pith review arXiv
-
[24]
Changli Tang, Yixuan Li, Yudong Yang, Jimin Zhuang, Guangzhi Sun, Wei Li, Zejun Ma, and Chao Zhang. video-salmonn 2: Captioning-enhanced audio-visual large language models.CoRR, abs/2506.15220,
-
[25]
Llama 2: Open Foundation and Fine-Tuned Chat Models
23 Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models.arXiv:2307.09288,
work page internal anchor Pith review Pith/arXiv arXiv
-
[26]
Michael Tschannen, Alexey Gritsenko, Xiao Wang, Muhammad Ferjad Naeem, Ibrahim Alabdulmohsin, Nikhil Parthasarathy, Talfan Evans, Lucas Beyer, Ye Xia, Basil Mustafa, Olivier Hénaff, Jeremiah Harmsen, Andreas Steiner, and Xiaohua Zhai. Siglip 2: Multilingual vision-language encoders with improved semantic understanding, localization, and dense features.htt...
work page internal anchor Pith review Pith/arXiv arXiv
-
[27]
Towards understanding chain-of-thought prompting: An empirical study of what matters
Dingdong Wang, Jincenzi Wu, Junan Li, Dongchao Yang, Xueyuan Chen, Tianhua Zhang, and Helen Meng. MMSU: A massive multi-task spoken language understanding and reasoning benchmark.CoRR, abs/2506.04779, 2025a. doi: 10.48550/ARXIV.2506.04779. URL https://doi.org/10.48550/arXiv.250 6.04779. Ke Wang, Junting Pan, Weikang Shi, Zimu Lu, Mingjie Zhan, and Hongshe...
-
[28]
An Yang, Baosong Yang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Zhou, Chengpeng Li, Chengyuan Li, Dayiheng Liu, Fei Huang, et al. Qwen2 technical report.arXiv:2407.10671,
work page internal anchor Pith review Pith/arXiv arXiv
-
[29]
An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report.arXiv preprint arXiv:2505.09388, 2025a. Qize Yang, Shimin Yao, Weixuan chen, Shenghao Fu, Detao Bai, Jiaxing Zhao, Boyuan Sun, Bowen Yin, Xihan Wei, and Jingren Zhou. Humanomniv2: From understanding to omni-m...
work page internal anchor Pith review Pith/arXiv arXiv
-
[30]
MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI
24 Xiang Yue, Yuansheng Ni, Kai Zhang, Tianyu Zheng, Ruoqi Liu, Ge Zhang, Samuel Stevens, Dongfu Jiang, Weiming Ren, Yuxuan Sun, et al. Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for expert agi.arXiv:2311.16502,
work page internal anchor Pith review arXiv
-
[31]
MMMU-Pro: A More Robust Multi-discipline Multimodal Understanding Benchmark
Xiang Yue, Tianyu Zheng, Yuansheng Ni, Yubo Wang, Kai Zhang, Shengbang Tong, Yuxuan Sun, Ming Yin, Botao Yu, Ge Zhang, et al. Mmmu-pro: A more robust multi-discipline multimodal understanding benchmark.arXiv preprint arXiv:2409.02813,
work page internal anchor Pith review arXiv
-
[32]
Yongyi Zang, Sean O’Brien, Taylor Berg-Kirkpatrick, Julian McAuley, and Zachary Novack. Are you really listening? boosting perceptual awareness in music-qa benchmarks.arXiv preprint arXiv:2504.00369,
-
[33]
15 Xin Zhang, Dong Zhang, Shimin Li, Yaqian Zhou, and Xipeng Qiu
Bowen Zhang, Congchao Guo, Geng Yang, Hang Yu, Haozhe Zhang, Heidi Lei, Jialong Mai, Junjie Yan, Kaiyue Yang, Mingqi Yang, Peikai Huang, Ruiyang Jin, Sitan Jiang, Weihua Cheng, Yawei Li, Yichen Xiao, Yiying Zhou, Yongmao Zhang, Yuan Lu, and Yucen He. Minimax-speech: Intrinsic zero-shot text-to-speech with a learnable speaker encoder.CoRR, abs/2505.07916,
-
[34]
Group Sequence Policy Optimization
Chujie Zheng, Shixuan Liu, Mingze Li, Xiong-Hui Chen, Bowen Yu, Chang Gao, Kai Dang, Yuqiong Liu, Rui Men, An Yang, et al. Group sequence policy optimization.arXiv preprint arXiv:2507.18071,
work page internal anchor Pith review Pith/arXiv arXiv
-
[35]
Instruction-Following Evaluation for Large Language Models
Jeffrey Zhou, Tianjian Lu, Swaroop Mishra, Siddhartha Brahma, Sujoy Basu, Yi Luan, Denny Zhou, and Le Hou. Instruction-following evaluation for large language models.CoRR, abs/2311.07911,
work page internal anchor Pith review Pith/arXiv arXiv
-
[36]
Daily-omni: Towards audio-visual reasoning with temporal alignment across modalities, 2025
Junjie Zhou, Yan Shu, Bo Zhao, Boya Wu, Zhengyang Liang, Shitao Xiao, Minghao Qin, Xi Yang, Yongping Xiong, Bo Zhang, Tiejun Huang, and Zheng Liu. MLVU: benchmarking multi-task long video understanding. InIEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2025, Nashville, TN, USA, June 11-15, 2025, pp. 13691–13701. Computer Vision Founda...
-
[37]
Haina Zhu, Yizhi Zhou, Hangting Chen, Jianwei Yu, Ziyang Ma, Rongzhi Gu, Yi Luo, Wei Tan, and Xie Chen. Muq: Self-supervised music representation learning with mel residual vector quantization. arXiv preprint arXiv:2501.01108,
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.