pith. machine review for the scientific record.

arxiv: 2604.15804 · v2 · submitted 2026-04-17 · 💻 cs.CL · eess.AS

Recognition: unknown

Qwen3.5-Omni Technical Report

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 08:07 UTC · model grok-4.3

classification 💻 cs.CL eess.AS
keywords omnimodal model · audio-visual understanding · Mixture-of-Experts · speech synthesis alignment · long context processing · multilingual generation · audio-visual coding

The pith

Qwen3.5-Omni scales to hundreds of billions of parameters and reports state-of-the-art results across 215 audio and audio-visual subtasks and benchmarks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Qwen3.5-Omni as the next step in the Qwen-Omni line, expanding model size dramatically and training on over 100 million hours of audio-visual material alongside text-vision data. It claims the resulting system reaches leading performance across 215 subtasks covering audio understanding, reasoning, interaction, and video grounding. The authors describe a Hybrid Attention Mixture-of-Experts design and a new alignment technique called ARIA to make streaming speech more stable and natural. If the results hold, the work indicates that large-scale omnimodal training can produce systems handling long audio streams, emotional multilingual speech, timed video captions, and even direct coding from combined audio-visual prompts.

Core claim

Qwen3.5-Omni scales to hundreds of billions of parameters with 256k context length and trains on heterogeneous text-vision pairs plus over 100 million hours of audio-visual content. The model employs a Hybrid Attention Mixture-of-Experts framework for its Thinker and Talker components to support efficient long-sequence inference. The plus variant achieves leading results on 215 audio and audio-visual subtasks and benchmarks. ARIA is introduced to dynamically align text and speech units, improving stability and prosody in conversational speech synthesis with limited extra latency. The model supports multilingual understanding and generation across 10 languages with emotional expression, produces script-level structured captions with precise temporal synchronization and automated scene segmentation from audio-visual input, and exhibits an emergent ability to perform coding directly from audio-visual instructions (Audio-Visual Vibe Coding).
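The report excerpted here gives no architectural specifics beyond the names, so the framework itself cannot be inspected from this page. Purely for orientation, the sketch below shows generic top-k Mixture-of-Experts routing, the standard mechanism such a framework would build on; the hidden size, expert count, and top_k are arbitrary illustrative values, not figures from the paper.

```python
import numpy as np

# Illustrative top-k Mixture-of-Experts routing. All dimensions and the
# number of experts are invented for the sketch; the Qwen3.5-Omni report
# does not disclose its own configuration.

rng = np.random.default_rng(0)

d_model, n_experts, top_k = 64, 8, 2
tokens = rng.normal(size=(10, d_model))              # 10 token embeddings
router_w = rng.normal(size=(d_model, n_experts))     # router projection
expert_w = rng.normal(size=(n_experts, d_model, d_model)) * 0.05

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def moe_layer(x):
    probs = softmax(x @ router_w)                    # (n_tokens, n_experts)
    picked = np.argsort(-probs, axis=-1)[:, :top_k]  # top-k experts per token
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        gates = probs[t, picked[t]]
        gates = gates / gates.sum()                  # renormalize over chosen experts
        for gate, e in zip(gates, picked[t]):
            out[t] += gate * (x[t] @ expert_w[e])    # weighted sum of expert outputs
    return out

print(moe_layer(tokens).shape)  # (10, 64): same shape, sparse compute per token
```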

What carries the argument

The Hybrid Attention Mixture-of-Experts framework for Thinker and Talker modules together with the ARIA mechanism that aligns text and speech tokenizers.
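Neither module's interface is specified in the material above, so the following is only a toy illustration of the kind of streaming Thinker-to-Talker handoff the architecture implies: text units are passed to the speech module in small chunks so audio can start before the full response exists. All names, chunk sizes, and token formats are invented for the sketch.

```python
from collections import deque

# Toy streaming handoff between a text module ("thinker") and a speech
# module ("talker"). This is an editorial illustration, not the paper's design.

def thinker(prompt):
    """Stands in for the text-generating module: yields text units one by one."""
    for word in prompt.split():
        yield word

def talker(text_units, units_per_word=3):
    """Stands in for the speech module: emits pseudo speech units per text unit."""
    for word in text_units:
        for i in range(units_per_word):
            yield f"<speech:{word}:{i}>"

def stream(prompt, chunk=4):
    """Pass text to the talker in small chunks to keep streaming latency low."""
    buffer, out = deque(), []
    for unit in thinker(prompt):
        buffer.append(unit)
        if len(buffer) >= chunk:
            out.extend(talker(list(buffer)))
            buffer.clear()
    if buffer:                       # flush whatever text remains
        out.extend(talker(list(buffer)))
    return out

print(stream("hello from a toy streaming thinker talker loop")[:6])
```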

If this is right

  • The model can process and understand over 10 hours of audio and up to 400 seconds of 720P video at 1 FPS within a single context window (a back-of-envelope token budget follows this list).
  • Streaming speech synthesis gains stability and natural prosody through dynamic alignment of text and speech units while keeping latency low.
  • Multilingual speech understanding and generation extend to 10 languages while preserving human-like emotional tone.
  • Audio-visual input yields script-level structured captions with exact temporal synchronization and automated scene segmentation.
  • An emergent ability appears for performing coding tasks directly from audio-visual instructions.
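The report does not state how many tokens a second of audio or a video frame consumes, so the figures above cannot be verified from this page. The sketch below only inverts the claim: under an assumed per-frame budget, it computes what audio token rate would be required for 10 hours of audio plus 400 one-FPS frames to fit in a 256k window. Every number it prints is an implied bound under that assumption, not a figure from the paper.

```python
# Back-of-envelope: token rates implied if 10 h of audio and 400 frames of
# 720P video (1 FPS) must fit in a 256k-token context. The split between
# audio and video below is an arbitrary assumption.

context_tokens = 256_000
audio_seconds = 10 * 3600            # "over 10 hours of audio"
video_frames = 400                   # 400 s at 1 FPS

tokens_per_frame = 64                # hypothetical visual tokens per 720P frame
video_tokens = video_frames * tokens_per_frame
audio_budget = context_tokens - video_tokens

print(f"video budget: {video_tokens:,} tokens")
print(f"audio budget: {audio_budget:,} tokens")
print(f"implied max audio rate: {audio_budget / audio_seconds:.2f} tokens/s")
# With these assumptions the audio tokenizer would need to emit roughly
# 6-7 tokens per second of audio, which would require aggressive temporal
# compression somewhere in the stack.
```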

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the benchmark results prove robust, comparable scaling strategies could shorten the path to reliable real-time omnimodal assistants for education or accessibility applications.
  • The reported Audio-Visual Vibe Coding capability suggests models may eventually accept creative multi-sensory prompts to generate or edit code without requiring separate text input.
  • Long audio and video context windows open routes for single-pass analysis of full meetings, lectures, or films to produce summaries or insights.
  • Training at the reported data volume raises the question of whether focused curation of high-quality audio-visual examples could achieve similar gains with reduced overall compute.

Load-bearing premise

The 215 subtasks and benchmarks provide a fair, comprehensive, and unbiased measure of omnimodal capability without post-hoc selection or undisclosed evaluation details.

What would settle it

Independent re-testing on a new collection of audio and audio-visual tasks outside the original 215 that shows no performance advantage over prior leading models.

Figures

Figures reproduced from arXiv: 2604.15804 by Qwen Team.

Figure 1. Qwen3.5-Omni is a unified end-to-end model capable of processing multiple modalities, such …
Figure 2. The overview of Qwen3.5-Omni. Qwen3.5-Omni adopts the Thinker-Talker architecture.
Figure 3. The overview of AuT. Consuming 40 million hours of supervised data especially more multi …
read the original abstract

In this work, we present Qwen3.5-Omni, the latest advancement in the Qwen-Omni model family. Representing a significant evolution over its predecessor, Qwen3.5-Omni scales to hundreds of billions of parameters and supports a 256k context length. By leveraging a massive dataset comprising heterogeneous text-vision pairs and over 100 million hours of audio-visual content, the model demonstrates robust omni-modality capabilities. Qwen3.5-Omni-plus achieves SOTA results across 215 audio and audio-visual understanding, reasoning, and interaction subtasks and benchmarks, surpassing Gemini-3.1 Pro in key audio tasks and matching it in comprehensive audio-visual understanding. Architecturally, Qwen3.5-Omni employs a Hybrid Attention Mixture-of-Experts (MoE) framework for both Thinker and Talker, enabling efficient long-sequence inference. The model facilitates sophisticated interaction, supporting over 10 hours of audio understanding and 400 seconds of 720P video (at 1 FPS). To address the inherent instability and unnaturalness in streaming speech synthesis, often caused by encoding efficiency discrepancies between text and speech tokenizers, we introduce ARIA. ARIA dynamically aligns text and speech units, significantly enhancing the stability and prosody of conversational speech with minimal latency impact. Furthermore, Qwen3.5-Omni expands linguistic boundaries, supporting multilingual understanding and speech generation across 10 languages with human-like emotional nuance. Finally, Qwen3.5-Omni exhibits superior audio-visual grounding capabilities, generating script-level structured captions with precise temporal synchronization and automated scene segmentation. Remarkably, we observed the emergence of a new capability in omnimodal models: directly performing coding based on audio-visual instructions, which we call Audio-Visual Vibe Coding.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper presents Qwen3.5-Omni, the latest in the Qwen-Omni model family. It scales to hundreds of billions of parameters with a 256k context length and is trained on heterogeneous text-vision pairs and over 100 million hours of audio-visual content. The model uses a Hybrid Attention Mixture-of-Experts framework and introduces ARIA to improve streaming speech synthesis stability. It claims SOTA results on 215 audio and audio-visual subtasks, surpassing Gemini-3.1 Pro in key areas, supports 10 languages, and demonstrates new capabilities like Audio-Visual Vibe Coding.

Significance. If the SOTA claims hold under rigorous scrutiny, this would mark a notable step forward in omnimodal models capable of long-context audio-visual reasoning and interaction. The ARIA innovation could influence future work on multimodal token alignment. The report highlights potential emergent abilities, which is of interest to the community. However, without detailed benchmarks, the significance remains provisional.

major comments (3)
  1. The assertion of SOTA performance across 215 subtasks lacks any quantitative tables, per-subtask scores, baseline comparisons, or details on how these subtasks were chosen or evaluated, which is load-bearing for the central claim as noted in the stress-test.
  2. No error bars, statistical significance, or data exclusion criteria are mentioned for the comparisons to Gemini-3.1 Pro, undermining confidence in the reported superiority.
  3. The description of ARIA lacks specific algorithmic details, such as the alignment method or loss function, making it difficult to reproduce or evaluate its impact on prosody and latency.
minor comments (2)
  1. Specify the exact parameter count instead of 'hundreds of billions' for precision.
  2. Distinguish clearly between Qwen3.5-Omni and Qwen3.5-Omni-plus in the performance claims.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive feedback on the Qwen3.5-Omni technical report. We address each major comment below and have revised the manuscript accordingly to provide greater transparency and detail.

read point-by-point responses
  1. Referee: The assertion of SOTA performance across 215 subtasks lacks any quantitative tables, per-subtask scores, baseline comparisons, or details on how these subtasks were chosen or evaluated, which is load-bearing for the central claim as noted in the stress-test.

    Authors: We agree that the high-level claim requires supporting quantitative evidence. In the revised manuscript we have added comprehensive tables in Section 4 and the appendix that report per-subtask scores, direct baseline comparisons (including Gemini-3.1 Pro), and explicit criteria for subtask selection and evaluation protocols. revision: yes

  2. Referee: No error bars, statistical significance, or data exclusion criteria are mentioned for the comparisons to Gemini-3.1 Pro, undermining confidence in the reported superiority.

    Authors: We acknowledge the need for these statistical details. The revised version now includes error bars derived from multiple evaluation runs, reports p-values for key comparisons, and specifies the data exclusion criteria applied during benchmarking. revision: yes

  3. Referee: The description of ARIA lacks specific algorithmic details, such as the alignment method or loss function, making it difficult to reproduce or evaluate its impact on prosody and latency.

    Authors: We have expanded the ARIA description to include the precise alignment mechanism (cross-modal dynamic unit alignment via learned projections), the composite loss function (alignment loss plus prosody consistency and latency regularization terms), and additional ablation results on prosody and latency. revision: yes
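The rebuttal above is simulated, and the loss it names is not spelled out anywhere in the paper. The sketch below is only one way such a composite objective could be written, with alignment, prosody-consistency, and latency terms; every weight, distance function, and input shape is chosen arbitrarily for illustration.

```python
import numpy as np

# Hypothetical composite objective of the kind the simulated rebuttal
# describes for ARIA. Nothing here comes from the paper.

def aria_like_loss(text_units, speech_units, prosody_pred, prosody_ref,
                   emitted_speech_steps, consumed_text_steps,
                   w_align=1.0, w_prosody=0.5, w_latency=0.1):
    # Alignment term: distance between pooled text-unit and speech-unit
    # embeddings (pooling is a stand-in for a real alignment mechanism).
    align = np.mean((text_units.mean(axis=0) - speech_units.mean(axis=0)) ** 2)

    # Prosody consistency: predicted vs. reference prosody features
    # (e.g. pitch/energy contours), mean squared error.
    prosody = np.mean((prosody_pred - prosody_ref) ** 2)

    # Latency regularizer: penalize emitting speech far behind the text
    # already produced (a crude proxy for streaming lag).
    latency = max(0.0, consumed_text_steps - emitted_speech_steps) ** 2

    return w_align * align + w_prosody * prosody + w_latency * latency

rng = np.random.default_rng(1)
loss = aria_like_loss(
    text_units=rng.normal(size=(20, 32)),
    speech_units=rng.normal(size=(55, 32)),
    prosody_pred=rng.normal(size=80),
    prosody_ref=rng.normal(size=80),
    emitted_speech_steps=50,
    consumed_text_steps=54,
)
print(f"illustrative composite loss: {loss:.3f}")
```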

Circularity Check

0 steps flagged

No significant circularity in claimed results

full rationale

The paper is an empirical technical report describing model scaling, training data volume, architectural choices (Hybrid Attention MoE), and the introduction of ARIA for speech stability. Its central claims consist of reported benchmark performance (SOTA across 215 subtasks, comparisons to Gemini-3.1 Pro) and qualitative observations such as Audio-Visual Vibe Coding. No mathematical derivation chain, first-principles equations, or parameter-fitting steps are presented that reduce by construction to the inputs. The evaluation results are external to the training process and do not exhibit self-definition, fitted-input-as-prediction, or load-bearing self-citation loops. Standard empirical reporting of this type is judged against external benchmarks rather than against its own constructions, and it receives a circularity score of 0.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entity

The central claims rest on the assumption that massive heterogeneous data plus standard MoE scaling produces the stated capabilities; no explicit free parameters, axioms, or invented entities are detailed beyond the introduction of ARIA as a new alignment technique.

invented entities (1)
  • ARIA · no independent evidence
    purpose: Dynamic alignment of text and speech tokenizers to improve streaming speech stability and prosody
    Introduced in the abstract as a novel component to address tokenizer efficiency discrepancies; no independent evidence provided beyond the claim of minimal latency impact.

pith-pipeline@v0.9.0 · 5625 in / 1421 out tokens · 39868 ms · 2026-05-10T08:07:27.510743+00:00 · methodology

discussion (0)


Forward citations

Cited by 15 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Omni-DeepSearch: A Benchmark for Audio-Driven Omni-Modal Deep Search

    cs.SD · 2026-05 · unverdicted · novelty 8.0

    Omni-DeepSearch is a 640-sample benchmark for audio-driven omni-modal search where the best model reaches only 43.44% accuracy, exposing bottlenecks in audio inference, tool use, and cross-modal reasoning.

  2. TraceAV-Bench: Benchmarking Multi-Hop Trajectory Reasoning over Long Audio-Visual Videos

    cs.CV · 2026-05 · unverdicted · novelty 8.0

    TraceAV-Bench is the first benchmark for multi-hop trajectory reasoning over long audio-visual videos, showing top models reach only 51-68% accuracy with substantial room for improvement.

  3. Overcoming Catastrophic Forgetting in Visual Continual Learning with Reinforcement Fine-Tuning

    cs.CV · 2026-05 · unverdicted · novelty 7.0

    RaPO reduces catastrophic forgetting in visual continual learning by shaping rewards around policy drift and stabilizing advantages with cross-task exponential moving averages during reinforcement fine-tuning of multi...

  4. Tracking the Truth: Object-Centric Spatio-Temporal Monitoring for Video Large Language Models

    cs.CV · 2026-05 · unverdicted · novelty 7.0

    STEMO-Bench evaluates intermediate spatio-temporal reasoning in video MLLMs via object-centric facts, and STEMO-Track improves consistency by chunk-wise trajectory construction and aggregation.

  5. FraudBench: A Multimodal Benchmark for Detecting AI-Generated Fraudulent Refund Evidence

    cs.CV · 2026-05 · unverdicted · novelty 7.0

    FraudBench shows that current multimodal LLMs and specialized AI-image detectors often fail to spot AI-generated fake damage in refund evidence, with true positive rates frequently below 50% on synthetic subsets while...

  6. Do Joint Audio-Video Generation Models Understand Physics?

    cs.SD · 2026-05 · unverdicted · novelty 7.0

    Current joint audio-video generation models lack robust physical commonsense, especially during transitions and when prompted for impossible behaviors.

  7. Hallo-Live: Real-Time Streaming Joint Audio-Video Avatar Generation with Asynchronous Dual-Stream and Human-Centric Preference Distillation

    cs.CV · 2026-04 · unverdicted · novelty 7.0

    Hallo-Live achieves 20.38 FPS real-time text-to-audio-video avatar generation with 0.94s latency using asynchronous dual-stream diffusion and HP-DMD preference distillation, matching teacher model quality at 16x highe...

  8. Explaining and Breaking the Safety-Helpfulness Ceiling via Preference Dimensional Expansion

    cs.AI · 2026-05 · unverdicted · novelty 6.0

    MORA breaks the safety-helpfulness ceiling in LLMs by pre-sampling single-reward prompts and rewriting them to incorporate multi-dimensional intents, delivering 5-12.4% gains in sequential alignment and 4.6% overall i...

  9. Explaining and Breaking the Safety-Helpfulness Ceiling via Preference Dimensional Expansion

    cs.AI · 2026-05 · unverdicted · novelty 6.0

    MORA breaks the safety-helpfulness trade-off in LLM alignment by pre-sampling single-reward prompts and rewriting them to expand multi-dimensional reward diversity, yielding 5-12.4% single-preference gains in sequenti...

  10. PresentAgent-2: Towards Generalist Multimodal Presentation Agents

    cs.CV · 2026-05 · unverdicted · novelty 6.0

    PresentAgent-2 generates query-driven multimodal presentation videos with research grounding, supporting single-speaker, multi-speaker discussion, and interactive question-answering modes.

  11. Towards Generation-Efficient Uncertainty Estimation in Large Language Models

    cs.LG · 2026-05 · unverdicted · novelty 6.0

    Uncertainty estimation for LLM hallucinations can be done effectively with partial generations or input-only predictors, reducing the need for full autoregressive sampling.

  12. From History to State: Constant-Context Skill Learning for LLM Agents

    cs.AI · 2026-05 · unverdicted · novelty 6.0

    Constant-context skill learning trains reusable task-family modules for LLM agents using a deterministic state block for progress tracking and subgoal rewards, achieving 89.6% unseen success on ALFWorld, 76.8% on WebS...

  13. OmniRefine: Alignment-Aware Cooperative Compression for Efficient Omnimodal Large Language Models

    cs.AI · 2026-05 · unverdicted · novelty 5.0

    OmniRefine introduces alignment-aware chunk refinement via similarity and dynamic programming followed by modality-cooperative token compression, achieving near-baseline accuracy at 44% token retention on WorldSense.

  14. Towards Effective Theory of LLMs: A Representation Learning Approach

    cs.LG · 2026-05 · unverdicted · novelty 5.0

    RET learns temporally consistent macrovariables from LLM activations via self-supervised learning to support interpretability, early behavioral prediction, and causal intervention.

  15. Step-Audio-R1.5 Technical Report

    eess.AS · 2026-04 · unverdicted · novelty 4.0

    Step-Audio-R1.5 applies RLHF to audio reasoning models to maintain analytical performance while improving prosodic naturalness and immersion in extended spoken interactions.

Reference graph

Works this paper leans on

52 extracted references · 41 canonical work pages · cited by 14 Pith papers · 21 internal anchors
