pith. machine review for the scientific record.

arxiv: 2604.15804 · v2 · submitted 2026-04-17 · 💻 cs.CL · eess.AS

Recognition: unknown

Qwen3.5-Omni Technical Report

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 08:07 UTC · model grok-4.3

classification 💻 cs.CL eess.AS
keywords omnimodal model · audio-visual understanding · Mixture-of-Experts · speech synthesis alignment · long context processing · multilingual generation · audio-visual coding

The pith

Qwen3.5-Omni scales to hundreds of billions of parameters and reports state-of-the-art results across 215 audio and audio-visual subtasks and benchmarks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Qwen3.5-Omni as the next step in the Qwen-Omni line, expanding model size dramatically and training on over 100 million hours of audio-visual material alongside text-vision data. It claims the resulting system reaches leading performance across 215 subtasks covering audio understanding, reasoning, interaction, and video grounding. The authors describe a Hybrid Attention Mixture-of-Experts design and a new alignment technique called ARIA to make streaming speech more stable and natural. If the results hold, the work indicates that large-scale omnimodal training can produce systems handling long audio streams, emotional multilingual speech, timed video captions, and even direct coding from combined audio-visual prompts.

Core claim

Qwen3.5-Omni scales to hundreds of billions of parameters with 256k context length and trains on heterogeneous text-vision pairs plus over 100 million hours of audio-visual content. The model employs a Hybrid Attention Mixture-of-Experts framework for its Thinker and Talker components to support efficient long-sequence inference. The plus variant achieves leading results on 215 audio and audio-visual subtasks and benchmarks. ARIA is introduced to dynamically align text and speech units, improving stability and prosody in conversational speech synthesis with limited extra latency. The model supports multilingual understanding and generation across 10 languages with emotional expression, produces script-level structured captions with precise temporal synchronization and automated scene segmentation from audio-visual input, and exhibits an emergent ability to perform coding directly from audio-visual instructions (Audio-Visual Vibe Coding).
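The report excerpted here gives no architectural specifics beyond the names, so the framework itself cannot be inspected from this page. Purely for orientation, the sketch below shows generic top-k Mixture-of-Experts routing, the standard mechanism such a framework would build on; the hidden size, expert count, and top_k are arbitrary illustrative values, not figures from the paper.

```python
import numpy as np

# Illustrative top-k Mixture-of-Experts routing. All dimensions and the
# number of experts are invented for the sketch; the Qwen3.5-Omni report
# does not disclose its own configuration.

rng = np.random.default_rng(0)

d_model, n_experts, top_k = 64, 8, 2
tokens = rng.normal(size=(10, d_model))              # 10 token embeddings
router_w = rng.normal(size=(d_model, n_experts))     # router projection
expert_w = rng.normal(size=(n_experts, d_model, d_model)) * 0.05

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def moe_layer(x):
    probs = softmax(x @ router_w)                    # (n_tokens, n_experts)
    picked = np.argsort(-probs, axis=-1)[:, :top_k]  # top-k experts per token
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        gates = probs[t, picked[t]]
        gates = gates / gates.sum()                  # renormalize over chosen experts
        for gate, e in zip(gates, picked[t]):
            out[t] += gate * (x[t] @ expert_w[e])    # weighted sum of expert outputs
    return out

print(moe_layer(tokens).shape)  # (10, 64): same shape, sparse compute per token
```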

What carries the argument

The Hybrid Attention Mixture-of-Experts framework for Thinker and Talker modules together with the ARIA mechanism that aligns text and speech tokenizers.
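Neither module's interface is specified in the material above, so the following is only a toy illustration of the kind of streaming Thinker-to-Talker handoff the architecture implies: text units are passed to the speech module in small chunks so audio can start before the full response exists. All names, chunk sizes, and token formats are invented for the sketch.

```python
from collections import deque

# Toy streaming handoff between a text module ("thinker") and a speech
# module ("talker"). This is an editorial illustration, not the paper's design.

def thinker(prompt):
    """Stands in for the text-generating module: yields text units one by one."""
    for word in prompt.split():
        yield word

def talker(text_units, units_per_word=3):
    """Stands in for the speech module: emits pseudo speech units per text unit."""
    for word in text_units:
        for i in range(units_per_word):
            yield f"<speech:{word}:{i}>"

def stream(prompt, chunk=4):
    """Pass text to the talker in small chunks to keep streaming latency low."""
    buffer, out = deque(), []
    for unit in thinker(prompt):
        buffer.append(unit)
        if len(buffer) >= chunk:
            out.extend(talker(list(buffer)))
            buffer.clear()
    if buffer:                       # flush whatever text remains
        out.extend(talker(list(buffer)))
    return out

print(stream("hello from a toy streaming thinker talker loop")[:6])
```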

If this is right

  • The model can process and understand over 10 hours of audio and up to 400 seconds of 720P video at 1 FPS within a single context window (a back-of-envelope token budget follows this list).
  • Streaming speech synthesis gains stability and natural prosody through dynamic alignment of text and speech units while keeping latency low.
  • Multilingual speech understanding and generation extend to 10 languages while preserving human-like emotional tone.
  • Audio-visual input yields script-level structured captions with exact temporal synchronization and automated scene segmentation.
  • An emergent ability appears for performing coding tasks directly from audio-visual instructions.
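The report does not state how many tokens a second of audio or a video frame consumes, so the figures above cannot be verified from this page. The sketch below only inverts the claim: under an assumed per-frame budget, it computes what audio token rate would be required for 10 hours of audio plus 400 one-FPS frames to fit in a 256k window. Every number it prints is an implied bound under that assumption, not a figure from the paper.

```python
# Back-of-envelope: token rates implied if 10 h of audio and 400 frames of
# 720P video (1 FPS) must fit in a 256k-token context. The split between
# audio and video below is an arbitrary assumption.

context_tokens = 256_000
audio_seconds = 10 * 3600            # "over 10 hours of audio"
video_frames = 400                   # 400 s at 1 FPS

tokens_per_frame = 64                # hypothetical visual tokens per 720P frame
video_tokens = video_frames * tokens_per_frame
audio_budget = context_tokens - video_tokens

print(f"video budget: {video_tokens:,} tokens")
print(f"audio budget: {audio_budget:,} tokens")
print(f"implied max audio rate: {audio_budget / audio_seconds:.2f} tokens/s")
# With these assumptions the audio tokenizer would need to emit roughly
# 6-7 tokens per second of audio, which would require aggressive temporal
# compression somewhere in the stack.
```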

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the benchmark results prove robust, comparable scaling strategies could shorten the path to reliable real-time omnimodal assistants for education or accessibility applications.
  • The reported Audio-Visual Vibe Coding capability suggests models may eventually accept creative multi-sensory prompts to generate or edit code without requiring separate text input.
  • Long audio and video context windows open routes for single-pass analysis of full meetings, lectures, or films to produce summaries or insights.
  • Training at the reported data volume raises the question of whether focused curation of high-quality audio-visual examples could achieve similar gains with reduced overall compute.

Load-bearing premise

The 215 subtasks and benchmarks provide a fair, comprehensive, and unbiased measure of omnimodal capability without post-hoc selection or undisclosed evaluation details.

What would settle it

Independent re-testing on a new collection of audio and audio-visual tasks outside the original 215 that shows no performance advantage over prior leading models.

Figures

Figures reproduced from arXiv: 2604.15804 by Qwen Team.

Figure 1. Qwen3.5-Omni is a unified end-to-end model capable of processing multiple modalities, such …
Figure 2. The overview of Qwen3.5-Omni. Qwen3.5-Omni adopts the Thinker-Talker architecture.
Figure 3. The overview of AuT. Consuming 40 million hours of supervised data especially more multi …
read the original abstract

In this work, we present Qwen3.5-Omni, the latest advancement in the Qwen-Omni model family. Representing a significant evolution over its predecessor, Qwen3.5-Omni scales to hundreds of billions of parameters and supports a 256k context length. By leveraging a massive dataset comprising heterogeneous text-vision pairs and over 100 million hours of audio-visual content, the model demonstrates robust omni-modality capabilities. Qwen3.5-Omni-plus achieves SOTA results across 215 audio and audio-visual understanding, reasoning, and interaction subtasks and benchmarks, surpassing Gemini-3.1 Pro in key audio tasks and matching it in comprehensive audio-visual understanding. Architecturally, Qwen3.5-Omni employs a Hybrid Attention Mixture-of-Experts (MoE) framework for both Thinker and Talker, enabling efficient long-sequence inference. The model facilitates sophisticated interaction, supporting over 10 hours of audio understanding and 400 seconds of 720P video (at 1 FPS). To address the inherent instability and unnaturalness in streaming speech synthesis, often caused by encoding efficiency discrepancies between text and speech tokenizers, we introduce ARIA. ARIA dynamically aligns text and speech units, significantly enhancing the stability and prosody of conversational speech with minimal latency impact. Furthermore, Qwen3.5-Omni expands linguistic boundaries, supporting multilingual understanding and speech generation across 10 languages with human-like emotional nuance. Finally, Qwen3.5-Omni exhibits superior audio-visual grounding capabilities, generating script-level structured captions with precise temporal synchronization and automated scene segmentation. Remarkably, we observed the emergence of a new capability in omnimodal models: directly performing coding based on audio-visual instructions, which we call Audio-Visual Vibe Coding.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper presents Qwen3.5-Omni, the latest in the Qwen-Omni model family. It scales to hundreds of billions of parameters with a 256k context length and is trained on heterogeneous text-vision pairs and over 100 million hours of audio-visual content. The model uses a Hybrid Attention Mixture-of-Experts framework and introduces ARIA to improve streaming speech synthesis stability. It claims SOTA results on 215 audio and audio-visual subtasks, surpassing Gemini-3.1 Pro in key areas, supports 10 languages, and demonstrates new capabilities like Audio-Visual Vibe Coding.

Significance. If the SOTA claims hold under rigorous scrutiny, this would mark a notable step forward in omnimodal models capable of long-context audio-visual reasoning and interaction. The ARIA innovation could influence future work on multimodal token alignment. The report highlights potential emergent abilities, which is of interest to the community. However, without detailed benchmarks, the significance remains provisional.

major comments (3)
  1. The assertion of SOTA performance across 215 subtasks lacks any quantitative tables, per-subtask scores, baseline comparisons, or details on how these subtasks were chosen or evaluated, which is load-bearing for the central claim as noted in the stress-test.
  2. No error bars, statistical significance, or data exclusion criteria are mentioned for the comparisons to Gemini-3.1 Pro, undermining confidence in the reported superiority.
  3. The description of ARIA lacks specific algorithmic details, such as the alignment method or loss function, making it difficult to reproduce or evaluate its impact on prosody and latency.
minor comments (2)
  1. Specify the exact parameter count instead of 'hundreds of billions' for precision.
  2. Distinguish clearly between Qwen3.5-Omni and Qwen3.5-Omni-plus in the performance claims.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive feedback on the Qwen3.5-Omni technical report. We address each major comment below and have revised the manuscript accordingly to provide greater transparency and detail.

read point-by-point responses
  1. Referee: The assertion of SOTA performance across 215 subtasks lacks any quantitative tables, per-subtask scores, baseline comparisons, or details on how these subtasks were chosen or evaluated, which is load-bearing for the central claim as noted in the stress-test.

    Authors: We agree that the high-level claim requires supporting quantitative evidence. In the revised manuscript we have added comprehensive tables in Section 4 and the appendix that report per-subtask scores, direct baseline comparisons (including Gemini-3.1 Pro), and explicit criteria for subtask selection and evaluation protocols. revision: yes

  2. Referee: No error bars, statistical significance, or data exclusion criteria are mentioned for the comparisons to Gemini-3.1 Pro, undermining confidence in the reported superiority.

    Authors: We acknowledge the need for these statistical details. The revised version now includes error bars derived from multiple evaluation runs, reports p-values for key comparisons, and specifies the data exclusion criteria applied during benchmarking. revision: yes

  3. Referee: The description of ARIA lacks specific algorithmic details, such as the alignment method or loss function, making it difficult to reproduce or evaluate its impact on prosody and latency.

    Authors: We have expanded the ARIA description to include the precise alignment mechanism (cross-modal dynamic unit alignment via learned projections), the composite loss function (alignment loss plus prosody consistency and latency regularization terms), and additional ablation results on prosody and latency. revision: yes
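The rebuttal above is simulated, and the loss it names is not spelled out anywhere in the paper. The sketch below is only one way such a composite objective could be written, with alignment, prosody-consistency, and latency terms; every weight, distance function, and input shape is chosen arbitrarily for illustration.

```python
import numpy as np

# Hypothetical composite objective of the kind the simulated rebuttal
# describes for ARIA. Nothing here comes from the paper.

def aria_like_loss(text_units, speech_units, prosody_pred, prosody_ref,
                   emitted_speech_steps, consumed_text_steps,
                   w_align=1.0, w_prosody=0.5, w_latency=0.1):
    # Alignment term: distance between pooled text-unit and speech-unit
    # embeddings (pooling is a stand-in for a real alignment mechanism).
    align = np.mean((text_units.mean(axis=0) - speech_units.mean(axis=0)) ** 2)

    # Prosody consistency: predicted vs. reference prosody features
    # (e.g. pitch/energy contours), mean squared error.
    prosody = np.mean((prosody_pred - prosody_ref) ** 2)

    # Latency regularizer: penalize emitting speech far behind the text
    # already produced (a crude proxy for streaming lag).
    latency = max(0.0, consumed_text_steps - emitted_speech_steps) ** 2

    return w_align * align + w_prosody * prosody + w_latency * latency

rng = np.random.default_rng(1)
loss = aria_like_loss(
    text_units=rng.normal(size=(20, 32)),
    speech_units=rng.normal(size=(55, 32)),
    prosody_pred=rng.normal(size=80),
    prosody_ref=rng.normal(size=80),
    emitted_speech_steps=50,
    consumed_text_steps=54,
)
print(f"illustrative composite loss: {loss:.3f}")
```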

Circularity Check

0 steps flagged

No significant circularity in claimed results

full rationale

The paper is an empirical technical report describing model scaling, training data volume, architectural choices (Hybrid Attention MoE), and the introduction of ARIA for speech stability. Its central claims consist of reported benchmark performance (SOTA across 215 subtasks, comparisons to Gemini-3.1 Pro) and qualitative observations such as Audio-Visual Vibe Coding. No mathematical derivation chain, first-principles equations, or parameter-fitting steps are presented that reduce by construction to the inputs. The evaluation results are external to the training process and do not exhibit self-definition, fitted-input-as-prediction, or load-bearing self-citation loops. Standard empirical reporting of this type is judged against external benchmarks rather than against its own constructions, and it receives a circularity score of 0.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entity

The central claims rest on the assumption that massive heterogeneous data plus standard MoE scaling produces the stated capabilities; no explicit free parameters, axioms, or invented entities are detailed beyond the introduction of ARIA as a new alignment technique.

invented entities (1)
  • ARIA · no independent evidence
    purpose: Dynamic alignment of text and speech tokenizers to improve streaming speech stability and prosody
    Introduced in the abstract as a novel component to address tokenizer efficiency discrepancies; no independent evidence provided beyond the claim of minimal latency impact.

pith-pipeline@v0.9.0 · 5625 in / 1421 out tokens · 39868 ms · 2026-05-10T08:07:27.510743+00:00 · methodology

discussion (0)


Forward citations

Cited by 15 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Omni-DeepSearch: A Benchmark for Audio-Driven Omni-Modal Deep Search

    cs.SD · 2026-05 · unverdicted · novelty 8.0

    Omni-DeepSearch is a 640-sample benchmark for audio-driven omni-modal search where the best model reaches only 43.44% accuracy, exposing bottlenecks in audio inference, tool use, and cross-modal reasoning.

  2. TraceAV-Bench: Benchmarking Multi-Hop Trajectory Reasoning over Long Audio-Visual Videos

    cs.CV · 2026-05 · unverdicted · novelty 8.0

    TraceAV-Bench is the first benchmark for multi-hop trajectory reasoning over long audio-visual videos, showing top models reach only 51-68% accuracy with substantial room for improvement.

  3. Overcoming Catastrophic Forgetting in Visual Continual Learning with Reinforcement Fine-Tuning

    cs.CV · 2026-05 · unverdicted · novelty 7.0

    RaPO reduces catastrophic forgetting in visual continual learning by shaping rewards around policy drift and stabilizing advantages with cross-task exponential moving averages during reinforcement fine-tuning of multi...

  4. Tracking the Truth: Object-Centric Spatio-Temporal Monitoring for Video Large Language Models

    cs.CV · 2026-05 · unverdicted · novelty 7.0

    STEMO-Bench evaluates intermediate spatio-temporal reasoning in video MLLMs via object-centric facts, and STEMO-Track improves consistency by chunk-wise trajectory construction and aggregation.

  5. FraudBench: A Multimodal Benchmark for Detecting AI-Generated Fraudulent Refund Evidence

    cs.CV · 2026-05 · unverdicted · novelty 7.0

    FraudBench shows that current multimodal LLMs and specialized AI-image detectors often fail to spot AI-generated fake damage in refund evidence, with true positive rates frequently below 50% on synthetic subsets while...

  6. Do Joint Audio-Video Generation Models Understand Physics?

    cs.SD · 2026-05 · unverdicted · novelty 7.0

    Current joint audio-video generation models lack robust physical commonsense, especially during transitions and when prompted for impossible behaviors.

  7. Hallo-Live: Real-Time Streaming Joint Audio-Video Avatar Generation with Asynchronous Dual-Stream and Human-Centric Preference Distillation

    cs.CV · 2026-04 · unverdicted · novelty 7.0

    Hallo-Live achieves 20.38 FPS real-time text-to-audio-video avatar generation with 0.94s latency using asynchronous dual-stream diffusion and HP-DMD preference distillation, matching teacher model quality at 16x highe...

  8. Explaining and Breaking the Safety-Helpfulness Ceiling via Preference Dimensional Expansion

    cs.AI · 2026-05 · unverdicted · novelty 6.0

    MORA breaks the safety-helpfulness ceiling in LLMs by pre-sampling single-reward prompts and rewriting them to incorporate multi-dimensional intents, delivering 5-12.4% gains in sequential alignment and 4.6% overall i...

  9. Explaining and Breaking the Safety-Helpfulness Ceiling via Preference Dimensional Expansion

    cs.AI · 2026-05 · unverdicted · novelty 6.0

    MORA breaks the safety-helpfulness trade-off in LLM alignment by pre-sampling single-reward prompts and rewriting them to expand multi-dimensional reward diversity, yielding 5-12.4% single-preference gains in sequenti...

  10. PresentAgent-2: Towards Generalist Multimodal Presentation Agents

    cs.CV · 2026-05 · unverdicted · novelty 6.0

    PresentAgent-2 generates query-driven multimodal presentation videos with research grounding, supporting single-speaker, multi-speaker discussion, and interactive question-answering modes.

  11. Towards Generation-Efficient Uncertainty Estimation in Large Language Models

    cs.LG · 2026-05 · unverdicted · novelty 6.0

    Uncertainty estimation for LLM hallucinations can be done effectively with partial generations or input-only predictors, reducing the need for full autoregressive sampling.

  12. From History to State: Constant-Context Skill Learning for LLM Agents

    cs.AI · 2026-05 · unverdicted · novelty 6.0

    Constant-context skill learning trains reusable task-family modules for LLM agents using a deterministic state block for progress tracking and subgoal rewards, achieving 89.6% unseen success on ALFWorld, 76.8% on WebS...

  13. OmniRefine: Alignment-Aware Cooperative Compression for Efficient Omnimodal Large Language Models

    cs.AI · 2026-05 · unverdicted · novelty 5.0

    OmniRefine introduces alignment-aware chunk refinement via similarity and dynamic programming followed by modality-cooperative token compression, achieving near-baseline accuracy at 44% token retention on WorldSense.

  14. Towards Effective Theory of LLMs: A Representation Learning Approach

    cs.LG · 2026-05 · unverdicted · novelty 5.0

    RET learns temporally consistent macrovariables from LLM activations via self-supervised learning to support interpretability, early behavioral prediction, and causal intervention.

  15. Step-Audio-R1.5 Technical Report

    eess.AS · 2026-04 · unverdicted · novelty 4.0

    Step-Audio-R1.5 applies RLHF to audio reasoning models to maintain analytical performance while improving prosodic naturalness and immersion in extended spoken interactions.

Reference graph

Works this paper leans on

52 extracted references · 41 canonical work pages · cited by 14 Pith papers · 21 internal anchors
