Qwen3.5-Omni Technical Report
Pith reviewed 2026-05-10 08:07 UTC · model grok-4.3
The pith
Qwen3.5-Omni scales to hundreds of billions of parameters and reports state-of-the-art results on 215 audio-visual benchmarks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Qwen3.5-Omni scales to hundreds of billions of parameters with 256k context length and trains on heterogeneous text-vision pairs plus over 100 million hours of audio-visual content. The model employs a Hybrid Attention Mixture-of-Experts framework for its Thinker and Talker components to support efficient long-sequence inference. The plus variant achieves leading results on 215 audio and audio-visual subtasks and benchmarks. ARIA is introduced to dynamically align text and speech units, improving stability and prosody in conversational speech synthesis with limited extra latency. The model supports multilingual understanding and generation across 10 languages with emotional expression, shows
What carries the argument
The Hybrid Attention Mixture-of-Experts framework used for both the Thinker and Talker modules, together with the ARIA mechanism that dynamically aligns text and speech units.
If this is right
- The model can process and understand over 10 hours of audio and up to 400 seconds of 720P video at 1 FPS within a single context window.
- Streaming speech synthesis gains stability and natural prosody through dynamic alignment of text and speech units while keeping latency low.
- Multilingual speech understanding and generation extend to 10 languages while preserving human-like emotional tone.
- Audio-visual input yields script-level structured captions with exact temporal synchronization and automated scene segmentation.
- An emergent ability appears for performing coding tasks directly from audio-visual instructions.
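The first consequence in this list can be sanity-checked with simple arithmetic. The sketch below is a back-of-envelope budget, not a description of the model's tokenizers: it assumes "256k" means 256 × 1024 tokens and that a single modality fills the whole window, neither of which the report excerpt confirms. The function name `max_tokens_per_second` is hypothetical.

```python
# Back-of-envelope token budget implied by the reported context limits.
# Assumption: "256k" = 256 * 1024 tokens, and one modality uses the entire window.
CONTEXT_TOKENS = 256 * 1024


def max_tokens_per_second(duration_seconds: float,
                          context_tokens: int = CONTEXT_TOKENS) -> float:
    """Upper bound on tokens available per second of input of the given duration."""
    if duration_seconds <= 0:
        raise ValueError("duration_seconds must be positive")
    return context_tokens / duration_seconds


audio_seconds = 10 * 3600          # "over 10 hours" of audio
video_seconds = 400                # 400 seconds of 720P video
video_frames = video_seconds * 1   # sampled at 1 FPS

audio_rate = max_tokens_per_second(audio_seconds)
video_tokens_per_frame = CONTEXT_TOKENS / video_frames

print(f"audio: at most {audio_rate:.2f} tokens per second")
print(f"video: {video_frames} frames, at most {video_tokens_per_frame:.2f} tokens per frame")
```

Under these assumptions the audio claim implies a budget of roughly seven tokens per second of audio at most. That is an upper bound on the encoding rate, since text prompts and outputs also occupy the window.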
Where Pith is reading between the lines
- If the benchmark results prove robust, comparable scaling strategies could shorten the path to reliable real-time omnimodal assistants for education or accessibility applications.
- The reported Audio-Visual Vibe Coding capability suggests models may eventually accept creative multi-sensory prompts to generate or edit code without requiring separate text input.
- Long audio and video context windows open routes for single-pass analysis of full meetings, lectures, or films to produce summaries or insights.
- Training at the reported data volume raises the question of whether focused curation of high-quality audio-visual examples could achieve similar gains with reduced overall compute.
Load-bearing premise
The 215 subtasks and benchmarks provide a fair, comprehensive, and unbiased measure of omnimodal capability without post-hoc selection or undisclosed evaluation details.
What would settle it
Independent re-testing on a new collection of audio and audio-visual tasks outside the original 215 that shows no performance advantage over prior leading models.
Original abstract
In this work, we present Qwen3.5-Omni, the latest advancement in the Qwen-Omni model family. Representing a significant evolution over its predecessor, Qwen3.5-Omni scales to hundreds of billions of parameters and supports a 256k context length. By leveraging a massive dataset comprising heterogeneous text-vision pairs and over 100 million hours of audio-visual content, the model demonstrates robust omni-modality capabilities. Qwen3.5-Omni-plus achieves SOTA results across 215 audio and audio-visual understanding, reasoning, and interaction subtasks and benchmarks, surpassing Gemini-3.1 Pro in key audio tasks and matching it in comprehensive audio-visual understanding. Architecturally, Qwen3.5-Omni employs a Hybrid Attention Mixture-of-Experts (MoE) framework for both Thinker and Talker, enabling efficient long-sequence inference. The model facilitates sophisticated interaction, supporting over 10 hours of audio understanding and 400 seconds of 720P video (at 1 FPS). To address the inherent instability and unnaturalness in streaming speech synthesis, often caused by encoding efficiency discrepancies between text and speech tokenizers, we introduce ARIA. ARIA dynamically aligns text and speech units, significantly enhancing the stability and prosody of conversational speech with minimal latency impact. Furthermore, Qwen3.5-Omni expands linguistic boundaries, supporting multilingual understanding and speech generation across 10 languages with human-like emotional nuance. Finally, Qwen3.5-Omni exhibits superior audio-visual grounding capabilities, generating script-level structured captions with precise temporal synchronization and automated scene segmentation. Remarkably, we observed the emergence of a new capability in omnimodal models: directly performing coding based on audio-visual instructions, which we call Audio-Visual Vibe Coding.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents Qwen3.5-Omni, the latest in the Qwen-Omni model family. It scales to hundreds of billions of parameters with a 256k context length and is trained on heterogeneous text-vision pairs and over 100 million hours of audio-visual content. The model uses a Hybrid Attention Mixture-of-Experts framework and introduces ARIA to improve streaming speech synthesis stability. It claims SOTA results on 215 audio and audio-visual subtasks, surpassing Gemini-3.1 Pro in key areas, supports 10 languages, and demonstrates new capabilities like Audio-Visual Vibe Coding.
Significance. If the SOTA claims hold under rigorous scrutiny, this would mark a notable step forward in omnimodal models capable of long-context audio-visual reasoning and interaction. The ARIA innovation could influence future work on multimodal token alignment. The report also highlights potential emergent abilities, which are of interest to the community. However, without detailed benchmark results, the significance remains provisional.
major comments (3)
- The assertion of SOTA performance across 215 subtasks lacks any quantitative tables, per-subtask scores, baseline comparisons, or details on how these subtasks were chosen or evaluated, which is load-bearing for the central claim as noted in the stress-test.
- No error bars, statistical significance, or data exclusion criteria are mentioned for the comparisons to Gemini-3.1 Pro, undermining confidence in the reported superiority.
- The description of ARIA lacks specific algorithmic details, such as the alignment method or loss function, making it difficult to reproduce or evaluate its impact on prosody and latency.
minor comments (2)
- Specify the exact parameter count instead of 'hundreds of billions' for precision.
- Distinguish clearly between Qwen3.5-Omni and Qwen3.5-Omni-plus in the performance claims.
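The second major comment asks for error bars and significance on the head-to-head comparisons. One common way to supply them is a paired bootstrap over per-item scores, sketched below. This is an illustration of the kind of analysis requested, not the authors' evaluation protocol; the function name `paired_bootstrap`, the 95% interval, and the example data are all assumptions.

```python
import random


def paired_bootstrap(scores_a, scores_b, n_resamples=2000, seed=0):
    """Return (mean difference, 2.5th percentile, 97.5th percentile) for A minus B.

    scores_a and scores_b are per-item scores of two models on the SAME items.
    """
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("need two non-empty score lists of equal length")
    rng = random.Random(seed)
    n = len(scores_a)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = sum(diffs) / n

    # Resample items with replacement and recompute the mean difference each time.
    means = []
    for _ in range(n_resamples):
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()

    low = means[int(0.025 * n_resamples)]
    high = means[int(0.975 * n_resamples) - 1]
    return observed, low, high


# Hypothetical per-item correctness (1 = correct) for two models on ten items.
model_a = [1, 1, 0, 1, 1, 1, 0, 1, 1, 1]
model_b = [1, 0, 0, 1, 1, 0, 0, 1, 1, 1]
diff, low, high = paired_bootstrap(model_a, model_b)
print(f"mean difference {diff:.2f}, 95% interval [{low:.2f}, {high:.2f}]")
```

If the interval excludes zero, the observed advantage is unlikely to come from item sampling noise alone. Reporting such an interval for each subtask would address the comment directly.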
Simulated Author's Rebuttal
We thank the referee for their constructive feedback on the Qwen3.5-Omni technical report. We address each major comment below and have revised the manuscript accordingly to provide greater transparency and detail.
- Referee: The assertion of SOTA performance across 215 subtasks lacks any quantitative tables, per-subtask scores, baseline comparisons, or details on how these subtasks were chosen or evaluated, which is load-bearing for the central claim as noted in the stress-test.
  Authors: We agree that the high-level claim requires supporting quantitative evidence. In the revised manuscript we have added comprehensive tables in Section 4 and the appendix that report per-subtask scores, direct baseline comparisons (including Gemini-3.1 Pro), and explicit criteria for subtask selection and evaluation protocols.
  Revision: yes
- Referee: No error bars, statistical significance, or data exclusion criteria are mentioned for the comparisons to Gemini-3.1 Pro, undermining confidence in the reported superiority.
  Authors: We acknowledge the need for these statistical details. The revised version now includes error bars derived from multiple evaluation runs, reports p-values for key comparisons, and specifies the data exclusion criteria applied during benchmarking.
  Revision: yes
- Referee: The description of ARIA lacks specific algorithmic details, such as the alignment method or loss function, making it difficult to reproduce or evaluate its impact on prosody and latency.
  Authors: We have expanded the ARIA description to include the precise alignment mechanism (cross-modal dynamic unit alignment via learned projections), the composite loss function (alignment loss plus prosody consistency and latency regularization terms), and additional ablation results on prosody and latency.
  Revision: yes
Circularity Check
No significant circularity in claimed results
The paper is an empirical technical report describing model scaling, training data volume, architectural choices (Hybrid Attention MoE), and the introduction of ARIA for speech stability. Its central claims consist of reported benchmark performance (SOTA across 215 subtasks, comparisons to Gemini-3.1 Pro) and qualitative observations such as Audio-Visual Vibe Coding. No mathematical derivation chain, first-principles equations, or parameter-fitting steps are presented that reduce by construction to the inputs. The evaluation results are external to the training process and do not exhibit self-definition, fitted-input-as-prediction, or load-bearing self-citation loops. Standard empirical reporting of this type is self-contained against external benchmarks and receives a score of 0.
Axiom & Free-Parameter Ledger
invented entities (1)
- ARIA: no independent evidence
Forward citations
Cited by 15 Pith papers
- Omni-DeepSearch: A Benchmark for Audio-Driven Omni-Modal Deep Search
  Omni-DeepSearch is a 640-sample benchmark for audio-driven omni-modal search where the best model reaches only 43.44% accuracy, exposing bottlenecks in audio inference, tool use, and cross-modal reasoning.
- TraceAV-Bench: Benchmarking Multi-Hop Trajectory Reasoning over Long Audio-Visual Videos
  TraceAV-Bench is the first benchmark for multi-hop trajectory reasoning over long audio-visual videos, showing top models reach only 51-68% accuracy with substantial room for improvement.
- Overcoming Catastrophic Forgetting in Visual Continual Learning with Reinforcement Fine-Tuning
  RaPO reduces catastrophic forgetting in visual continual learning by shaping rewards around policy drift and stabilizing advantages with cross-task exponential moving averages during reinforcement fine-tuning of multi...
- Tracking the Truth: Object-Centric Spatio-Temporal Monitoring for Video Large Language Models
  STEMO-Bench evaluates intermediate spatio-temporal reasoning in video MLLMs via object-centric facts, and STEMO-Track improves consistency by chunk-wise trajectory construction and aggregation.
- FraudBench: A Multimodal Benchmark for Detecting AI-Generated Fraudulent Refund Evidence
  FraudBench shows that current multimodal LLMs and specialized AI-image detectors often fail to spot AI-generated fake damage in refund evidence, with true positive rates frequently below 50% on synthetic subsets while...
- Do Joint Audio-Video Generation Models Understand Physics?
  Current joint audio-video generation models lack robust physical commonsense, especially during transitions and when prompted for impossible behaviors.
- Hallo-Live: Real-Time Streaming Joint Audio-Video Avatar Generation with Asynchronous Dual-Stream and Human-Centric Preference Distillation
  Hallo-Live achieves 20.38 FPS real-time text-to-audio-video avatar generation with 0.94s latency using asynchronous dual-stream diffusion and HP-DMD preference distillation, matching teacher model quality at 16x highe...
- Explaining and Breaking the Safety-Helpfulness Ceiling via Preference Dimensional Expansion
  MORA breaks the safety-helpfulness ceiling in LLMs by pre-sampling single-reward prompts and rewriting them to incorporate multi-dimensional intents, delivering 5-12.4% gains in sequential alignment and 4.6% overall i...
- Explaining and Breaking the Safety-Helpfulness Ceiling via Preference Dimensional Expansion
  MORA breaks the safety-helpfulness trade-off in LLM alignment by pre-sampling single-reward prompts and rewriting them to expand multi-dimensional reward diversity, yielding 5-12.4% single-preference gains in sequenti...
- PresentAgent-2: Towards Generalist Multimodal Presentation Agents
  PresentAgent-2 generates query-driven multimodal presentation videos with research grounding, supporting single-speaker, multi-speaker discussion, and interactive question-answering modes.
- Towards Generation-Efficient Uncertainty Estimation in Large Language Models
  Uncertainty estimation for LLM hallucinations can be done effectively with partial generations or input-only predictors, reducing the need for full autoregressive sampling.
- From History to State: Constant-Context Skill Learning for LLM Agents
  Constant-context skill learning trains reusable task-family modules for LLM agents using a deterministic state block for progress tracking and subgoal rewards, achieving 89.6% unseen success on ALFWorld, 76.8% on WebS...
- OmniRefine: Alignment-Aware Cooperative Compression for Efficient Omnimodal Large Language Models
  OmniRefine introduces alignment-aware chunk refinement via similarity and dynamic programming followed by modality-cooperative token compression, achieving near-baseline accuracy at 44% token retention on WorldSense.
- Towards Effective Theory of LLMs: A Representation Learning Approach
  RET learns temporally consistent macrovariables from LLM activations via self-supervised learning to support interpretability, early behavioral prediction, and causal intervention.
- Step-Audio-R1.5 Technical Report
  Step-Audio-R1.5 applies RLHF to audio reasoning models to maintain analytical performance while improving prosodic naturalness and immersion in extended spoken interactions.