MiniCPM-o 4.5: Towards Real-Time Full-Duplex Omni-Modal Interaction
Pith reviewed 2026-05-07 09:20 UTC · model grok-4.3
The pith
A 9B model achieves real-time full-duplex omni-modal interaction by aligning vision, audio, and speech on one shared timeline.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
MiniCPM-o 4.5 performs real-time full-duplex omni-modal interaction. Its Omni-Flow framework places vision streams, audio input, and speech output on the same temporal axis so that perception and generation overlap instead of alternating. This lets the model observe a live scene, listen to speech, and produce its own speech at the same time while also generating proactive remarks drawn from ongoing scene analysis.
What carries the argument
Omni-Flow, a unified streaming framework that aligns omni-modal inputs and outputs along a shared temporal axis, converting turn-based interaction into continuous time-aligned processing.
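The paper does not publish Omni-Flow's implementation, but the core idea — treating every modality as events on one clock and merging them into a single time-ordered stream — can be sketched as follows (a toy model, not the paper's code; all names here are ours):

```python
from dataclasses import dataclass
import heapq

@dataclass(order=True)
class Event:
    t: float          # timestamp in seconds on the shared temporal axis
    modality: str     # "vision", "audio_in", or "speech_out"
    payload: str

def align_streams(*streams):
    """Merge already time-sorted per-modality streams into one timeline."""
    return list(heapq.merge(*streams))

vision = [Event(0.0, "vision", "frame0"), Event(0.5, "vision", "frame1")]
audio_in = [Event(0.2, "audio_in", "user: 'what is'"), Event(0.6, "audio_in", "'that?'")]
speech_out = [Event(0.4, "speech_out", "model: 'I see a'")]

timeline = align_streams(vision, audio_in, speech_out)
# The generation event at t=0.4 falls between input events at t=0.2 and
# t=0.5: on a shared axis, perceiving and speaking interleave rather than
# alternate in turns.
print([(e.t, e.modality) for e in timeline])
```

The point of the sketch is only the data layout: once inputs and outputs live in one ordered stream, "full duplex" is a property of the sequence rather than a separate control loop.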
If this is right
- The model issues reminders or comments based on continuous live-scene understanding.
- It approaches Gemini 2.5 Flash in vision-language performance with only 9B parameters.
- Real-time full-duplex operation runs on edge devices with under 12 GB RAM.
- Omni-modal understanding and speech quality exceed those of the larger Qwen3-Omni-30B-A3B while using far less compute.
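A back-of-the-envelope check makes the under-12 GB claim concrete (the arithmetic and the quantization assumptions are ours, not the paper's): weight memory alone for 9B parameters depends heavily on precision.

```python
PARAMS = 9e9  # total parameter count claimed in the abstract

def weight_gb(bits_per_param: int) -> float:
    """Weight memory in GB for PARAMS parameters at a given precision."""
    return PARAMS * bits_per_param / 8 / 1e9

for bits in (16, 8, 4):
    print(f"{bits}-bit: {weight_gb(bits):.1f} GB")
# 16-bit: 18.0 GB, 8-bit: 9.0 GB, 4-bit: 4.5 GB
```

Since 16-bit weights alone (~18 GB) would exceed the 12 GB budget, the claim implies quantized inference (8-bit or below) with the remaining headroom spent on the KV cache, activations, and modality encoders.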
Where Pith is reading between the lines
- Always-on devices could maintain environmental awareness without explicit wake words or repeated cloud round-trips.
- The same timeline alignment could support natural interruption handling in multi-speaker conversations.
- Extending the framework to new sensors would let assistants draw on richer ongoing context.
- Open release at this scale may speed development of responsive personal AI that stays aware of its surroundings.
Load-bearing premise
The shared temporal axis must, on its own, suffice to produce genuinely simultaneous perception and response, as well as proactive behavior, without auxiliary rules or hidden turn detection.
What would settle it
Run a test in which the model is already speaking when the visual scene changes or new audio arrives; if it revises its ongoing speech or inserts a relevant comment within one or two seconds, the central claim holds.
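That test can be phrased as a small harness: inject an event while generation is in flight and count output steps until the stream acknowledges it. The stub below stands in for a real full-duplex model; every name and timing here is a hypothetical sketch of the protocol, not an evaluation from the paper.

```python
class StubModel:
    """Toy full-duplex model: folds newly observed events into its output."""
    def __init__(self):
        self.pending = None

    def observe(self, event):   # may be called while generation is ongoing
        self.pending = event

    def next_chunk(self):       # called once per output step
        if self.pending:
            event, self.pending = self.pending, None
            return f"comment on {event}"
        return "…continuing utterance…"

def injection_latency(model, inject_at_step, max_steps=50, step_s=0.08):
    """Seconds from injecting an event to its first mention in the output."""
    for step in range(max_steps):
        if step == inject_at_step:
            model.observe("scene change")
        if "scene change" in model.next_chunk():
            return (step - inject_at_step) * step_s
    return float("inf")

latency = injection_latency(StubModel(), inject_at_step=10)
print(f"latency: {latency:.2f} s")
```

For the central claim to hold, a real model run through this protocol would need latencies under roughly one to two seconds; a turn-based system with hidden endpoint detection would instead only react after its current utterance completes.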
Original abstract
Recent progress in multimodal large language models (MLLMs) has brought AI capabilities from static offline data processing to real-time streaming interaction, yet they still remain far from human-level multimodal interaction. The key bottlenecks are no longer modality coverage or latency alone, but the interaction paradigm itself. First, perception and response are still separated into alternating phases, preventing models from incorporating new inputs for timely adjustment during generation. Second, most current models remain reactive, responding only to explicit user requests instead of acting proactively in the evolving multimodal environment. We present MiniCPM-o 4.5, our latest effort towards human-like multimodal interaction, which mitigates these gaps by real-time full-duplex omni-modal interaction. It can see, listen, and speak simultaneously in real-time, while also exhibiting proactive behaviors such as issuing reminders or comments based on its continuous understanding of the live scene. The key technique behind MiniCPM-o 4.5 is Omni-Flow, a unified streaming framework that aligns omni-modal inputs and outputs along a shared temporal axis. This formulation converts conventional turn-based interaction into a full-duplex, time-aligned process, enabling simultaneous perception and response and allowing proactive behavior to arise within the same framework. With a total of 9B parameters, MiniCPM-o 4.5 approaches Gemini 2.5 Flash in vision-language capabilities, delivering state-of-the-art open-source performance at its scale. It also surpasses Qwen3-Omni-30B-A3B in omni-modal understanding and delivers better speech generation, with significantly higher computation efficiency. Driven by its efficient architecture design and inference optimization, the model can perform real-time full-duplex omni-modal interaction on edge devices with less than 12GB RAM cost.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces MiniCPM-o 4.5, a 9B-parameter omni-modal model for real-time full-duplex interaction. It claims the model can simultaneously see, listen, and speak while exhibiting proactive behaviors (e.g., unprompted reminders based on live scene understanding) via the Omni-Flow unified streaming framework, which aligns omni-modal inputs/outputs on a shared temporal axis to convert turn-based interaction into full-duplex time-aligned processing. Additional claims include approaching Gemini 2.5 Flash in vision-language capabilities, surpassing Qwen3-Omni-30B-A3B in omni-modal understanding, better speech generation, and efficient edge-device operation under 12GB RAM.
Significance. If the empirical claims hold under rigorous testing, the work would advance the field by addressing core interaction-paradigm bottlenecks in MLLMs beyond mere latency or modality coverage. The parameter-efficient design and potential for proactive, simultaneous perception-response could influence practical real-time applications. The framework's derivation of full-duplex and proactive capabilities directly from temporal alignment (without auxiliary rules) would be a meaningful contribution if isolated and validated.
major comments (3)
- [Abstract] The central claims of real-time simultaneous perception-response and proactive behavior arising from Omni-Flow are presented without any reported latency measurements, live-stream evaluations, ablation studies on the temporal alignment mechanism, or error bars; standard sequential omni-modal benchmarks do not directly test whether new inputs during generation trigger adjustments or unprompted comments.
- [Abstract] The assertion that Omni-Flow 'converts conventional turn-based interaction into a full-duplex, time-aligned process' enabling proactivity 'within the same framework' lacks isolation from possible unstated post-training control logic or hidden turn-taking; the reported SOTA and Gemini-comparable scores appear drawn from conventional benchmarks that do not evaluate this specific capability.
- [Abstract] The performance comparison stating the model 'approaches Gemini 2.5 Flash in vision-language capabilities' and 'surpasses Qwen3-Omni-30B-A3B' is unsupported by any numerical scores, dataset details, or experimental protocol in the provided text, undermining the scale-efficiency claims.
minor comments (2)
- [Abstract] The total parameter count of 9B is stated without breakdown across components or confirmation of whether it includes all modality encoders/decoders.
- [Abstract] The phrase 'significantly higher computation efficiency' is used without reference to specific metrics (e.g., tokens per second or FLOPs) or comparison baselines.
Simulated Author's Rebuttal
We thank the referee for the constructive comments, which help clarify how the abstract can better convey the empirical support for our claims. We address each point below and will revise the manuscript to improve transparency and rigor.
Point-by-point responses
-
Referee: [Abstract] The central claims of real-time simultaneous perception-response and proactive behavior arising from Omni-Flow are presented without any reported latency measurements, live-stream evaluations, ablation studies on the temporal alignment mechanism, or error bars; standard sequential omni-modal benchmarks do not directly test whether new inputs during generation trigger adjustments or unprompted comments.
Authors: We agree the abstract is too concise on these aspects. The full manuscript provides latency measurements, dedicated live-stream evaluations of simultaneous perception and response, ablations isolating the temporal alignment mechanism, and error bars on all quantitative tables. Custom streaming protocols (beyond standard benchmarks) are used to test dynamic input handling and proactivity. We will revise the abstract to reference these elements and the evaluation protocol. revision: yes
-
Referee: [Abstract] The assertion that Omni-Flow 'converts conventional turn-based interaction into a full-duplex, time-aligned process' enabling proactivity 'within the same framework' lacks isolation from possible unstated post-training control logic or hidden turn-taking; the reported SOTA and Gemini-comparable scores appear drawn from conventional benchmarks that do not evaluate this specific capability.
Authors: Omni-Flow derives full-duplex and proactive behavior directly from continuous temporal alignment of inputs and outputs, without separate turn-taking rules or auxiliary control logic; this is formalized in the method section. The SOTA results incorporate both standard benchmarks and our streaming-specific evaluations. To strengthen isolation, we will add an explicit comparison to a turn-based baseline in the revised manuscript and clarify the benchmark types in the abstract. revision: partial
-
Referee: [Abstract] The performance comparison stating the model 'approaches Gemini 2.5 Flash in vision-language capabilities' and 'surpasses Qwen3-Omni-30B-A3B' is unsupported by any numerical scores, dataset details, or experimental protocol in the provided text, undermining the scale-efficiency claims.
Authors: The abstract summarizes high-level outcomes; the Experiments section contains the full numerical scores, dataset details, and protocol for the vision-language and omni-modal comparisons. We will update the abstract to include representative numerical values and dataset references drawn from those results. revision: yes
Circularity Check
No circularity in derivation chain
Full rationale
The paper presents MiniCPM-o 4.5 as an empirical system whose full-duplex omni-modal capabilities are attributed to the Omni-Flow framework, which is described as aligning inputs and outputs on a shared temporal axis. No equations, derivations, or first-principles reductions appear in the provided text that would make any claimed capability (simultaneous perception-response or proactive behavior) equivalent to its own inputs by construction. Performance claims are tied to benchmark results and architecture efficiency rather than self-referential definitions or fitted parameters renamed as predictions. The central claims therefore remain self-contained against external evaluation.
Axiom & Free-Parameter Ledger
invented entities (1)
-
Omni-Flow
no independent evidence
Reference graph
Works this paper leans on
-
[1]
MiniCPM-V: A GPT-4V Level MLLM on Your Phone
Yuan Yao, Tianyu Yu, Ao Zhang, Chongyi Wang, Junbo Cui, Hongji Zhu, Tianchi Cai, Haoyu Li, Weilin Zhao, Zhihui He, et al. MiniCPM-V: A GPT-4V Level MLLM on Your Phone.ArXiv preprint, abs/2408.01800, 2024
-
[2]
Minicpm-v 4.5: Cooking efficient mllms via architecture, data, and training recipe, 2025
Tianyu Yu, Zefan Wang, Chongyi Wang, Fuwei Huang, Wenshuo Ma, Zhihui He, Tianchi Cai, Weize Chen, Yuxiang Huang, Yuanqian Zhao, Bokai Xu, Junbo Cui, Yingjing Xu, Liqing Ruan, Luoyuan Zhang, Hanyu Liu, Jingkun Tang, Hongyuan Liu, Qining Guo, Wenhao Hu, Bingxiang He, Jie Zhou, Jie Cai, Ji Qi, Zonghao Guo, Chi Chen, Guoyang Zeng, Yuxuan Li, Ganqu Cui, Ning D...
2025
-
[3]
Qwen2.5-VL Technical Report, 2025
Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, Humen Zhong, Yuanzhi Zhu, Mingkun Yang, Zhaohai Li, Jianqiang Wan, Pengfei Wang, Wei Ding, Zheren Fu, Yiheng Xu, Jiabo Ye, Xi Zhang, Tianbao Xie, Zesen Cheng, Hang Zhang, Zhibo Yang, Haiyang Xu, and Junyang Lin. Qwen2.5-VL Technical Report, 2025
2025
-
[4]
Qwen3-vl technical report, 2025
Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, Wenbin Ge, Zhifang Guo, Qidong Huang, Jie Huang, Fei Huang, Binyuan Hui, Shutong Jiang, Zhaohai Li, Mingsheng Li, Mei Li, Kaixin Li, Zicheng Lin, Junyang Lin, Xuejing Liu, Jiawei Liu, Chenglong Liu, Yang Liu, Dayiheng Liu, Shixuan ...
2025
-
[5]
LLaVA-UHD: an LMM perceiving any aspect ratio and high-resolution images
Zonghao Guo, Ruyi Xu, Yuan Yao, Junbo Cui, Zanlin Ni, Chunjiang Ge, Tat-Seng Chua, Zhiyuan Liu, and Gao Huang. LLaVA-UHD: an LMM perceiving any aspect ratio and high-resolution images. In European Conference on Computer Vision, pages 390–406. Springer, 2024
2024
-
[6]
Sigmoid loss for language image pre-training
Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, and Lucas Beyer. Sigmoid loss for language image pre-training. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 11975–11986, October 2023
2023
-
[7]
Qwen3-omni technical report, 2025
Jin Xu, Zhifang Guo, Hangrui Hu, Yunfei Chu, Xiong Wang, Jinzheng He, Yuxuan Wang, Xian Shi, Ting He, Xinfa Zhu, Yuanjun Lv, Yongqi Wang, Dake Guo, He Wang, Linhan Ma, Pei Zhang, Xinyu Zhang, Hongkun Hao, Zishan Guo, Baosong Yang, Bin Zhang, Ziyang Ma, Xipin Wei, Shuai Bai, Keqin Chen, Xuejing Liu, Peng Wang, Mingkun Yang, Dayiheng Liu, Xingzhang Ren, Bo ...
2025
-
[8]
Robust speech recognition via large-scale weak supervision
Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, and Ilya Sutskever. Robust speech recognition via large-scale weak supervision. In Andreas Krause, Emma Brunskill, Kyunghyun Cho, Barbara Engelhardt, Sivan Sabato, and Jonathan Scarlett, editors, Proceedings of the 40th International Conference on Machine Learning, volume 202 of Pro...
2023
-
[9]
Wenet: Production oriented streaming and non-streaming end-to-end speech recognition toolkit
Zhuoyuan Yao, Di Wu 0061, Xiong Wang, Binbin Zhang, Fan Yu, Chao Yang, Zhendong Peng, Xiaoyu Chen, Lei Xie, and Xin Lei. Wenet: Production oriented streaming and non-streaming end-to-end speech recognition toolkit. Ininterspeech, volume 2021, pages 4054–4058, 2021
2021
-
[10]
Qwen3 Technical Report, 2025
Qwen Team. Qwen3 Technical Report, 2025
2025
-
[11]
Mini-omni: Language models can hear, talk while thinking in streaming, 2024
Zhifei Xie and Changqiao Wu. Mini-omni: Language models can hear, talk while thinking in streaming, 2024
2024
-
[12]
Step-audio 2 technical report, 2025
Boyong Wu, Chao Yan, Chen Hu, Cheng Yi, Chengli Feng, Fei Tian, Feiyu Shen, Gang Yu, Haoyang Zhang, Jingbei Li, Mingrui Chen, Peng Liu, Wang You, Xiangyu Tony Zhang, Xingyuan Li, Xuerui Yang, Yayue Deng, Yechang Huang, Yuxin Li, Yuxin Zhang, Zhao You, Brian Li, Changyi Wan, Hanpeng Hu, Jiangjie Zhen, Siyu Chen, Song Yuan, Xuelin Zhang, Yimin Jiang, Yu Zho...
2025
-
[13]
Analyzing mitigation strategies for catastrophic forgetting in end-to-end training of spoken language models
Chi-Yuan Hsiao, Ke-Han Lu, Kai-Wei Chang, Chih-Kai Yang, Wei-Chih Chen, and Hung-yi Lee. Analyzing mitigation strategies for catastrophic forgetting in end-to-end training of spoken language models. arXiv preprint arXiv:2505.17496, 2025
-
[14]
Qwen2.5-omni technical report, 2025
Jin Xu, Zhifang Guo, Jinzheng He, Hangrui Hu, Ting He, Shuai Bai, Keqin Chen, Jialin Wang, Yang Fan, Kai Dang, Bin Zhang, Xiong Wang, Yunfei Chu, and Junyang Lin. Qwen2.5-omni technical report, 2025
2025
-
[15]
Cosyvoice: A scalable multilingual zero-shot text-to-speech synthesizer based on supervised semantic tokens, 2024
Zhihao Du, Qian Chen, Shiliang Zhang, Kai Hu, Heng Lu, Yexin Yang, Hangrui Hu, Siqi Zheng, Yue Gu, Ziyang Ma, Zhifu Gao, and Zhijie Yan. Cosyvoice: A scalable multilingual zero-shot text-to-speech synthesizer based on supervised semantic tokens, 2024
2024
-
[16]
Cosyvoice 2: Scalable streaming speech synthesis with large language models, 2024
Zhihao Du, Yuxuan Wang, Qian Chen, Xian Shi, Xiang Lv, Tianyu Zhao, Zhifu Gao, Yexin Yang, Changfeng Gao, Hui Wang, Fan Yu, Huadai Liu, Zhengyan Sheng, Yue Gu, Chong Deng, Wen Wang, Shiliang Zhang, Zhijie Yan, and Jingren Zhou. Cosyvoice 2: Scalable streaming speech synthesis with large language models, 2024
2024
-
[17]
A statistical model-based voice activity detection
Jongseo Sohn, Nam Soo Kim, and Wonyong Sung. A statistical model-based voice activity detection. IEEE Signal Processing Letters, 6(1):1–3, 1999
1999
-
[18]
Jin Xu, Zhifang Guo, Hangrui Hu, Yunfei Chu, Xiong Wang, Jinzheng He, Yuxuan Wang, Xian Shi, Ting He, Xinfa Zhu, et al. Qwen3-omni technical report.arXiv preprint arXiv:2509.17765, 2025
-
[19]
Silero VAD: pre-trained enterprise-grade voice activity detector (VAD), number detector and language classifier
Silero Team. Silero VAD: pre-trained enterprise-grade voice activity detector (VAD), number detector and language classifier. https://github.com/snakers4/silero-vad, 2024
2024
-
[20]
Robust speech recognition via large-scale weak supervision, 2022
Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, and Ilya Sutskever. Robust speech recognition via large-scale weak supervision, 2022
2022
-
[21]
Paraformer: Fast and accurate parallel transformer for non-autoregressive end-to-end speech recognition, 2023
Zhifu Gao, Shiliang Zhang, Ian McLoughlin, and Zhijie Yan. Paraformer: Fast and accurate parallel transformer for non-autoregressive end-to-end speech recognition, 2023
2023
-
[22]
Leveraging self-supervised learning for speaker diarization, 2024
Jiangyu Han, Federico Landini, Johan Rohdin, Anna Silnova, Mireia Diez, and Lukas Burget. Leveraging self-supervised learning for speaker diarization, 2024
2024
-
[23]
Music source separation in the waveform domain, 2021
Alexandre Défossez, Nicolas Usunier, Léon Bottou, and Francis Bach. Music source separation in the waveform domain, 2021
2021
-
[24]
CapsFusion: Rethinking Image-Text Data at Scale
Qiying Yu, Quan Sun, Xiaosong Zhang, Yufeng Cui, Fan Zhang, Yue Cao, Xinlong Wang, and Jingjing Liu. CapsFusion: Rethinking Image-Text Data at Scale. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2024, Seattle, WA, USA, June 16-22, 2024, pages 14022–14032. IEEE, 2024
2024
-
[25]
Minicpm4: Ultra-efficient llms on end devices
MiniCPM Team, Chaojun Xiao, Yuxuan Li, Xu Han, Yuzhuo Bai, Jie Cai, Haotian Chen, Wentong Chen, Xin Cong, Ganqu Cui, et al. Minicpm4: Ultra-efficient llms on end devices. arXiv preprint arXiv:2506.07900, 2025
-
[26]
Paddleocr 3.0 technical report, 2025
Cheng Cui, Ting Sun, Manhui Lin, Tingquan Gao, Yubo Zhang, Jiaxuan Liu, Xueqing Wang, Zelun Zhang, Changda Zhou, Hongen Liu, Yue Zhang, Wenyu Lv, Kui Huang, Yichao Zhang, Jing Zhang, Jun Zhang, Yi Liu, Dianhai Yu, and Yanjun Ma. Paddleocr 3.0 technical report, 2025
2025
-
[27]
Livecc: Learning video llm with streaming speech transcription at scale, 2025
Joya Chen, Ziyun Zeng, Yiqi Lin, Wei Li, Zejun Ma, and Mike Zheng Shou. Livecc: Learning video llm with streaming speech transcription at scale, 2025
2025
-
[28]
DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models
Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Y Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models.ArXiv preprint, abs/2402.03300, 2024
-
[29]
Shudong Liu, Hongwei Liu, Junnan Liu, Linchen Xiao, Songyang Gao, Chengqi Lyu, Yuzhe Gu, Wenwei Zhang, Derek F Wong, Songyang Zhang, and Kai Chen. Compassverifier: A unified and robust verifier for llms evaluation and outcome reward.arXiv preprint arXiv:2508.03686, 2025
-
[30]
Kimi k1.5: Scaling Reinforcement Learning with LLMs
Kimi Team, Angang Du, Bofei Gao, Bowei Xing, Changjiu Jiang, Cheng Chen, Cheng Li, Chenjun Xiao, Chenzhuang Du, Chonghua Liao, et al. Kimi k1.5: Scaling reinforcement learning with llms.ArXiv preprint, abs/2501.12599, 2025
-
[31]
RLAIF-V: Open-Source AI Feedback Leads to Super GPT-4V Trustworthiness, 2024
Tianyu Yu, Haoye Zhang, Qiming Li, Qixin Xu, Yuan Yao, Da Chen, Xiaoman Lu, Ganqu Cui, Yunkai Dang, Taiwen He, Xiaocheng Feng, Jun Song, Bo Zheng, Zhiyuan Liu, Tat-Seng Chua, and Maosong Sun. RLAIF-V: Open-Source AI Feedback Leads to Super GPT-4V Trustworthiness, 2024
2024
-
[32]
OpenCompass: A Universal Evaluation Platform for Foundation Models
OpenCompass Contributors. OpenCompass: A Universal Evaluation Platform for Foundation Models. https://github.com/open-compass/opencompass, 2023
2023
-
[33]
MMBench: Is your multi-modal model an all-around player?
Yuan Liu, Haodong Duan, Yuanhan Zhang, Bo Li, Songyang Zhang, Wangbo Zhao, Yike Yuan, Jiaqi Wang, Conghui He, Ziwei Liu, et al. MMBench: Is your multi-modal model an all-around player? In European Conference on Computer Vision, pages 216–233. Springer, 2024
2024
-
[34]
MM-Vet: Evaluating Large Multimodal Models for Integrated Capabilities
Weihao Yu, Zhengyuan Yang, Linjie Li, Jianfeng Wang, Kevin Lin, Zicheng Liu, Xinchao Wang, and Lijuan Wang. MM-Vet: Evaluating Large Multimodal Models for Integrated Capabilities. InForty-first International Conference on Machine Learning, ICML 2024, Vienna, Austria, July 21-27, 2024. OpenReview.net, 2024
2024
-
[35]
Are We on the Right Way for Evaluating Large Vision-Language Models?
Lin Chen, Jinsong Li, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Zehui Chen, Haodong Duan, Jiaqi Wang, Yu Qiao, Dahua Lin, and Feng Zhao. Are We on the Right Way for Evaluating Large Vision-Language Models? In Amir Globersons, Lester Mackey, Danielle Belgrave, Angela Fan, Ulrich Paquet, Jakub M. Tomczak, and Cheng Zhang, editors, Advances in Neural Information P...
2024
-
[36]
MMMU: A massive multi-discipline multimodal understanding and reasoning benchmark for expert AGI
Xiang Yue, Yuansheng Ni, Tianyu Zheng, Kai Zhang, Ruoqi Liu, Ge Zhang, Samuel Stevens, Dongfu Jiang, Weiming Ren, Yuxuan Sun, Cong Wei, Botao Yu, Ruibin Yuan, Renliang Sun, Ming Yin, Boyuan Zheng, Zhenzhu Yang, Yibo Liu, Wenhao Huang, Huan Sun, Yu Su, and Wenhu Chen. MMMU: A massive multi-discipline multimodal understanding and reasoning benchmark for e...
2024
-
[37]
Mathvista: Evaluating mathematical reasoning of foundation models in visual contexts
Pan Lu, Hritik Bansal, Tony Xia, Jiacheng Liu, Chunyuan Li, Hannaneh Hajishirzi, Hao Cheng, Kai-Wei Chang, Michel Galley, and Jianfeng Gao. Mathvista: Evaluating mathematical reasoning of foundation models in visual contexts. InProc. of ICLR. OpenReview.net, 2024
2024
-
[38]
A Diagram is Worth a Dozen Images
Aniruddha Kembhavi, Michael Salvato, Eric Kolve, Minjoon Seo, Hannaneh Hajishirzi, and Ali Farhadi. A Diagram is Worth a Dozen Images. In European Conference on Computer Vision (ECCV), 2016
2016
-
[39]
Mmt-bench: A comprehensive multimodal benchmark for evaluating large vision-language models towards multitask AGI
Kaining Ying, Fanqing Meng, Jin Wang, Zhiqian Li, Han Lin, Yue Yang, Hao Zhang, Wenbo Zhang, Yuqi Lin, Shuo Liu, Jiayi Lei, Quanfeng Lu, Runjian Chen, Peng Xu, Renrui Zhang, Haozhe Zhang, Peng Gao, Yali Wang, Yu Qiao, Ping Luo, Kaipeng Zhang, and Wenqi Shao. Mmt-bench: A comprehensive multimodal benchmark for evaluating large vision-language models toward...
2024
-
[40]
MM-IFEngine: Towards multimodal instruction following
Shengyuan Ding, Shenxi Wu, Xiangyu Zhao, Yuhang Zang, Haodong Duan, Xiaoyi Dong, Pan Zhang, Yuhang Cao, Dahua Lin, and Jiaqi Wang. MM-IFEngine: Towards multimodal instruction following. ArXiv preprint, abs/2504.07957, 2025
-
[41]
OCRBench: On the hidden mystery of OCR in large multimodal models
Yuliang Liu, Zhang Li, Hongliang Li, Wenwen Yu, Mingxin Huang, Dezhi Peng, Mingyu Liu, Mingrui Chen, Chunyuan Li, Lianwen Jin, and Xiang Bai. OCRBench: On the hidden mystery of OCR in large multimodal models. Science China Information Sciences, 2024
2024
-
[42]
TextVQA: Towards VQA requiring reasoning about text
Amanpreet Singh, Vivek Natarajan, Meet Shah, Yu Jiang, Xinlei Chen, Dhruv Batra, Devi Parikh, and Marcus Rohrbach. TextVQA: Towards VQA requiring reasoning about text. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019
2019
-
[43]
DocVQA: A dataset for VQA on document images
Minesh Mathew, Dimosthenis Karatzas, R. Manmatha, and C. V. Jawahar. DocVQA: A dataset for VQA on document images. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2021
2021
-
[44]
OmniDocBench: Benchmarking Diverse PDF Document Parsing with Comprehensive Annotations, 2024
Linke Ouyang, Yuan Qu, Hongbin Zhou, Jiawei Zhu, Rui Zhang, Qunshu Lin, Bin Wang, Zhiyuan Zhao, Man Jiang, Xiaomeng Zhao, Jin Shi, Fan Wu, Pei Chu, Minghao Liu, Zhenxiang Li, Chao Xu, Bo Zhang, Botian Shi, Zhongying Tu, and Conghui He. OmniDocBench: Benchmarking Diverse PDF Document Parsing with Comprehensive Annotations, 2024
2024
-
[45]
Mantis: Interleaved multi-image instruction tuning
Dongfu Jiang, Xuan He, Huaye Zeng, Cong Wei, Max Ku, Qian Liu, and Wenhu Chen. Mantis: Interleaved multi-image instruction tuning.ArXiv preprint, abs/2405.01483, 2024
-
[46]
Fei Wang, Xingyu Fu, James Y Huang, Zekun Li, Qin Liu, Xiaogeng Liu, Mingyu Derek Ma, Nan Xu, Wenxuan Zhou, Kai Zhang, et al. Muirbench: A comprehensive benchmark for robust multi-image understanding.arXiv preprint arXiv:2406.09411, 2024
-
[47]
MMSI-Bench: A benchmark for multi-image spatial intelligence
Sihan Yang, Runsen Xu, Yiman Xie, Sizhe Yang, Mo Li, Jingli Lin, Chenming Zhu, Xiaochen Chen, Haodong Duan, Xiangyu Yue, et al. MMSI-Bench: A benchmark for multi-image spatial intelligence. arXiv preprint arXiv:2505.23764, 2025
-
[48]
HallusionBench: An advanced diagnostic suite for entangled language hallucination and visual illusion in large vision-language models
Tianrui Guan, Fuxiao Liu, Xiyang Wu, Ruiqi Xian, Zongxia Li, Xiaoyu Liu, Xijun Wang, Lichang Chen, Furong Huang, Yaser Yacoob, Dinesh Manocha, and Tianyi Zhou. HallusionBench: An advanced diagnostic suite for entangled language hallucination and visual illusion in large vision-language models. In IEEE/CVF Conference on Computer Vision and Pattern Recogni...
2024
-
[49]
Aligning large multimodal models with factually augmented RLHF
Zhiqing Sun, Sheng Shen, Shengcao Cao, Haotian Liu, Chunyuan Li, Yikang Shen, Chuang Gan, Liang-Yan Gui, Yu-Xiong Wang, Yiming Yang, et al. Aligning large multimodal models with factually augmented RLHF. ArXiv preprint, abs/2309.14525, 2023
-
[50]
Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis
Chaoyou Fu, Yuhan Dai, Yondong Luo, Lei Li, Shuhuai Ren, Renrui Zhang, Zihan Wang, Chenyu Zhou, Yunhang Shen, Mengdan Zhang, et al. Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis. 2025
2025
-
[51]
LVBench: An Extreme Long Video Understanding Benchmark
Weihan Wang, Zehai He, Wenyi Hong, Yean Cheng, Xiaohan Zhang, Ji Qi, Shiyu Huang, Bin Xu, Yuxiao Dong, Ming Ding, and Jie Tang. LVBench: An Extreme Long Video Understanding Benchmark. ArXiv preprint, abs/2406.08035, 2024
-
[52]
Mlvu: Benchmarking multi-task long video understanding
Junjie Zhou, Yan Shu, Bo Zhao, Boya Wu, Zhengyang Liang, Shitao Xiao, Minghao Qin, Xi Yang, Yongping Xiong, Bo Zhang, et al. Mlvu: Benchmarking multi-task long video understanding. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 13691–13701, 2025
2025
-
[53]
LongVideoBench: A benchmark for long-context interleaved video-language understanding
Haoning Wu, Dongxu Li, Bei Chen, and Junnan Li. LongVideoBench: A benchmark for long-context interleaved video-language understanding. In Amir Globersons, Lester Mackey, Danielle Belgrave, Angela Fan, Ulrich Paquet, Jakub M. Tomczak, and Cheng Zhang, editors, Advances in Neural Information Pro...
2024
-
[54]
MotionBench: Benchmarking and Improving Fine-grained Video Motion Understanding for Vision Language Models, 2024
Wenyi Hong*, Yean Cheng*, Zhuoyi Yang*, Weihan Wang, Lefan Wang, Xiaotao Gu, Shiyu Huang, Yuxiao Dong, and Jie Tang. MotionBench: Benchmarking and Improving Fine-grained Video Motion Understanding for Vision Language Models, 2024
2024
-
[55]
Aishell-1: An open-source mandarin speech corpus and a speech recognition baseline
Hui Bu, Jiatong Du, Xingyu Na, Bengu Wu, and Hao Zheng. Aishell-1: An open-source mandarin speech corpus and a speech recognition baseline. In2017 20th Conference of the Oriental Chapter of the International Coordinating Committee on Speech Databases and Speech I/O Systems and Assessment (O-COCOSDA), pages 1–5. IEEE, 2017
2017
-
[56]
AISHELL-2: Transforming Mandarin ASR research into industrial scale
Jiatong Du, Xingyu Na, Xuechen Liu, and Hui Bu. AISHELL-2: Transforming Mandarin ASR research into industrial scale. arXiv preprint arXiv:1808.10583, 2018
-
[57]
Wenetspeech: A 10000+ hours multi-domain mandarin corpus for speech recognition
Binbin Zhang, Hang Lv, Haowen Guo, et al. Wenetspeech: A 10000+ hours multi-domain mandarin corpus for speech recognition. InICASSP, pages 6182–6186. IEEE, 2022
2022
-
[58]
Librispeech: An ASR corpus based on public domain audio books
Vassil Panayotov, Guoguo Chen, Daniel Povey, and Sanjeev Khudanpur. Librispeech: An ASR corpus based on public domain audio books. InICASSP, pages 5206–5210. IEEE, 2015
2015
-
[59]
Gigaspeech: An evolving, multi-domain ASR corpus with 10,000 hours of transcribed audio
Guoguo Chen, Wei Chai, Jiatong Wang, et al. Gigaspeech: An evolving, multi-domain ASR corpus with 10,000 hours of transcribed audio. InInterspeech, pages 3670–3674, 2021
2021
-
[60]
VoxPopuli: A large-scale multilingual speech corpus for representation learning, semi-supervised learning and interpretation
Changhan Wang, Morgane Riviere, Ann Lee, Anne Wu, Chaitanya Talnikar, Daniel Haziza, Mary Williamson, Juan Pino, and Emmanuel Dupoux. VoxPopuli: A large-scale multilingual speech corpus for representation learning, semi-supervised learning and interpretation. In ACL-IJCNLP, pages 993–1003, 2021
2021
-
[61]
CoVoST 2 and massively multilingual speech-to-text translation
Changhan Wang, Yun Tang, Xutai Ma, Anne Wu, Dmytro Okhonko, and Juan Pino. CoVoST 2 and massively multilingual speech-to-text translation. arXiv preprint arXiv:2007.10310, 2020
-
[62]
MELD: A multimodal multi-party dataset for emotion recognition in conversations
Soujanya Poria, Devamanyu Hazarika, Navonil Majumder, Gautam Naik, Erik Cambria, and Rada Mihalcea. MELD: A multimodal multi-party dataset for emotion recognition in conversations. In ACL, pages 527–536, 2019
2019
-
[63]
VoiceBench: Benchmarking LLM-based voice assistants
Yiming Chen, Xianghu Yue, Chen Zhang, Xiaoxue Gao, Robby T. Tan, and Haizhou Li. VoiceBench: Benchmarking LLM-based voice assistants. arXiv preprint arXiv:2410.17196, 2024
-
[64]
Triviaqa: A large scale distantly supervised challenge dataset for reading comprehension
Mandar Joshi, Eunsol Choi, Daniel Weld, and Luke Zettlemoyer. Triviaqa: A large scale distantly supervised challenge dataset for reading comprehension. InACL, pages 1601–1611, 2017
2017
-
[65]
Semantic parsing on freebase from question-answer pairs
Jonathan Berant, Andrew Chou, Roy Frostig, and Percy Liang. Semantic parsing on freebase from question-answer pairs. InEMNLP, pages 1533–1544, 2013
2013
-
[66]
CMMLU: Measuring Massive Multitask Language Understanding in Chinese
Haoran Li et al. CMMLU: Measuring massive multitask language understanding in Chinese. arXiv preprint arXiv:2306.09212, 2023
-
[67]
Philip Anastassiou, Jiawei Chen, Jitong Chen, Yuanzhe Chen, Zhuo Chen, Ziyi Chen, Jian Cong, Lelai Deng, Chuang Ding, Lu Gao, Mingqing Gong, Peisong Huang, Qingqing Huang, Zhiying Huang, Yuanyuan Huo, Dongya Jia, Chumin Li, Feiya Li, Hui Li, Jiaxin Li, Xiaoyang Li, Xingxing Li, Lin Liu, Shouda Liu, Sichao Liu, Xudong Liu, Yuchen Liu, Zhengxi Liu, Lu Lu, et al. Seed-TTS: A family of high-quality versatile speech generation models, 2024.
[68]
Chengyao Wang, Zhisheng Zhong, Bohao Peng, Senqiao Yang, Yuqi Liu, Haokun Gui, Bin Xia, Jingyao Li, Bei Yu, and Jiaya Jia. MGM-Omni: Scaling omni LLMs to personalized long-horizon speech. arXiv preprint arXiv:2509.25131, 2025.
[69]
Tu Anh Nguyen, Wei-Ning Hsu, Antony D’Avirro, Bowen Shi, Itai Gat, Maryam Fazel-Zarani, Tal Remez, Jade Copet, Gabriel Synnaeve, Michael Hassid, Felix Kreuk, Yossi Adi, and Emmanuel Dupoux. Expresso: A benchmark and analysis of discrete expressive speech resynthesis. In Interspeech, pages 4823–4827, 2023.
[70]
Kun Zhou, Berrak Sisman, Rui Liu, and Haizhou Li. Emotional speech dataset (ESD): A multi-style emotional speech dataset for speech synthesis and voice conversion. In Interspeech, pages 3361–3365, 2021.
[71]
Jeffrey Zhou, Tianjian Lu, Swaroop Mishra, Siddhartha Brahma, Sujoy Basu, Yi Luan, Denny Zhou, and Le Hou. Instruction-following evaluation for large language models. arXiv preprint arXiv:2311.07911, 2023.
[72]
Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. In ICLR, 2021.
[73]
Haonan Li, Yixuan Zhang, Fajri Koto, Yifei Yang, Hai Zhao, Yeyun Gong, Nan Duan, and Timothy Baldwin. CMMLU: Measuring massive multitask language understanding in Chinese. In Findings of the Association for Computational Linguistics: ACL 2024, pages 11260–11285, 2024.
[74]
Mirac Suzgun, Nathan Scales, Nathanael Schärli, Sebastian Gehrmann, Yi Tay, Hyung Won Chung, Aakanksha Chowdhery, Quoc Le, Ed Chi, Denny Zhou, et al. Challenging big-bench tasks and whether chain-of-thought can solve them. In Findings of the Association for Computational Linguistics: ACL 2023, pages 13003–13051, 2023.
[75]
Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the MATH dataset. arXiv preprint arXiv:2103.03874, 2021.
[76]
Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021.
[77]
Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde De Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374, 2021.
[78]
Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, and Charles Sutton. Program synthesis with large language models. arXiv preprint arXiv:2108.07732, 2021.
[79]
Ziwei Zhou, Rui Wang, Zuxuan Wu, and Yu-Gang Jiang. Daily-Omni: Towards audio-visual reasoning with temporal alignment across modalities. arXiv preprint arXiv:2505.17862, 2025.
[80]
Jack Hong, Shilin Yan, Jiayin Cai, Xiaolong Jiang, Yao Hu, and Weidi Xie. WorldSense: Evaluating real-world omnimodal understanding for multimodal LLMs. arXiv preprint arXiv:2502.04326, 2025.