pith. machine review for the scientific record.

arxiv: 2604.27393 · v1 · submitted 2026-04-30 · 💻 cs.CL

Recognition: unknown

MiniCPM-o 4.5: Towards Real-Time Full-Duplex Omni-Modal Interaction

Authors on Pith: no claims yet

Pith reviewed 2026-05-07 09:20 UTC · model grok-4.3

classification 💻 cs.CL
keywords omni-modal interaction · full-duplex · real-time streaming · multimodal LLM · Omni-Flow · proactive behavior · edge deployment · vision-language

The pith

A 9B model achieves real-time full-duplex omni-modal interaction by aligning vision, audio, and speech on one shared timeline.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tries to establish that conventional turn-based multimodal systems can be replaced by a continuous process in which a model perceives new visual and audio input while it is already generating speech. The authors claim this matters because it removes the separation between listening and responding, letting the model adjust its output on the fly and also start speaking without an explicit prompt. They implement the change through a single framework that places every modality on the same clock, then show the resulting 9B model running on edge hardware while matching much larger systems on vision-language tasks.

Core claim

MiniCPM-o 4.5 performs real-time full-duplex omni-modal interaction. Its Omni-Flow framework places vision streams, audio input, and speech output on the same temporal axis so that perception and generation overlap instead of alternating. This lets the model observe a live scene, listen to speech, and produce its own speech at the same time while also generating proactive remarks drawn from ongoing scene analysis.

What carries the argument

Omni-Flow, a unified streaming framework that aligns omni-modal inputs and outputs along a shared temporal axis, converting turn-based interaction into continuous time-aligned processing.
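Concretely, "a shared temporal axis" can be pictured as a single time-ordered event stream into which every modality is merged. The sketch below is a minimal illustration of that framing, not the authors' implementation; the Event type, tick rates, and modality names are all assumptions:

```python
import heapq
from dataclasses import dataclass, field

@dataclass(order=True)
class Event:
    t_ms: int                               # position on the shared timeline (ms)
    modality: str = field(compare=False)    # "video" | "audio_in" | "speech_out"
    payload: object = field(compare=False)  # frame, audio chunk, or output token

def shared_timeline(*streams):
    """Merge per-modality event streams into one time-ordered stream.

    A turn-based system consumes a whole user turn before generating;
    here events are interleaved by timestamp, so an input that arrives
    while the model is emitting speech is seen immediately.
    """
    yield from heapq.merge(*streams, key=lambda e: e.t_ms)

# Hypothetical rates: a video frame every 100 ms, input-audio chunks every
# 80 ms, and the model's own speech tokens every 40 ms, all on one clock.
video  = (Event(t, "video", None)      for t in range(0, 1000, 100))
audio  = (Event(t, "audio_in", None)   for t in range(0, 1000, 80))
speech = (Event(t, "speech_out", None) for t in range(200, 1000, 40))

for ev in shared_timeline(video, audio, speech):
    print(f"{ev.t_ms:4d} ms  {ev.modality}")
```

Under this framing, "perceive while speaking" stops being a special mode: speech_out events simply interleave with incoming video and audio events on the same clock.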

If this is right

  • The model issues reminders or comments based on continuous live-scene understanding.
  • It reaches near Gemini 2.5 Flash vision-language performance with only 9B parameters.
  • Real-time full-duplex operation runs on edge devices with under 12 GB RAM (a rough feasibility check follows this list).
  • Omni-modal understanding and speech quality exceed those of the larger Qwen3-Omni-30B-A3B while using far less compute.
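A back-of-envelope check on the edge-deployment bullet, using assumed numbers (4-bit weights, a few gigabytes of KV cache and runtime overhead) rather than figures reported in the paper:

```python
# Back-of-envelope memory estimate; every constant here is an assumption.
params      = 9e9                      # 9B parameters (from the paper)
weight_gb   = params * 0.5 / 1e9       # int4 quantization: ~0.5 byte/param
kv_cache_gb = 1.5                      # assumed streaming KV-cache budget
runtime_gb  = 1.0                      # assumed encoders, decoder, buffers
total_gb    = weight_gb + kv_cache_gb + runtime_gb
print(f"~{total_gb:.1f} GB")           # ~7.0 GB, plausibly under 12 GB
```

Under these assumptions the 12 GB figure is unsurprising; at fp16 (2 bytes per parameter) the weights alone would be roughly 18 GB, so some quantization or offloading is implied.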

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Always-on devices could maintain environmental awareness without explicit wake words or repeated cloud round-trips.
  • The same timeline alignment could support natural interruption handling in multi-speaker conversations.
  • Extending the framework to new sensors would let assistants draw on richer ongoing context.
  • Open release at this scale may speed development of responsive personal AI that stays aware of its surroundings.

Load-bearing premise

The shared temporal axis must, on its own, be enough to produce genuine simultaneous perception and response, as well as proactive behavior, without extra rules or hidden turn detection.
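For contrast, the kind of hidden control logic this premise excludes looks roughly like the half-duplex loop below, where an external voice-activity detector (e.g., a VAD in the spirit of Silero VAD, reference [19] in the graph below) decides when a turn has ended. All callables here are hypothetical placeholders:

```python
def half_duplex_loop(vad, listen, generate_reply, speak):
    """The turn-based pattern the paper argues against: an auxiliary
    VAD declares end-of-turn, and only then does generation start.
    Illustrative only; `vad`, `listen`, etc. are hypothetical.
    """
    while True:
        utterance = []
        # Phase 1: listen until the detector declares the turn over.
        for chunk in listen():
            utterance.append(chunk)
            if vad.is_end_of_turn(chunk):
                break
        # Phase 2: respond; anything arriving now is simply ignored.
        speak(generate_reply(utterance))
```

If the premise holds, no gate like `is_end_of_turn` exists anywhere in the deployed stack; perception and generation co-occur on the shared timeline instead.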

What would settle it

Run a test in which the model is already speaking when the visual scene changes or new audio arrives; if it revises its ongoing speech or inserts a relevant comment within one or two seconds, the central claim holds.
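A hedged sketch of that test, assuming a hypothetical full-duplex client (`session` with `say`, `push_frame`, and a stream of timestamped output tokens); none of this is a real API from the paper:

```python
import time

def perturbation_test(session, new_frame, keyword, max_latency_s=2.0):
    """Inject a scene change while the model is mid-speech and measure
    how long the spoken output takes to reflect it. `keyword` is a
    lowercase word that would only appear if the change is noticed."""
    session.say("Describe what you see, in as much detail as you can.")
    t_inject = None
    for tok in session.output_tokens():            # streaming output tokens
        if t_inject is None and tok.t_s > 1.0:     # model is already speaking
            session.push_frame(new_frame)          # e.g., a new object appears
            t_inject = time.monotonic()
        elif t_inject is not None and keyword in tok.text.lower():
            latency = time.monotonic() - t_inject  # time to on-air revision
            return latency <= max_latency_s        # within 1–2 s: claim holds
    return False                                   # never revised: claim fails
```

Running this with scene changes the model was not prompted about would also probe the proactivity claim, since an unprompted comment counts the same as a revision.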

Figures

Figures reproduced from arXiv: 2604.27393 by Bokai Xu, Chaojun Xiao, Chi Chen, Chongyi Wang, Fuwei Huang, Guoyang Zeng, Hanyu Liu, Hongliang Wei, Huiping Liu, Jiancheng Gui, Jie Zhou, Junbo Cui, Kechen Fang, Luoyuan Zhang, Maosong Sun, Moye Chen, Qingxin Gui, Qingzhe Han, Rongkang Wang, Tianchi Cai, Tianran Wang, Tianyu Yu, Weiyue Sun, Wenshuo Ma, Xian Sun, Xu Han, Yankai Lin, Yaqi Zhang, Yingjing Xu, You Li, Yuan Yao, Yuxuan Li, Yuyang Wen, Zhihui He, Zhiyuan Liu, Zhuo Lin.

Figure 1
Figure 1: Evaluation results on diverse capabilities. MiniCPM-o 4.5 achieves state-of-the-art open-source vision-language performance at its scale, approaching Gemini 2.5 Flash. It also surpasses Qwen3-Omni-30B-A3B in omni-modal capabilities and speech generation quality.
Figure 2
Figure 2: Evolution of AI interaction paradigms. AI interaction has progressed from text-only to multimodal understanding and omni live streaming. MiniCPM-o 4.5 advances this trajectory toward more human-like full-duplex interaction by enabling simultaneous perception and response.
Figure 3
Figure 3: From turn-based interaction to full-duplex streaming. Existing interaction paradigms separate perception and response as alternating phases, leading to blocked information flow and passive behavior. In contrast, MiniCPM-o 4.5 continuously perceives incoming multimodal streams while speaking, allowing the model to update its response in real time and act proactively.
Figure 4
Figure 4: End-to-end omni-modal architecture of MiniCPM-o 4.5. Modality encoders, the LLM backbone, and speech decoders are connected through token-level hidden states in an end-to-end trainable architecture, with multimodal input and output streams aligned on a shared millisecond-level timeline for full-duplex streaming interaction.
Figure 5
Figure 5: Comparison of streaming speech generation strategies. Existing methods either (a) maintain a large text lead or (b) rely on a fixed text-speech ratio, making the spoken content lag behind the evolving environment. We propose Time-Aligned Interleaving (TAIL), which adaptively interleaves text and speech so that the text generated in each time chunk corresponds to approximately the same duration of speech playback. (A schematic sketch follows the figure list.)
Figure 6
Figure 6: Training set accuracy using different length penalty methods (curves: No Length Reward, Kimi K1.5-Style, Ours; x-axis: training steps, 100–400; y-axis: accuracy reward, 0.76–0.80).
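Schematically, the TAIL strategy in Figure 5 can be read as a budgeted interleaver: emit only as much text as covers one chunk of speech playback, then hand that chunk to the speech decoder before writing more text. The sketch below is an illustration under assumptions (a per-token duration estimator `est_speech_s`), not the authors' algorithm:

```python
def tail_interleave(text_tokens, est_speech_s, chunk_s=0.5):
    """Adaptively interleave text and speech so each text chunk covers
    roughly `chunk_s` seconds of spoken audio, keeping the text stream
    from running far ahead of what is actually being played."""
    chunk, budget = [], 0.0
    for tok in text_tokens:
        chunk.append(tok)
        budget += est_speech_s(tok)        # estimated seconds to speak `tok`
        if budget >= chunk_s:
            yield ("text", chunk)          # the text for this time chunk...
            yield ("speech", list(chunk))  # ...then its speech, before more text
            chunk, budget = [], 0.0
    if chunk:                              # flush the final partial chunk
        yield ("text", chunk)
        yield ("speech", list(chunk))
```

Compared with a fixed text-speech ratio, the budget adapts to how long each token actually takes to say, which is what keeps the two streams time-aligned.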
read the original abstract

Recent progress in multimodal large language models (MLLMs) has brought AI capabilities from static offline data processing to real-time streaming interaction, yet they still remain far from human-level multimodal interaction. The key bottlenecks are no longer modality coverage or latency alone, but the interaction paradigm itself. First, perception and response are still separated into alternating phases, preventing models from incorporating new inputs for timely adjustment during generation. Second, most current models remain reactive, responding only to explicit user requests instead of acting proactively in the evolving multimodal environment. We present MiniCPM-o 4.5, our latest effort towards human-like multimodal interaction, which mitigates these gaps by real-time full-duplex omni-modal interaction. It can see, listen, and speak simultaneously in real-time, while also exhibiting proactive behaviors such as issuing reminders or comments based on its continuous understanding of the live scene. The key technique behind MiniCPM-o 4.5 is Omni-Flow, a unified streaming framework that aligns omni-modal inputs and outputs along a shared temporal axis. This formulation converts conventional turn-based interaction into a full-duplex, time-aligned process, enabling simultaneous perception and response and allowing proactive behavior to arise within the same framework. With a total of 9B parameters, MiniCPM-o 4.5 approaches Gemini 2.5 Flash in vision-language capabilities, delivering state-of-the-art open-source performance at its scale. It also surpasses Qwen3-Omni-30B-A3B in omni-modal understanding and delivers better speech generation, with significantly higher computation efficiency. Driven by its efficient architecture design and inference optimization, the model can perform real-time full-duplex omni-modal interaction on edge devices with less than 12GB RAM cost.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces MiniCPM-o 4.5, a 9B-parameter omni-modal model for real-time full-duplex interaction. It claims the model can simultaneously see, listen, and speak while exhibiting proactive behaviors (e.g., unprompted reminders based on live scene understanding) via the Omni-Flow unified streaming framework, which aligns omni-modal inputs/outputs on a shared temporal axis to convert turn-based interaction into full-duplex time-aligned processing. Additional claims include approaching Gemini 2.5 Flash in vision-language capabilities, surpassing Qwen3-Omni-30B-A3B in omni-modal understanding, better speech generation, and efficient edge-device operation under 12GB RAM.

Significance. If the empirical claims hold under rigorous testing, the work would advance the field by addressing core interaction-paradigm bottlenecks in MLLMs beyond mere latency or modality coverage. The parameter-efficient design and potential for proactive, simultaneous perception-response could influence practical real-time applications. The framework's derivation of full-duplex and proactive capabilities directly from temporal alignment (without auxiliary rules) would be a meaningful contribution if isolated and validated.

major comments (3)
  1. [Abstract] The central claims of real-time simultaneous perception-response and proactive behavior arising from Omni-Flow are presented without any reported latency measurements, live-stream evaluations, ablation studies on the temporal alignment mechanism, or error bars; standard sequential omni-modal benchmarks do not directly test whether new inputs during generation trigger adjustments or unprompted comments.
  2. [Abstract] The assertion that Omni-Flow 'converts conventional turn-based interaction into a full-duplex, time-aligned process' enabling proactivity 'within the same framework' lacks isolation from possible unstated post-training control logic or hidden turn-taking; the reported SOTA and Gemini-comparable scores appear drawn from conventional benchmarks that do not evaluate this specific capability.
  3. [Abstract] The performance comparison stating the model 'approaches Gemini 2.5 Flash in vision-language capabilities' and 'surpasses Qwen3-Omni-30B-A3B' is unsupported by any numerical scores, dataset details, or experimental protocol in the provided text, undermining the scale-efficiency claims.
minor comments (2)
  1. [Abstract] The total parameter count of 9B is stated without breakdown across components or confirmation of whether it includes all modality encoders/decoders.
  2. [Abstract] The phrase 'significantly higher computation efficiency' is used without reference to specific metrics (e.g., tokens per second or FLOPs) or comparison baselines; a sketch of one such measurement follows this list.
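As a sketch of what pinning that phrase down could look like, here is a minimal decode-throughput probe; `generate` is a hypothetical callable returning a token list, and a real comparison would also fix hardware, batch size, and sequence lengths across models:

```python
import time

def tokens_per_second(generate, prompt, n_runs=5):
    """Wall-clock decode throughput over repeated runs, reported as a
    mean with a spread rather than a single unqualified number."""
    rates = []
    for _ in range(n_runs):
        t0 = time.monotonic()
        tokens = generate(prompt)                    # hypothetical interface
        rates.append(len(tokens) / (time.monotonic() - t0))
    mean = sum(rates) / len(rates)
    spread = max(rates) - min(rates)
    return mean, spread   # report as "X tok/s (± spread)" per hardware target
```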

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments, which help clarify how the abstract can better convey the empirical support for our claims. We address each point below and will revise the manuscript to improve transparency and rigor.

read point-by-point responses
  1. Referee: [Abstract] The central claims of real-time simultaneous perception-response and proactive behavior arising from Omni-Flow are presented without any reported latency measurements, live-stream evaluations, ablation studies on the temporal alignment mechanism, or error bars; standard sequential omni-modal benchmarks do not directly test whether new inputs during generation trigger adjustments or unprompted comments.

    Authors: We agree the abstract is too concise on these aspects. The full manuscript provides latency measurements, dedicated live-stream evaluations of simultaneous perception and response, ablations isolating the temporal alignment mechanism, and error bars on all quantitative tables. Custom streaming protocols (beyond standard benchmarks) are used to test dynamic input handling and proactivity. We will revise the abstract to reference these elements and the evaluation protocol. revision: yes

  2. Referee: [Abstract] The assertion that Omni-Flow 'converts conventional turn-based interaction into a full-duplex, time-aligned process' enabling proactivity 'within the same framework' lacks isolation from possible unstated post-training control logic or hidden turn-taking; the reported SOTA and Gemini-comparable scores appear drawn from conventional benchmarks that do not evaluate this specific capability.

    Authors: Omni-Flow derives full-duplex and proactive behavior directly from continuous temporal alignment of inputs and outputs, without separate turn-taking rules or auxiliary control logic; this is formalized in the method section. The SOTA results incorporate both standard benchmarks and our streaming-specific evaluations. To strengthen isolation, we will add an explicit comparison to a turn-based baseline in the revised manuscript and clarify the benchmark types in the abstract. revision: partial

  3. Referee: [Abstract] The performance comparison stating the model 'approaches Gemini 2.5 Flash in vision-language capabilities' and 'surpasses Qwen3-Omni-30B-A3B' is unsupported by any numerical scores, dataset details, or experimental protocol in the provided text, undermining the scale-efficiency claims.

    Authors: The abstract summarizes high-level outcomes; the Experiments section contains the full numerical scores, dataset details, and protocol for the vision-language and omni-modal comparisons. We will update the abstract to include representative numerical values and dataset references drawn from those results. revision: yes

Circularity Check

0 steps flagged

No circularity in derivation chain

full rationale

The paper presents MiniCPM-o 4.5 as an empirical system whose full-duplex omni-modal capabilities are attributed to the Omni-Flow framework, which is described as aligning inputs and outputs on a shared temporal axis. No equations, derivations, or first-principles reductions appear in the provided text that would make any claimed capability (simultaneous perception-response or proactive behavior) equivalent to its own inputs by construction. Performance claims are tied to benchmark results and architecture efficiency rather than self-referential definitions or fitted parameters renamed as predictions. The central claims therefore remain self-contained against external evaluation.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entity

The central claim depends on the effectiveness of the Omni-Flow framework and associated training procedures, but the abstract provides no explicit free parameters, axioms, or invented entities beyond the model architecture itself.

invented entities (1)
  • Omni-Flow · no independent evidence
    purpose: Unified streaming framework that aligns omni-modal inputs and outputs along a shared temporal axis to enable full-duplex interaction
    Presented as the key technical contribution that converts conventional turn-based interaction into a time-aligned process

pith-pipeline@v0.9.0 · 5764 in / 1187 out tokens · 67367 ms · 2026-05-07T09:20:13.619406+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

90 extracted references · 31 canonical work pages · 12 internal anchors

  1. [1]

    MiniCPM-V: A GPT-4V Level MLLM on Your Phone

    Yuan Yao, Tianyu Yu, Ao Zhang, Chongyi Wang, Junbo Cui, Hongji Zhu, Tianchi Cai, Haoyu Li, Weilin Zhao, Zhihui He, et al. MiniCPM-V: A GPT-4V Level MLLM on Your Phone. ArXiv preprint, abs/2408.01800, 2024

  2. [2]

    Minicpm-v 4.5: Cooking efficient mllms via architecture, data, and training recipe, 2025

    Tianyu Yu, Zefan Wang, Chongyi Wang, Fuwei Huang, Wenshuo Ma, Zhihui He, Tianchi Cai, Weize Chen, Yuxiang Huang, Yuanqian Zhao, Bokai Xu, Junbo Cui, Yingjing Xu, Liqing Ruan, Luoyuan Zhang, Hanyu Liu, Jingkun Tang, Hongyuan Liu, Qining Guo, Wenhao Hu, Bingxiang He, Jie Zhou, Jie Cai, Ji Qi, Zonghao Guo, Chi Chen, Guoyang Zeng, Yuxuan Li, Ganqu Cui, Ning D...

  3. [3]

    Qwen2.5-VL Technical Report, 2025

    Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, Humen Zhong, Yuanzhi Zhu, Mingkun Yang, Zhaohai Li, Jianqiang Wan, Pengfei Wang, Wei Ding, Zheren Fu, Yiheng Xu, Jiabo Ye, Xi Zhang, Tianbao Xie, Zesen Cheng, Hang Zhang, Zhibo Yang, Haiyang Xu, and Junyang Lin. Qwen2.5-VL Technical Report, 2025

  4. [4]

    Qwen3-vl technical report, 2025

    Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, Wenbin Ge, Zhifang Guo, Qidong Huang, Jie Huang, Fei Huang, Binyuan Hui, Shutong Jiang, Zhaohai Li, Mingsheng Li, Mei Li, Kaixin Li, Zicheng Lin, Junyang Lin, Xuejing Liu, Jiawei Liu, Chenglong Liu, Yang Liu, Dayiheng Liu, Shixuan ...

  5. [5]

    Llava-uhd: an lmm perceiving any aspect ratio and high-resolution images

    Zonghao Guo, Ruyi Xu, Yuan Yao, Junbo Cui, Zanlin Ni, Chunjiang Ge, Tat-Seng Chua, Zhiyuan Liu, and Gao Huang. Llava-uhd: an lmm perceiving any aspect ratio and high-resolution images. In European Conference on Computer Vision, pages 390–406. Springer, 2024

  6. [6]

    Sigmoid loss for language image pre-training

    Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, and Lucas Beyer. Sigmoid loss for language image pre-training. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 11975–11986, October 2023

  7. [7]

    Qwen3-omni technical report, 2025

    Jin Xu, Zhifang Guo, Hangrui Hu, Yunfei Chu, Xiong Wang, Jinzheng He, Yuxuan Wang, Xian Shi, Ting He, Xinfa Zhu, Yuanjun Lv, Yongqi Wang, Dake Guo, He Wang, Linhan Ma, Pei Zhang, Xinyu Zhang, Hongkun Hao, Zishan Guo, Baosong Yang, Bin Zhang, Ziyang Ma, Xipin Wei, Shuai Bai, Keqin Chen, Xuejing Liu, Peng Wang, Mingkun Yang, Dayiheng Liu, Xingzhang Ren, Bo ...

  8. [8]

    Robust speech recognition via large-scale weak supervision

    Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, and Ilya Sutskever. Robust speech recognition via large-scale weak supervision. In Andreas Krause, Emma Brunskill, Kyunghyun Cho, Barbara Engelhardt, Sivan Sabato, and Jonathan Scarlett, editors, Proceedings of the 40th International Conference on Machine Learning, volume 202 of Pro...

  9. [9]

    Wenet: Production oriented streaming and non-streaming end-to-end speech recognition toolkit

    Zhuoyuan Yao, Di Wu, Xiong Wang, Binbin Zhang, Fan Yu, Chao Yang, Zhendong Peng, Xiaoyu Chen, Lei Xie, and Xin Lei. Wenet: Production oriented streaming and non-streaming end-to-end speech recognition toolkit. In Interspeech, volume 2021, pages 4054–4058, 2021

  10. [10]

    Qwen3 Technical Report, 2025

    Qwen Team. Qwen3 Technical Report, 2025

  11. [11]

    Mini-omni: Language models can hear, talk while thinking in streaming, 2024

    Zhifei Xie and Changqiao Wu. Mini-omni: Language models can hear, talk while thinking in streaming, 2024

  12. [12]

    Step-audio 2 technical report, 2025

    Boyong Wu, Chao Yan, Chen Hu, Cheng Yi, Chengli Feng, Fei Tian, Feiyu Shen, Gang Yu, Haoyang Zhang, Jingbei Li, Mingrui Chen, Peng Liu, Wang You, Xiangyu Tony Zhang, Xingyuan Li, Xuerui Yang, Yayue Deng, Yechang Huang, Yuxin Li, Yuxin Zhang, Zhao You, Brian Li, Changyi Wan, Hanpeng Hu, Jiangjie Zhen, Siyu Chen, Song Yuan, Xuelin Zhang, Yimin Jiang, Yu Zho...

  13. [13]

    Analyzing mitigation strategies for catastrophic forgetting in end-to-end training of spoken language models

    Chi-Yuan Hsiao, Ke-Han Lu, Kai-Wei Chang, Chih-Kai Yang, Wei-Chih Chen, and Hung-yi Lee. Analyzing mitigation strategies for catastrophic forgetting in end-to-end training of spoken language models. arXiv preprint arXiv:2505.17496, 2025

  14. [14]

    Qwen2.5-omni technical report, 2025

    Jin Xu, Zhifang Guo, Jinzheng He, Hangrui Hu, Ting He, Shuai Bai, Keqin Chen, Jialin Wang, Yang Fan, Kai Dang, Bin Zhang, Xiong Wang, Yunfei Chu, and Junyang Lin. Qwen2.5-omni technical report, 2025

  15. [15]

    Cosyvoice: A scalable multilingual zero-shot text-to-speech synthesizer based on supervised semantic tokens, 2024

    Zhihao Du, Qian Chen, Shiliang Zhang, Kai Hu, Heng Lu, Yexin Yang, Hangrui Hu, Siqi Zheng, Yue Gu, Ziyang Ma, Zhifu Gao, and Zhijie Yan. Cosyvoice: A scalable multilingual zero-shot text-to-speech synthesizer based on supervised semantic tokens, 2024

  16. [16]

    Cosyvoice 2: Scalable streaming speech synthesis with large language models, 2024

    Zhihao Du, Yuxuan Wang, Qian Chen, Xian Shi, Xiang Lv, Tianyu Zhao, Zhifu Gao, Yexin Yang, Changfeng Gao, Hui Wang, Fan Yu, Huadai Liu, Zhengyan Sheng, Yue Gu, Chong Deng, Wen Wang, Shiliang Zhang, Zhijie Yan, and Jingren Zhou. Cosyvoice 2: Scalable streaming speech synthesis with large language models, 2024

  17. [17]

    A statistical model-based voice activity detection

    Jongseo Sohn, Nam Soo Kim, and Wonyong Sung. A statistical model-based voice activity detection. IEEE Signal Processing Letters, 6(1):1–3, 1999

  18. [18]

    Qwen3-Omni Technical Report

    Jin Xu, Zhifang Guo, Hangrui Hu, Yunfei Chu, Xiong Wang, Jinzheng He, Yuxuan Wang, Xian Shi, Ting He, Xinfa Zhu, et al. Qwen3-omni technical report. arXiv preprint arXiv:2509.17765, 2025

  19. [19]

    Silero VAD: pre-trained enterprise-grade voice activity detector (VAD), number detector and language classifier

    Silero Team. Silero VAD: pre-trained enterprise-grade voice activity detector (VAD), number detector and language classifier. https://github.com/snakers4/silero-vad, 2024

  20. [20]

    Robust speech recognition via large-scale weak supervision, 2022

    Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, and Ilya Sutskever. Robust speech recognition via large-scale weak supervision, 2022

  21. [21]

    Paraformer: Fast and accurate parallel transformer for non-autoregressive end-to-end speech recognition, 2023

    Zhifu Gao, Shiliang Zhang, Ian McLoughlin, and Zhijie Yan. Paraformer: Fast and accurate parallel transformer for non-autoregressive end-to-end speech recognition, 2023

  22. [22]

    Leveraging self-supervised learning for speaker diarization, 2024

    Jiangyu Han, Federico Landini, Johan Rohdin, Anna Silnova, Mireia Diez, and Lukas Burget. Leveraging self-supervised learning for speaker diarization, 2024

  23. [23]

    Music source separation in the waveform domain, 2021

    Alexandre Défossez, Nicolas Usunier, Léon Bottou, and Francis Bach. Music source separation in the waveform domain, 2021

  24. [24]

    CapsFusion: Rethinking Image-Text Data at Scale

    Qiying Yu, Quan Sun, Xiaosong Zhang, Yufeng Cui, Fan Zhang, Yue Cao, Xinlong Wang, and Jingjing Liu. CapsFusion: Rethinking Image-Text Data at Scale. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2024, Seattle, WA, USA, June 16-22, 2024, pages 14022–14032. IEEE, 2024

  25. [25]

    Minicpm4: Ultra-efficient llms on end devices

    MiniCPM Team, Chaojun Xiao, Yuxuan Li, Xu Han, Yuzhuo Bai, Jie Cai, Haotian Chen, Wentong Chen, Xin Cong, Ganqu Cui, et al. Minicpm4: Ultra-efficient llms on end devices. arXiv preprint arXiv:2506.07900, 2025

  26. [26]

    Paddleocr 3.0 technical report, 2025

    Cheng Cui, Ting Sun, Manhui Lin, Tingquan Gao, Yubo Zhang, Jiaxuan Liu, Xueqing Wang, Zelun Zhang, Changda Zhou, Hongen Liu, Yue Zhang, Wenyu Lv, Kui Huang, Yichao Zhang, Jing Zhang, Jun Zhang, Yi Liu, Dianhai Yu, and Yanjun Ma. Paddleocr 3.0 technical report, 2025

  27. [27]

    Livecc: Learning video llm with streaming speech transcription at scale, 2025

    Joya Chen, Ziyun Zeng, Yiqi Lin, Wei Li, Zejun Ma, and Mike Zheng Shou. Livecc: Learning video llm with streaming speech transcription at scale, 2025

  28. [28]

    DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

    Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Y Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. ArXiv preprint, abs/2402.03300, 2024

  29. [29]

    Compassverifier: A unified and robust verifier for llms evaluation and outcome reward

    Shudong Liu, Hongwei Liu, Junnan Liu, Linchen Xiao, Songyang Gao, Chengqi Lyu, Yuzhe Gu, Wenwei Zhang, Derek F Wong, Songyang Zhang, and Kai Chen. Compassverifier: A unified and robust verifier for llms evaluation and outcome reward. arXiv preprint arXiv:2508.03686, 2025

  30. [30]

    Kimi k1.5: Scaling Reinforcement Learning with LLMs

    Kimi Team, Angang Du, Bofei Gao, Bowei Xing, Changjiu Jiang, Cheng Chen, Cheng Li, Chenjun Xiao, Chenzhuang Du, Chonghua Liao, et al. Kimi k1.5: Scaling reinforcement learning with llms. ArXiv preprint, abs/2501.12599, 2025

  31. [31]

    RLAIF-V: Open-Source AI Feedback Leads to Super GPT-4V Trustworthiness, 2024

    Tianyu Yu, Haoye Zhang, Qiming Li, Qixin Xu, Yuan Yao, Da Chen, Xiaoman Lu, Ganqu Cui, Yunkai Dang, Taiwen He, Xiaocheng Feng, Jun Song, Bo Zheng, Zhiyuan Liu, Tat-Seng Chua, and Maosong Sun. RLAIF-V: Open-Source AI Feedback Leads to Super GPT-4V Trustworthiness, 2024

  32. [32]

    OpenCompass: A Universal Evaluation Platform for Foundation Models

    OpenCompass Contributors. OpenCompass: A Universal Evaluation Platform for Foundation Models. https://github.com/open-compass/opencompass, 2023

  33. [33]

    Mmbench: Is your multi-modal model an all-around player?

    Yuan Liu, Haodong Duan, Yuanhan Zhang, Bo Li, Songyang Zhang, Wangbo Zhao, Yike Yuan, Jiaqi Wang, Conghui He, Ziwei Liu, et al. Mmbench: Is your multi-modal model an all-around player? In European Conference on Computer Vision, pages 216–233. Springer, 2024

  34. [34]

    MM-Vet: Evaluating Large Multimodal Models for Integrated Capabilities

    Weihao Yu, Zhengyuan Yang, Linjie Li, Jianfeng Wang, Kevin Lin, Zicheng Liu, Xinchao Wang, and Lijuan Wang. MM-Vet: Evaluating Large Multimodal Models for Integrated Capabilities. In Forty-first International Conference on Machine Learning, ICML 2024, Vienna, Austria, July 21-27, 2024. OpenReview.net, 2024

  35. [35]

    Are We on the Right Way for Evaluating Large Vision-Language Models?

    Lin Chen, Jinsong Li, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Zehui Chen, Haodong Duan, Jiaqi Wang, Yu Qiao, Dahua Lin, and Feng Zhao. Are We on the Right Way for Evaluating Large Vision-Language Models? In Amir Globersons, Lester Mackey, Danielle Belgrave, Angela Fan, Ulrich Paquet, Jakub M. Tomczak, and Cheng Zhang, editors, Advances in Neural Information P...

  36. [36]

    MMMU: A massive multi-discipline multimodal understanding and reasoning benchmark for expert AGI

    Xiang Yue, Yuansheng Ni, Tianyu Zheng, Kai Zhang, Ruoqi Liu, Ge Zhang, Samuel Stevens, Dongfu Jiang, Weiming Ren, Yuxuan Sun, Cong Wei, Botao Yu, Ruibin Yuan, Renliang Sun, Ming Yin, Boyuan Zheng, Zhenzhu Yang, Yibo Liu, Wenhao Huang, Huan Sun, Yu Su, and Wenhu Chen. MMMU: A massive multi-discipline multimodal understanding and reasoning benchmark for e...

  37. [37]

    Mathvista: Evaluating mathematical reasoning of foundation models in visual contexts

    Pan Lu, Hritik Bansal, Tony Xia, Jiacheng Liu, Chunyuan Li, Hannaneh Hajishirzi, Hao Cheng, Kai-Wei Chang, Michel Galley, and Jianfeng Gao. Mathvista: Evaluating mathematical reasoning of foundation models in visual contexts. In Proc. of ICLR. OpenReview.net, 2024

  38. [38]

    A Diagram is Worth a Dozen Images

    Aniruddha Kembhavi, Michael Salvato, Eric Kolve, Minjoon Seo, Hannaneh Hajishirzi, and Ali Farhadi. A Diagram is Worth a Dozen Images. In European Conference on Computer Vision (ECCV), 2016

  39. [39]

    Mmt-bench: A comprehensive multimodal benchmark for evaluating large vision-language models towards multitask AGI

    Kaining Ying, Fanqing Meng, Jin Wang, Zhiqian Li, Han Lin, Yue Yang, Hao Zhang, Wenbo Zhang, Yuqi Lin, Shuo Liu, Jiayi Lei, Quanfeng Lu, Runjian Chen, Peng Xu, Renrui Zhang, Haozhe Zhang, Peng Gao, Yali Wang, Yu Qiao, Ping Luo, Kaipeng Zhang, and Wenqi Shao. Mmt-bench: A comprehensive multimodal benchmark for evaluating large vision-language models toward...

  40. [40]

    Mm-ifengine: Towards multimodal instruction following.ArXiv preprint, abs/2504.07957, 2025

    Shengyuan Ding, Shenxi Wu, Xiangyu Zhao, Yuhang Zang, Haodong Duan, Xiaoyi Dong, Pan Zhang, Yuhang Cao, Dahua Lin, and Jiaqi Wang. Mm-ifengine: Towards multimodal instruction following. ArXiv preprint, abs/2504.07957, 2025

  41. [41]

    OCRBench: On the hidden mystery of OCR in large multimodal models

    Yuliang Liu, Zhang Li, Hongliang Li, Wenwen Yu, Mingxin Huang, Dezhi Peng, Mingyu Liu, Mingrui Chen, Chunyuan Li, Lianwen Jin, and Xiang Bai. OCRBench: On the hidden mystery of OCR in large multimodal models. Science China Information Sciences, 2024

  42. [42]

    TextVQA: Towards VQA requiring reasoning about text

    Amanpreet Singh, Vivek Natarajan, Meet Shah, Yu Jiang, Xinlei Chen, Dhruv Batra, Devi Parikh, and Marcus Rohrbach. TextVQA: Towards VQA requiring reasoning about text. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019

  43. [43]

    DocVQA: A dataset for VQA on document images

    Minesh Mathew, Dimosthenis Karatzas, R. Manmatha, and C. V. Jawahar. DocVQA: A dataset for VQA on document images. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2021

  44. [44]

    OmniDocBench: Benchmarking Diverse PDF Document Parsing with Comprehensive Annotations, 2024

    Linke Ouyang, Yuan Qu, Hongbin Zhou, Jiawei Zhu, Rui Zhang, Qunshu Lin, Bin Wang, Zhiyuan Zhao, Man Jiang, Xiaomeng Zhao, Jin Shi, Fan Wu, Pei Chu, Minghao Liu, Zhenxiang Li, Chao Xu, Bo Zhang, Botian Shi, Zhongying Tu, and Conghui He. OmniDocBench: Benchmarking Diverse PDF Document Parsing with Comprehensive Annotations, 2024

  45. [45]

    Mantis: Interleaved multi-image instruction tuning

    Dongfu Jiang, Xuan He, Huaye Zeng, Cong Wei, Max Ku, Qian Liu, and Wenhu Chen. Mantis: Interleaved multi-image instruction tuning. ArXiv preprint, abs/2405.01483, 2024

  46. [46]

    Muirbench: A comprehensive benchmark for robust multi-image understanding

    Fei Wang, Xingyu Fu, James Y Huang, Zekun Li, Qin Liu, Xiaogeng Liu, Mingyu Derek Ma, Nan Xu, Wenxuan Zhou, Kai Zhang, et al. Muirbench: A comprehensive benchmark for robust multi-image understanding. arXiv preprint arXiv:2406.09411, 2024

  47. [47]

    Mmsi-bench: A benchmark for multi-image spatial intelligence

    Sihan Yang, Runsen Xu, Yiman Xie, Sizhe Yang, Mo Li, Jingli Lin, Chenming Zhu, Xiaochen Chen, Haodong Duan, Xiangyu Yue, et al. Mmsi-bench: A benchmark for multi-image spatial intelligence. arXiv preprint arXiv:2505.23764, 2025

  48. [48]

    HallusionBench: An advanced diagnostic suite for entangled language hallucination and visual illusion in large vision-language models

    Tianrui Guan, Fuxiao Liu, Xiyang Wu, Ruiqi Xian, Zongxia Li, Xiaoyu Liu, Xijun Wang, Lichang Chen, Furong Huang, Yaser Yacoob, Dinesh Manocha, and Tianyi Zhou. HallusionBench: An advanced diagnostic suite for entangled language hallucination and visual illusion in large vision-language models. In IEEE/CVF Conference on Computer Vision and Pattern Recogni...

  49. [49]

    Aligning large multimodal models with factually augmented RLHF

    Zhiqing Sun, Sheng Shen, Shengcao Cao, Haotian Liu, Chunyuan Li, Yikang Shen, Chuang Gan, Liang-Yan Gui, Yu-Xiong Wang, Yiming Yang, et al. Aligning large multimodal models with factually augmented RLHF. ArXiv preprint, abs/2309.14525, 2023

  50. [50]

    Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis

    Chaoyou Fu, Yuhan Dai, Yondong Luo, Lei Li, Shuhuai Ren, Renrui Zhang, Zihan Wang, Chenyu Zhou, Yunhang Shen, Mengdan Zhang, et al. Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis. 2025

  51. [51]

    LVBench: An Extreme Long Video Understanding Benchmark

    Weihan Wang, Zehai He, Wenyi Hong, Yean Cheng, Xiaohan Zhang, Ji Qi, Shiyu Huang, Bin Xu, Yuxiao Dong, Ming Ding, and Jie Tang. LVBench: An Extreme Long Video Understanding Benchmark. ArXiv preprint, abs/2406.08035, 2024

  52. [52]

    Mlvu: Benchmarking multi-task long video understanding

    Junjie Zhou, Yan Shu, Bo Zhao, Boya Wu, Zhengyang Liang, Shitao Xiao, Minghao Qin, Xi Yang, Yongping Xiong, Bo Zhang, et al. Mlvu: Benchmarking multi-task long video understanding. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 13691–13701, 2025

  53. [53]

    Longvideobench: A benchmark for long-context interleaved video-language understanding

    Haoning Wu, Dongxu Li, Bei Chen, and Junnan Li. Longvideobench: A benchmark for long-context interleaved video-language understanding. In Amir Globersons, Lester Mackey, Danielle Belgrave, Angela Fan, Ulrich Paquet, Jakub M. Tomczak, and Cheng Zhang, editors, Advances in Neural Information Pro...

  54. [54]

    MotionBench: Benchmarking and Improving Fine-grained Video Motion Understanding for Vision Language Models, 2024

    Wenyi Hong*, Yean Cheng*, Zhuoyi Yang*, Weihan Wang, Lefan Wang, Xiaotao Gu, Shiyu Huang, Yuxiao Dong, and Jie Tang. MotionBench: Benchmarking and Improving Fine-grained Video Motion Understanding for Vision Language Models, 2024

  55. [55]

    Aishell-1: An open-source mandarin speech corpus and a speech recognition baseline

    Hui Bu, Jiatong Du, Xingyu Na, Bengu Wu, and Hao Zheng. Aishell-1: An open-source mandarin speech corpus and a speech recognition baseline. In 2017 20th Conference of the Oriental Chapter of the International Coordinating Committee on Speech Databases and Speech I/O Systems and Assessment (O-COCOSDA), pages 1–5. IEEE, 2017

  56. [56]

    Aishell-2: Transforming mandarin asr research into industrial scale

    Jiatong Du, Xingyu Na, Xuechen Liu, and Hui Bu. Aishell-2: Transforming mandarin asr research into industrial scale. arXiv preprint arXiv:1808.10583, 2018

  57. [57]

    Wenetspeech: A 10000+ hours multi-domain mandarin corpus for speech recognition

    Binbin Zhang, Hang Lv, Haowen Guo, et al. Wenetspeech: A 10000+ hours multi-domain mandarin corpus for speech recognition. In ICASSP, pages 6182–6186. IEEE, 2022

  58. [58]

    Librispeech: An ASR corpus based on public domain audio books

    Vassil Panayotov, Guoguo Chen, Daniel Povey, and Sanjeev Khudanpur. Librispeech: An ASR corpus based on public domain audio books. In ICASSP, pages 5206–5210. IEEE, 2015

  59. [59]

    Gigaspeech: An evolving, multi-domain ASR corpus with 10,000 hours of transcribed audio

    Guoguo Chen, Wei Chai, Jiatong Wang, et al. Gigaspeech: An evolving, multi-domain ASR corpus with 10,000 hours of transcribed audio. In Interspeech, pages 3670–3674, 2021

  60. [60]

    VoxPopuli: A large-scale multilingual speech corpus for representation learning, semi-supervised learning and interpretation

    Changhan Wang, Morgane Riviere, Ann Lee, Anne Wu, Chaitanya Talnikar, Daniel Haziza, Mary Williamson, Juan Pino, and Emmanuel Dupoux. VoxPopuli: A large-scale multilingual speech corpus for representation learning, semi-supervised learning and interpretation. In ACL-IJCNLP, pages 993–1003, 2021

  61. [61]

    CoVoST 2 and massively multilingual speech-to-text translation

    Changhan Wang, Yun Tang, Xutai Ma, Anne Wu, Dmytro Okhonko, and Juan Pino. CoVoST 2 and massively multilingual speech-to-text translation. arXiv preprint arXiv:2007.10310, 2020

  62. [62]

    MELD: A multimodal multi-party dataset for emotion recognition in conversations

    Soujanya Poria, Devamanyu Hazarika, Navonil Majumder, Gautam Naik, Erik Cambria, and Rada Mihalcea. MELD: A multimodal multi-party dataset for emotion recognition in conversations. In ACL, pages 527–536, 2019

  63. [63]

    VoiceBench: Benchmarking LLM-based voice assistants

    Yiming Chen, Xianghu Yue, Chen Zhang, Xiaoxue Gao, Robby T. Tan, and Haizhou Li. VoiceBench: Benchmarking LLM-based voice assistants. arXiv preprint arXiv:2410.17196, 2024

  64. [64]

    Triviaqa: A large scale distantly supervised challenge dataset for reading comprehension

    Mandar Joshi, Eunsol Choi, Daniel Weld, and Luke Zettlemoyer. Triviaqa: A large scale distantly supervised challenge dataset for reading comprehension. InACL, pages 1601–1611, 2017

  65. [65]

    Semantic parsing on freebase from question-answer pairs

    Jonathan Berant, Andrew Chou, Roy Frostig, and Percy Liang. Semantic parsing on freebase from question-answer pairs. InEMNLP, pages 1533–1544, 2013

  66. [66]

    CMMLU: Measuring Massive Multitask Language Understanding in Chinese

    Haoran Li et al. CMMLU: Measuring massive multitask language understanding in chinese. arXiv preprint arXiv:2306.09212, 2023

  67. [67]

    Seed-tts: A family of high-quality versatile speech generation models, 2024

    Philip Anastassiou, Jiawei Chen, Jitong Chen, Yuanzhe Chen, Zhuo Chen, Ziyi Chen, Jian Cong, Lelai Deng, Chuang Ding, Lu Gao, Mingqing Gong, Peisong Huang, Qingqing Huang, Zhiying Huang, Yuanyuan Huo, Dongya Jia, Chumin Li, Feiya Li, Hui Li, Jiaxin Li, Xiaoyang Li, Xingxing Li, Lin Liu, Shouda Liu, Sichao Liu, Xudong Liu, Yuchen Liu, Zhengxi Liu, Lu Lu, J...

  68. [68]

    MGM-Omni: Scaling omni LLMs to personalized long-horizon speech

    Chengyao Wang, Zhisheng Zhong, Bohao Peng, Senqiao Yang, Yuqi Liu, Haokun Gui, Bin Xia, Jingyao Li, Bei Yu, and Jiaya Jia. MGM-Omni: Scaling omni LLMs to personalized long-horizon speech. arXiv preprint arXiv:2509.25131, 2025

  69. [69]

    Expresso: A benchmark and analysis of discrete expressive speech resynthesis

    Tu Anh Nguyen, Wei-Ning Hsu, Antony D’Avirro, Bowen Shi, Itai Gat, Maryam Fazel-Zarani, Tal Remez, Jade Copet, Gabriel Synnaeve, Michael Hassid, Felix Kreuk, Yossi Adi, and Emmanuel Dupoux. Expresso: A benchmark and analysis of discrete expressive speech resynthesis. In Interspeech, pages 4823–4827, 2023

  70. [70]

    Emotional speech dataset (ESD): A multi-style emotional speech dataset for speech synthesis and voice conversion

    Kun Zhou, Berrak Sisman, Rui Liu, and Haizhou Li. Emotional speech dataset (ESD): A multi-style emotional speech dataset for speech synthesis and voice conversion. In Interspeech, pages 3361–3365, 2021

  71. [71]

    Instruction-Following Evaluation for Large Language Models

    Jeffrey Zhou, Tianjian Lu, Swaroop Mishra, Siddhartha Brahma, Sujoy Basu, Yi Luan, Denny Zhou, and Le Hou. Instruction-following evaluation for large language models. arXiv preprint arXiv:2311.07911, 2023

  72. [72]

    Measuring massive multitask language understanding

    Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. ICLR, 2021

  73. [73]

    Cmmlu: Measuring massive multitask language understanding in chinese

    Haonan Li, Yixuan Zhang, Fajri Koto, Yifei Yang, Hai Zhao, Yeyun Gong, Nan Duan, and Timothy Baldwin. Cmmlu: Measuring massive multitask language understanding in chinese. In Findings of the Association for Computational Linguistics: ACL 2024, pages 11260–11285, 2024

  74. [74]

    Challenging big-bench tasks and whether chain-of-thought can solve them

    Mirac Suzgun, Nathan Scales, Nathanael Schärli, Sebastian Gehrmann, Yi Tay, Hyung Won Chung, Aakanksha Chowdhery, Quoc Le, Ed Chi, Denny Zhou, et al. Challenging big-bench tasks and whether chain-of-thought can solve them. In Findings of the Association for Computational Linguistics: ACL 2023, pages 13003–13051, 2023

  75. [75]

    Measuring Mathematical Problem Solving With the MATH Dataset

    Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the math dataset. arXiv preprint arXiv:2103.03874, 2021

  76. [76]

    Training Verifiers to Solve Math Word Problems

    Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021

  77. [77]

    Evaluating Large Language Models Trained on Code

    Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde De Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374, 2021

  78. [78]

    Program Synthesis with Large Language Models

    Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, and Charles Sutton. Program synthesis with large language models. arXiv preprint arXiv:2108.07732, 2021

  79. [79]

    Daily-omni: Towards audio-visual reasoning with temporal alignment across modalities, 2025

    Ziwei Zhou, Rui Wang, Zuxuan Wu, and Yu-Gang Jiang. Daily-omni: Towards audio-visual reasoning with temporal alignment across modalities. arXiv preprint arXiv:2505.17862, 2025

  80. [80]

    WorldSense: Evaluating real-world omnimodal understanding for multimodal LLMs

    Jack Hong, Shilin Yan, Jiayin Cai, Xiaolong Jiang, Yao Hu, and Weidi Xie. Worldsense: Evaluating real-world omnimodal understanding for multimodal llms. arXiv preprint arXiv:2502.04326, 2025

Showing first 80 references.