Omni-o3: Deep Nested Omnimodal Deduction for Deliberative Audio-Visual Reasoning

Jufeng Yang; Meng Wang; Pengfei Wan; Weicheng Wang; Wentao Gu; Wenyu Qin; Yongjie Zhu; Zhicheng Zhang

arxiv: 2604.24191 · v1 · submitted 2026-04-27 · 💻 cs.CV

Zhicheng Zhang, Wentao Gu, Weicheng Wang, Yongjie Zhu, Wenyu Qin, Meng Wang, Pengfei Wan, Jufeng Yang

Pith reviewed 2026-05-08 04:46 UTC · model grok-4.3

classification 💻 cs.CV

keywords omnimodal reasoning · audio-visual reasoning · recursive search · nested deduction · deliberative reasoning · reinforcement learning · multimodal AI · cross-modal interactions

The pith

Omni-o3 formulates audio-visual reasoning as recursive search that shares promising intermediate paths across branches to reduce errors.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Current reasoning methods for audio and video tasks generate isolated trajectories either one step at a time or through separate parallel samples. This isolation prevents reuse of good intermediate results and lets mistakes accumulate in the large space of cross-modal interactions. The paper shows that a deep nested deduction policy turns reasoning into dynamic recursive search with shared prefixes, letting the model repeatedly expand options, select among them, simulate outcomes, and backtrack. A two-stage process first teaches the recursive patterns through supervised learning on distilled long chains, then refines them with group-based reinforcement learning and multi-step rewards. The outcome is competitive results across eleven benchmarks for combined audio-visual, visual-only, and audio-only reasoning.

Core claim

Omni-o3 introduces a deep nested deduction policy that formulates reasoning as dynamic recursive search with shared prefixes across branches. This enables iterative execution of four atomic actions: expansion, selection, simulation, and backpropagation. The policy is learned first by cold-start supervised fine-tuning on 101K high-quality long-chain trajectories distilled from 3.5M omnimodal samples, then by nested group rollout-driven exploratory reinforcement learning on 18K complex multi-turn samples guided by a multi-step reward model. The resulting model reaches competitive performance on 11 benchmarks and unlocks stronger capabilities in comprehensive audio-visual, visual-centric, and audio-centric reasoning tasks.
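The four actions named above match the phases of classic Monte Carlo tree search, so a generic MCTS loop gives a feel for how shared prefixes arise. The sketch below is not the paper's policy: the toy reward, the `TARGET` chain, the `ACTIONS` set, the UCB1 selection rule, and every function name are assumptions chosen for illustration only.

```python
import math
import random
from dataclasses import dataclass, field


@dataclass
class Node:
    prefix: tuple                      # steps shared by every descendant
    parent: "Node | None" = None
    children: dict = field(default_factory=dict)
    visits: int = 0
    value: float = 0.0                 # sum of rewards backed up through here


ACTIONS = (0, 1, 2)                    # hypothetical candidate steps
DEPTH = 3                              # hypothetical chain length
TARGET = (2, 0, 1)                     # hypothetical "correct" chain (toy reward)


def reward(seq):
    # Toy score: fraction of positions that match TARGET.
    return sum(a == b for a, b in zip(seq, TARGET)) / DEPTH


def select(node, c=1.4):
    # Selection: descend through fully expanded nodes using UCB1.
    while len(node.prefix) < DEPTH and len(node.children) == len(ACTIONS):
        node = max(
            node.children.values(),
            key=lambda ch: ch.value / ch.visits
            + c * math.sqrt(math.log(node.visits) / ch.visits),
        )
    return node


def expand(node, rng):
    # Expansion: add one untried step; the child reuses the parent's prefix.
    if len(node.prefix) == DEPTH:
        return node
    untried = [a for a in ACTIONS if a not in node.children]
    step = rng.choice(untried)
    child = Node(prefix=node.prefix + (step,), parent=node)
    node.children[step] = child
    return child


def simulate(node, rng):
    # Simulation: complete the chain with random steps and score it.
    seq = list(node.prefix)
    while len(seq) < DEPTH:
        seq.append(rng.choice(ACTIONS))
    return reward(seq)


def backpropagate(node, r):
    # Backpropagation: update statistics along the shared prefix up to the root.
    while node is not None:
        node.visits += 1
        node.value += r
        node = node.parent


def search(iterations=200, seed=0):
    rng = random.Random(seed)
    root = Node(prefix=())
    for _ in range(iterations):
        leaf = expand(select(root), rng)
        backpropagate(leaf, simulate(leaf, rng))
    return root


def best_path(root):
    # Follow the most-visited child at each level.
    node, path = root, []
    while node.children:
        node = max(node.children.values(), key=lambda ch: ch.visits)
        path = list(node.prefix)
    return path
```

Calling `best_path(search())` returns the most-visited chain in the toy tree. Every sibling stores only its own last step on top of the parent's `prefix`, which is the sense in which branches share intermediate work here.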

What carries the argument

deep nested deduction policy - formulates reasoning as dynamic recursive search with shared prefixes to enable iterative atomic cognitive actions of expansion, selection, simulation, and backpropagation

If this is right

Reasoning trajectories can reuse promising intermediate paths instead of remaining isolated, raising exploration efficiency in large cross-modal search spaces.
Compounding errors decrease because backpropagation can correct earlier branches without restarting entire sequences.
The framework supports more deliberative handling of complex multi-turn audio-visual interactions than sequential or parallel baselines.
Two-stage training first installs recursive search patterns then applies targeted reinforcement to deepen them.
Competitive results appear across comprehensive audio-visual, visual-centric, and audio-centric reasoning benchmarks.
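The efficiency point in the first bullet can be made concrete with a step count. The sketch below compares the number of reasoning steps generated when each trajectory is produced in isolation against the number of distinct steps when identical prefixes are stored once. The trajectory format (lists of step labels) and the function name `generation_cost` are assumptions for illustration; the paper's actual accounting may differ.

```python
def generation_cost(trajectories):
    """Return (isolated_steps, shared_steps) for a list of step sequences.

    isolated_steps counts every step of every trajectory.
    shared_steps counts each distinct prefix once, as a prefix tree would.
    """
    isolated = sum(len(t) for t in trajectories)
    unique_prefixes = {
        tuple(t[:i]) for t in trajectories for i in range(1, len(t) + 1)
    }
    return isolated, len(unique_prefixes)


# Three hypothetical chains that agree on their early steps.
chains = [
    ["a", "b", "c"],
    ["a", "b", "d"],
    ["a", "e", "f"],
]
isolated, shared = generation_cost(chains)
```

In this toy case the isolated count is 9 and the shared count is 6. The gap grows with the length of the common prefix and the number of branches that reuse it.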

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same prefix-sharing mechanism might improve efficiency in text-only or other single-modality reasoning settings where search spaces are also large.
The recursive structure connects naturally to classic tree-search ideas that avoid redundant computation by caching common prefixes.
Further increasing the allowed depth of nesting could be tested on longer multi-turn sequences to measure where new failure modes appear.
The approach suggests that reward models focused on multi-step progress may be more effective than single-step rewards for training deliberative multimodal systems.
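The contrast between multi-step and single-step rewards in the last bullet can be illustrated with a small credit-assignment sketch. It assumes a hypothetical reward model that emits one score per reasoning step plus a final outcome score, and it uses a plain discounted return-to-go; none of these choices are confirmed by the paper.

```python
def outcome_only_credit(num_steps, final_reward):
    # Single-signal baseline: every step receives the same final reward.
    return [final_reward] * num_steps


def multi_step_credit(step_rewards, final_reward, gamma=1.0):
    # Return-to-go per step: own reward plus discounted future rewards,
    # with the final outcome reward arriving after the last step.
    returns = []
    running = final_reward
    for r in reversed(step_rewards):
        running = r + gamma * running
        returns.append(running)
    return returns[::-1]


# Hypothetical step scores: the second step is penalised.
step_scores = [0.5, -1.0, 0.5]
flat = outcome_only_credit(len(step_scores), 1.0)
shaped = multi_step_credit(step_scores, 1.0)
```

With the outcome-only scheme all three steps receive the same credit, whereas the step-aware variant assigns a different return to each position, so a step that precedes a penalised step receives less credit than the steps after it.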

Load-bearing premise

That formulating reasoning as dynamic recursive search with shared prefixes will inherently reduce compounding errors and improve exploration efficiency in omnimodal tasks without introducing new failure modes from the recursive structure.

What would settle it

An ablation study in which the prefix-sharing mechanism is removed while retaining all other components, followed by re-evaluation on the same 11 benchmarks; if performance stays the same or improves, the central claim about shared prefixes would be falsified.
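One simple way to read such an ablation is a paired sign test over per-benchmark scores. The sketch below is an illustrative analysis choice, not something the paper reports; the function name and the two-sided exact binomial formulation are assumptions, and no scores from the paper are used.

```python
from math import comb


def sign_test_p_value(full_scores, ablated_scores):
    """Two-sided exact sign test on paired per-benchmark scores.

    Ties are dropped. A small p-value suggests the two variants
    differ consistently across benchmarks.
    """
    if len(full_scores) != len(ablated_scores):
        raise ValueError("score lists must have the same length")
    wins = sum(f > a for f, a in zip(full_scores, ablated_scores))
    losses = sum(f < a for f, a in zip(full_scores, ablated_scores))
    n = wins + losses
    if n == 0:
        return 1.0
    k = min(wins, losses)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)
```

With 11 benchmarks, the smallest attainable two-sided p-value is 2 / 2^11, reached only when one variant wins on every benchmark. A sign test ignores effect sizes, so it complements rather than replaces per-benchmark tables with error bars.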

Figures

Figures reproduced from arXiv: 2604.24191 by Jufeng Yang, Meng Wang, Pengfei Wan, Weicheng Wang, Wentao Gu, Wenyu Qin, Yongjie Zhu, Zhicheng Zhang.

**Figure 1.** Paradigm Comparison. We propose Omni-o3 driven by Think-with-Omni. By embedding omni skills in multi-round deduction, it overcomes direct-response and verbal CoT, elevating shallow verbal thinking to deliberative omnimodal reasoning.

**Figure 2.** Reasoning paradigms. From top to bottom: CoT, BoN, and our nested reasoning. Let V denote the visual input (e.g., a video clip or a set of images) and Q denote the textual query. The goal of a multimodal reasoning model is to generate an optimal answer y* that maximizes the probability P(y|V, Q). We denote the model (e.g., a Large Vision-Language Model) as π_θ, parameterized by θ. We categorize existing…

**Figure 3.** Overall pipeline of the proposed Deep Nested Deduction framework.

**Figure 4.** Automated Data Engine for Curating Deliberative Trajectories.

**Figure 5.** Comprehensive statistics of the Omni-o3 training data.

**Figure 6.** Qualitative visualization of Omni-o3’s deliberative reasoning process.

Original abstract

Omnimodal understanding entails a massive, highly redundant search space of cross-modal interactions, demanding focused and deliberative reasoning. Current reasoning paradigms rely on either sequential step-by-step generation or parallel sample-by-sample rollouts, leading to isolated reasoning trajectories. This inability to share promising intermediate paths severely limits exploration efficiency and causes compounding errors in complex audio-visual tasks. To break this bottleneck, we introduce Omni-o3, a novel framework driven by a deep nested deduction policy. By formulating reasoning as a dynamic recursive search, Omni-o3 inherently shares reasoning prefixes across branches, enabling the iterative execution of four atomic cognitive actions: expansion, selection, simulation, and backpropagation. To empower this framework, we propose a robust two-stage training paradigm: (1) cold-start supervised fine-tuning on 101K high-quality, long-chain trajectories distilled from 3.5M diverse omnimodal samples, enabling necessary recursive search patterns; and (2) nested group rollout-driven exploratory reinforcement learning on 18K complex multi-turn samples, explicitly guided by a novel multi-step reward model to stimulate deep nested reasoning. Extensive experiments demonstrate that Omni-o3 achieves competitive performance across 11 benchmarks, unlocking advanced capabilities in comprehensive audio-visual, visual-centric, and audio-centric reasoning tasks.
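The abstract does not say how rewards within a rollout group are turned into a learning signal. A common choice in group-based reinforcement learning is to normalise each rollout's reward against its group's mean and standard deviation; the sketch below shows that normalisation under the assumption that something similar is used here. The function name and the `eps` stabiliser are illustrative.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    # Centre and scale rewards within one group of rollouts for the same prompt.
    # Assumption: population standard deviation with a small eps to avoid
    # division by zero when all rollouts score the same.
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four rollouts of one prompt.
advantages = group_relative_advantages([1.0, 0.0, 1.0, 0.0])
```

Rollouts that beat their group average receive positive advantages and the rest receive negative ones; a group in which every rollout scores the same yields all zeros and therefore no gradient signal.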

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces Omni-o3, a framework for deliberative omnimodal reasoning that formulates the task as dynamic recursive search with shared reasoning prefixes. It defines four atomic cognitive actions (expansion, selection, simulation, backpropagation) and employs a two-stage training process: cold-start supervised fine-tuning on 101K long-chain trajectories distilled from 3.5M samples, followed by nested group rollout reinforcement learning on 18K complex multi-turn samples guided by a multi-step reward model. The central claim is that this approach achieves competitive performance across 11 benchmarks, enabling advanced audio-visual, visual-centric, and audio-centric reasoning.

Significance. If the empirical claims are substantiated, the recursive prefix-sharing mechanism could meaningfully improve exploration efficiency and reduce compounding errors relative to sequential or parallel rollout baselines in complex multimodal settings. The two-stage training paradigm and explicit multi-step reward model constitute a concrete contribution to deliberative reasoning architectures.

major comments (2)

[Abstract and Experiments] Abstract and Experiments section: The manuscript asserts 'competitive performance across 11 benchmarks' and 'unlocking advanced capabilities' but supplies no quantitative results, baseline comparisons, error bars, ablation studies, or statistical details. This absence renders the central empirical claim unevaluable and load-bearing for the paper's contribution.
[Method] Method section (recursive search formulation): The claim that dynamic recursive search with shared prefixes 'inherently' reduces compounding errors and improves efficiency is presented without formal analysis, pseudocode, or discussion of potential new failure modes introduced by the recursive structure (e.g., prefix contamination or backpropagation instability). This assumption underpins the framework's novelty and requires explicit validation or counter-example analysis.

minor comments (1)

[Abstract and Method] The abstract and method descriptions would benefit from a concise table summarizing the four atomic actions and their inputs/outputs to improve readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and will incorporate the suggested revisions to strengthen the manuscript.

Point-by-point responses

Referee: [Abstract and Experiments] Abstract and Experiments section: The manuscript asserts 'competitive performance across 11 benchmarks' and 'unlocking advanced capabilities' but supplies no quantitative results, baseline comparisons, error bars, ablation studies, or statistical details. This absence renders the central empirical claim unevaluable and load-bearing for the paper's contribution.

Authors: We agree that the current manuscript version does not present the specific quantitative results, baseline comparisons, error bars, ablation studies, or statistical details needed to fully evaluate the empirical claims. In the revised version, we will expand the Experiments section with detailed tables reporting performance metrics on all 11 benchmarks, direct comparisons against relevant baselines, error bars, ablation studies on components such as recursive prefix sharing and the two-stage training, and any available statistical analysis. This will substantiate the claims of competitive performance and make the contribution evaluable. revision: yes

Referee: [Method] Method section (recursive search formulation): The claim that dynamic recursive search with shared prefixes 'inherently' reduces compounding errors and improves efficiency is presented without formal analysis, pseudocode, or discussion of potential new failure modes introduced by the recursive structure (e.g., prefix contamination or backpropagation instability). This assumption underpins the framework's novelty and requires explicit validation or counter-example analysis.

Authors: We acknowledge that the current presentation relies on the design intuition without sufficient formal support. In the revision, we will add pseudocode for the dynamic recursive search procedure, include a formal analysis of how shared reasoning prefixes reduce compounding errors through joint exploration of promising paths, and explicitly discuss potential new failure modes such as prefix contamination and backpropagation instability along with mitigation strategies and supporting observations from our experiments and training trajectories. This will provide the requested validation. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected

The paper introduces an empirical framework for omnimodal reasoning via recursive search with four atomic actions and a two-stage training process (SFT on distilled trajectories followed by RL with a reward model). No equations, derivations, or mathematical reductions are present in the provided text that could equate outputs to inputs by construction. Performance claims rest on experimental benchmarks rather than self-referential definitions, fitted predictions, or load-bearing self-citations. The derivation chain is self-contained as a descriptive architecture plus training recipe.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Review based solely on abstract; no explicit free parameters, axioms, or invented entities are stated. The framework introduces a 'deep nested deduction policy' and 'multi-step reward model' but these are presented as methodological choices rather than new postulated entities with independent evidence.

pith-pipeline@v0.9.0 · 5550 in / 1203 out tokens · 65954 ms · 2026-05-08T04:46:50.571418+00:00 · methodology

[1] [1]

In: arXiv (2023) 4

Achiam, J., Adler, S., Agarwal, S., Ahmad, L., Akkaya, I., Aleman, F.L., Almeida, D., Altenschmidt, J., Altman, S., Anadkat, S., et al.: Gpt-4 technical report. In: arXiv (2023) 4

work page 2023

[2] [2]

In: EACL (2024) 4

Ahn, J., Verma, R., Lou, R., Liu, D., Zhang, R., Yin, W.: Large language models for mathematical reasoning: Progresses and challenges. In: EACL (2024) 4

work page 2024

[3] [3]

In: NeurIPS (2022) 4

Alayrac, J.B., Donahue, J., Luc, P., Miech, A., Barr, I., Hasson, Y., Lenc, K., Mensch, A., Millican, K., Reynolds, M., et al.: Flamingo: a visual language model for few-shot learning. In: NeurIPS (2022) 4

work page 2022

[4] [4]

In: arXiv (2025) 5

Araujo, E., Bhati, S., Mirza, M.J., Rouditchenko, A., Kingsbury, B., Thomas, S., Feris, R., Glass, J.R., Kuehne, H.: AVRT: Audio-visual reasoning transfer through single-modality teachers. In: arXiv (2025) 5

work page 2025

[5] [5]

Ardila, R., Branson, M., Davis, K., Henretty, M., Kohler, M., Meyer, J., Morais, R., Saunders, L., Tyers, F.M., Weber, G.: Common voice: A massively-multilingual speech corpus. arxiv. In: arXiv (2019) 11

work page 2019

[6] [6]

In: arXiv (2023) 4

Bai, J., Bai, S., Yang, S., Wang, S., Tan, S., Wang, P., Lin, J., Zhou, C., Zhou, J.: Qwen-vl: A versatile vision-language model for understanding, localization, text reading, and beyond. In: arXiv (2023) 4

work page 2023

[7] [7]

In: arXiv (2025) 4

Bai, S., Cai, Y., Chen, R., Chen, K., Chen, X., Cheng, Z., Deng, L., Ding, W., Gao, C., Ge, C., et al.: Qwen3-vl technical report. In: arXiv (2025) 4

work page 2025

[8] [8]

In: arXiv (2025) 11

Cao, Y., Min, X., Gao, Y., Sun, W., Zhai, G.: Agav-rater: Adapting large mul- timodal model for ai-generated audio-visual quality assessment. In: arXiv (2025) 11

work page 2025

[9] [9]

In: arXiv (2024) 11

Chen, Y., Yue, X., Zhang, C., Gao, X., Tan, R.T., Li, H.: Voicebench: Benchmark- ing llm-based voice assistants. In: arXiv (2024) 11

work page 2024

[10] [10]

In: arXiv (2024) 4

Chen, Z., Wang, W., Cao, Y., Liu, Y., Gao, Z., Cui, E., Zhu, J., Ye, S., Tian, H., Liu, Z., et al.: Expanding performance boundaries of open-source multimodal models with model, data, and test-time scaling. In: arXiv (2024) 4

work page 2024

[11] [11]

Cheng, J., Ge, Y., Wang, T., Ge, Y., Liao, J., Shan, Y.: Video-holmes: Can mllm think like holmes for complex video reasoning? In: arXiv (2025) 11

work page 2025

[12] [12]

In: arXiv (2025) 11

Cheng, Z., Hu, J.,Liu, Z., Si, C.,Li, W., Gong,S.: V-star: Benchmarking video-llms on video spatio-temporal reasoning. In: arXiv (2025) 11

work page 2025

[13] [13]

In: ICCV (2025) 5 16 Zhang et al

Chowdhury, S., Gani, H., Anand, N., Nag, S., Gao, R., Elhoseiny, M., Khan, S., Manocha, D.: Aurelia: Test-time reasoning distillation in audio-visual llms. In: ICCV (2025) 5 16 Zhang et al

work page 2025

[14] [14]

In: arXiv (2025) 4

Comanici, G., Bieber, E., Schaekermann, M., Pasupat, I., Sachdeva, N., Dhillon, I., Blistein, M., Ram, O., Zhang, D., Rosen, E., et al.: Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. In: arXiv (2025) 4

work page 2025

[15] [15]

In: ICLR (2021) 4

Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An image is worth 16x16 words: Transformers for image recognition at scale. In: ICLR (2021) 4

work page 2021

[16] [16]

In: CVPR (2025) 11

Duan, H., Hu, Q., Wang, J., Yang, L., Xu, Z., Liu, L., Min, X., Cai, C., Ye, T., Zhang, X., et al.: Finevq: Fine-grained user generated content video quality assessment. In: CVPR (2025) 11

[17] Fu, C., Dai, Y., Luo, Y., Li, L., Ren, S., Zhang, R., Wang, Z., Zhou, C., Shen, Y., Zhang, M., et al.: Video-mme: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis. In: CVPR (2025)

[18] Fu, Y., Wang, X., Tian, Y., Zhao, J.: Deep think with confidence. In: arXiv (2025)

[19] Fung, P., Bachrach, Y., Celikyilmaz, A., Chaudhuri, K., Chen, D., Chung, W., Dupoux, E., Gong, H., Jégou, H., Lazaric, A., et al.: Embodied ai agents: Modeling the world. In: arXiv (2025)

[20] Gao, H., Bao, Y., Tu, X., Zhong, B., Yue, L., Zhang, M.: Apvr: Hour-level long video understanding with adaptive pivot visual information retrieval. In: AAAI (2026)

[21] Guo, D., Yang, D., Zhang, H., Song, J., Zhang, R., Xu, R., Zhu, Q., Ma, S., Wang, P., Bi, X., et al.: Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. In: arXiv (2025)

[22] Han, S., Huang, W., Shi, H., Zhuo, L., Su, X., Zhang, S., Zhou, X., Qi, X., Liao, Y., Liu, S.: Videoespresso: A large-scale chain-of-thought dataset for fine-grained video reasoning via core frame selection. In: CVPR (2025)

[23] Hendrycks, D., Song, D., Szegedy, C., Lee, H., Gal, Y., Brynjolfsson, E., Li, S., Zou, A., Levine, L., Han, B., et al.: A definition of agi. In: arXiv (2025)

[24] Hong, J., Yan, S., Cai, J., Jiang, X., Hu, Y., Xie, W.: Worldsense: Evaluating real-world omnimodal understanding for multimodal llms. In: arXiv (2025)

[25] Hong, J., Zhao, C., Zhu, C., Lu, W., Xu, G., Yu, X.: Deepeyesv2: Toward agentic multimodal model. In: ICLR (2026)

[26] Hu, K., Wu, P., Pu, F., Xiao, W., Zhang, Y., Yue, X., Li, B., Liu, Z.: Video-mmmu: Evaluating knowledge acquisition from multi-discipline professional videos. In: arXiv (2025)

[27] Kulkarni, Y., Fazli, P.: Avatar: Reinforcement learning to see, hear, and reason over video. In: CVPR (2026)

[28] Li, C., Wu, W., Zhang, H., Xia, Y., Mao, S., Dong, L., Vulić, I., Wei, F.: Imagine while reasoning in space: Multimodal visualization-of-thought. In: arXiv (2025)

[29] Li, J., Li, D., Savarese, S., Hoi, S.: Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In: ICML (2023)

[30] Liu, Y., Cao, Y., Gao, Z., Wang, W., Chen, Z., Wang, W., Tian, H., Lu, L., Zhu, X., Lu, T., et al.: Mminstruct: A high-quality multi-modal instruction tuning dataset with extensive diversity. SCIS 67(12), 220103 (2024)

[31] Liu, Y., Li, Z., Huang, M., Yang, B., Yu, W., Li, C., Yin, X.C., Liu, C.L., Jin, L., Bai, X.: Ocrbench: on the hidden mystery of ocr in large multimodal models. SCIS 67(12), 220102 (2024)

[32] Mittag, G., Naderi, B., Chehadi, A., Möller, S.: Nisqa: A deep cnn-self-attention model for multidimensional speech quality prediction with crowdsourced datasets. In: arXiv (2021)

[33] Ning, X., Lin, Z., Zhou, Z., Wang, Z., Yang, H., Wang, Y.: Skeleton-of-thought: Prompting llms for efficient parallel generation. In: ICLR (2024)

[34] OpenAI: Introducing openai o3 and o4-mini. https://openai.com/zh-Hans-CN/index/introducing-o3-and-o4-mini/ (2025)

[35] OpenAI: Openai o3-mini. https://openai.com/index/openai-o3-mini (2025)

[36] Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: ICML (2021)

[37] Senocak, A., Oh, T.H., Kim, J., Yang, M.H., Kweon, I.S.: Learning to localize sound source in visual scenes. In: CVPR (2018)

[38] Shao, Z., Wang, P., Zhu, Q., Xu, R., Song, J., Bi, X., Zhang, H., Zhang, M., Li, Y., Wu, Y., et al.: Deepseekmath: Pushing the limits of mathematical reasoning in open language models. In: arXiv (2024)

[39] Shu, Y., Liu, Z., Zhang, P., Qin, M., Zhou, J., Liang, Z., Huang, T., Zhao, B.: Video-xl: Extra-long vision language model for hour-scale video understanding. In: CVPR (2025)

[40] Sigurdsson, G.A., Varol, G., Wang, X., Farhadi, A., Laptev, I., Gupta, A.: Hollywood in homes: Crowdsourcing data collection for activity understanding. In: ECCV (2016)

[41] Spearman, C.: “General intelligence” objectively determined and measured. The American Journal of Psychology 15(2), 201–293 (1904)

[42] Szot, A., Mazoure, B., Attia, O., Timofeev, A., Agrawal, H., Hjelm, D., Gan, Z., Kira, Z., Toshev, A.: From multimodal llms to generalist embodied agents: Methods and lessons. In: CVPR (2025)

[43] Team, Q.: Qvq: To see the world with wisdom. https://qwenlm.github.io/blog/qvq-72b-preview (2024)

[44] Thawakar, O., Dissanayake, D., More, K.P., Thawkar, R., Heakl, A., Ahsan, N., Li, Y., Zumri, I.Z.M., Lahoud, J., Anwer, R.M., et al.: Llamav-o1: Rethinking step-by-step visual reasoning in llms. In: ACL (2025)

[45] Tong, S., Brown, E., Wu, P., Woo, S., Middepogu, M., Akula, S.C., Yang, J., Yang, S., Iyer, A., Pan, X., et al.: Cambrian-1: A fully open, vision-centric exploration of multimodal llms. In: NeurIPS (2024)

[46] Tong, S., Liu, Z., Zhai, Y., Ma, Y., LeCun, Y., Xie, S.: Eyes wide shut? exploring the visual shortcomings of multimodal llms. In: CVPR (2024)

[47] Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: Llama: Open and efficient foundation language models. In: arXiv (2023)

[48] Tyagi, U., Kumar, S., Seth, A., Selvakumar, R., Nieto, O., Duraiswami, R., Ghosh, S., Manocha, D.: Mmau: A massive multi-task audio understanding and reasoning benchmark. In: arXiv (2024)

[49] Wang, D., Wu, J., Li, J., Yang, D., Chen, X., Zhang, T., Meng, H.: Mmsu: A massive multi-task spoken language understanding and reasoning benchmark. In: arXiv (2025)

[50] Wang, S., Yu, W., Chen, X., Tian, X., Zhang, J., Lu, L., Tsao, Y., Yamagishi, J., Wang, Y., Zhang, C.: Qualispeech: A speech quality assessment dataset with natural language reasoning and descriptions. In: ACL (2025)

[51] Wang, W., He, Z., Hong, W., Cheng, Y., Zhang, X., Qi, J., Ding, M., Gu, X., Huang, S., Xu, B., et al.: Lvbench: An extreme long video understanding benchmark. In: ICCV (2025)

[52] Wang, X., Wei, J., Schuurmans, D., Le, Q., Chi, E., Narang, S., Chowdhery, A., Zhou, D.: Self-consistency improves chain of thought reasoning in language models. In: ICLR (2023)

[53] Wang, Y., Wang, Z., Xu, B., Du, Y., Lin, K., Xiao, Z., Yue, Z., Ju, J., Zhang, L., Yang, D., Fang, X., He, Z., Luo, Z., Wang, W., Lin, J., Luan, J., Jin, Q.: Time-r1: Post-training large vision language model for temporal video grounding. In: arXiv (2025)

[54] Wei, J., Wang, X., Schuurmans, D., Bosma, M., Xia, F., Chi, E., Le, Q.V., Zhou, D., et al.: Chain-of-thought prompting elicits reasoning in large language models. In: NeurIPS (2022)

[55] Wu, P., Xie, S.: V*: Guided visual search as a core mechanism in multimodal llms. In: CVPR (2024)

[56] Xiang, V., Snell, C., Gandhi, K., Albalak, A., Singh, A., Blagden, C., Phung, D., Rafailov, R., nathan lile, Mahan, D., Castricato, L., Franken, J.P., Haber, N., Finn, C.: Towards system 2 reasoning in llms: Learning how to think with meta chain-of-thought. In: arXiv (2025)

[57] Xing, Z., Hu, X., Fu, C.W., Wang, W., Dai, J., Heng, P.A.: Echoink-r1: Exploring audio-visual reasoning in multimodal llms via reinforcement learning. In: arXiv (2025)

[58] Xu, G., Jin, P., Hao, L., Song, Y., Sun, L., Yuan, L.: Llava-cot: Let vision language models reason step-by-step. In: arXiv (2024)

[59] Xu, G., Jin, P., Wu, Z., Li, H., Song, Y., Sun, L., Yuan, L.: Llava-cot: Let vision language models reason step-by-step. In: ICCV (2025)

[60] Yang, D., Huang, S., Lu, C., Han, X., Zhang, H., Gao, Y., Hu, Y., Zhao, H.: Vript: A video is worth thousands of words. In: NeurIPS (2024)

[61] Yang, Q., Yao, S., Chen, W., Fu, S., Bai, D., Zhao, J., Sun, B., Yin, B., Wei, X., Zhou, J.: Humanomniv2: From understanding to omni-modal reasoning with context. In: arXiv (2025)

[62] Zhang, R., Zhang, B., Li, Y., Zhang, H., Sun, Z., Gan, Z., Yang, Y., Pang, R., Yang, Y.: Improve vision language model chain-of-thought reasoning. In: arXiv (2024)

[63] Zheng, Z., Yang, M., Hong, J., Zhao, C., Xu, G., Yang, L., Shen, C., Yu, X.: Deepeyes: Incentivizing "thinking with images" via reinforcement learning. In: ICLR (2026)

[64] Zhong, H., Zhu, M., Du, Z., Huang, Z., Zhao, C., Liu, M., Wang, W., Chen, H., Shen, C.: Omni-r1: Reinforcement learning for omnimodal reasoning via two-system collaboration. In: arXiv (2025)

[65] Zhou, D., Zhang, Y., Wu, J., Zhang, X., Xie, L., Yin, E.: Ave speech dataset: A comprehensive benchmark for multi-modal speech recognition integrating audio, visual, and electromyographic signals. In: arXiv (2025)

[66] Zhou, Z., Wang, R., Wu, Z.: Daily-omni: Towards audio-visual reasoning with temporal alignment across modalities. In: arXiv (2025)
