Omni-o3: Deep Nested Omnimodal Deduction for Deliberative Audio-Visual Reasoning
Pith reviewed 2026-05-08 04:46 UTC · model grok-4.3
The pith
Omni-o3 formulates audio-visual reasoning as recursive search that shares promising intermediate paths across branches to reduce errors.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Omni-o3 introduces a deep nested deduction policy that formulates reasoning as dynamic recursive search with shared prefixes across branches. This enables iterative execution of four atomic actions: expansion, selection, simulation, and backpropagation. The policy is learned first by cold-start supervised fine-tuning on 101K high-quality long-chain trajectories distilled from 3.5M omnimodal samples, then by nested group rollout-driven exploratory reinforcement learning on 18K complex multi-turn samples guided by a multi-step reward model. The resulting model reaches competitive performance on 11 benchmarks and unlocks stronger capabilities in comprehensive audio-visual, visual-centric, and audio-centric reasoning.
What carries the argument
Deep nested deduction policy: formulates reasoning as dynamic recursive search with shared prefixes, enabling the iterative execution of four atomic cognitive actions (expansion, selection, simulation, and backpropagation).
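The source gives no pseudocode for this policy (the referee report notes the gap), so the following is only a minimal sketch of a generic Monte Carlo tree search-style loop, assuming the four named actions behave like their classic tree-search counterparts. Everything in it is hypothetical: the `Node` class, the two-step vocabulary `STEPS`, the `TARGET` chain, the toy reward, and the UCB1 selection rule are illustrative choices, not Omni-o3's method.

```python
import math
import random
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple

STEPS = ("a", "b")          # hypothetical candidate reasoning steps
MAX_DEPTH = 3               # hypothetical maximum chain length
TARGET = ("a", "b", "a")    # hypothetical "correct" chain used by the toy reward


@dataclass
class Node:
    prefix: Tuple[str, ...]                      # reasoning prefix shared by all descendants
    parent: Optional["Node"] = None
    children: Dict[str, "Node"] = field(default_factory=dict)
    visits: int = 0
    value_sum: float = 0.0

    def mean_value(self) -> float:
        return self.value_sum / self.visits if self.visits else 0.0


def select(node: Node, c: float = 1.4) -> Node:
    """Selection: follow UCB1 until a node is terminal or has an untried step."""
    while len(node.prefix) < MAX_DEPTH and len(node.children) == len(STEPS):
        parent = node
        node = max(
            parent.children.values(),
            key=lambda ch: ch.mean_value()
            + c * math.sqrt(math.log(parent.visits) / ch.visits),
        )
    return node


def expand(node: Node) -> Node:
    """Expansion: add one untried step as a child that reuses the parent's prefix."""
    if len(node.prefix) >= MAX_DEPTH:
        return node  # terminal node, nothing to expand
    step = next(s for s in STEPS if s not in node.children)
    child = Node(prefix=node.prefix + (step,), parent=node)
    node.children[step] = child
    return child


def simulate(node: Node, rng: random.Random) -> float:
    """Simulation: finish the chain at random and score it with the toy reward."""
    chain = list(node.prefix)
    while len(chain) < MAX_DEPTH:
        chain.append(rng.choice(STEPS))
    return sum(x == y for x, y in zip(chain, TARGET)) / MAX_DEPTH


def backpropagate(node: Optional[Node], reward: float) -> None:
    """Backpropagation: update statistics on every ancestor of the evaluated node."""
    while node is not None:
        node.visits += 1
        node.value_sum += reward
        node = node.parent


def search(iterations: int = 200, seed: int = 0) -> Node:
    rng = random.Random(seed)
    root = Node(prefix=())
    for _ in range(iterations):
        leaf = expand(select(root))
        backpropagate(leaf, simulate(leaf, rng))
    return root


def best_chain(root: Node) -> Tuple[str, ...]:
    """Read out the most-visited path from the root."""
    node = root
    while node.children:
        node = max(node.children.values(), key=lambda ch: ch.visits)
    return node.prefix
```

Calling `best_chain(search())` returns a chain of length `MAX_DEPTH`. The point of the sketch is structural: each child stores its parent's prefix plus one step, so statistics gathered under one branch inform every sibling that shares that prefix.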
If this is right
- Reasoning trajectories can reuse promising intermediate paths instead of remaining isolated, raising exploration efficiency in large cross-modal search spaces.
- Compounding errors decrease because backpropagation can correct earlier branches without restarting entire sequences.
- The framework supports more deliberative handling of complex multi-turn audio-visual interactions than sequential or parallel baselines.
- Two-stage training first installs recursive search patterns then applies targeted reinforcement to deepen them.
- Competitive results appear across comprehensive audio-visual, visual-centric, and audio-centric reasoning benchmarks.
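The efficiency argument in the first bullet can be illustrated with simple counting. The sketch below is not a measurement from the paper; it assumes each reasoning step has unit generation cost and uses hypothetical step labels. It compares the number of steps generated when trajectories are produced in isolation with the number generated when common prefixes are produced once and reused.

```python
from typing import Sequence, Set, Tuple

Trajectory = Tuple[str, ...]


def isolated_cost(trajectories: Sequence[Trajectory]) -> int:
    """Steps generated when every trajectory is rolled out from scratch."""
    return sum(len(t) for t in trajectories)


def shared_prefix_cost(trajectories: Sequence[Trajectory]) -> int:
    """Steps generated when each distinct prefix is produced only once."""
    seen: Set[Trajectory] = set()
    for t in trajectories:
        for i in range(1, len(t) + 1):
            seen.add(tuple(t[:i]))
    return len(seen)


# Hypothetical trajectories: the first two share a two-step prefix.
trajectories = [
    ("locate_speaker", "align_audio", "answer_A"),
    ("locate_speaker", "align_audio", "answer_B"),
    ("locate_speaker", "read_caption", "answer_A"),
]

print(isolated_cost(trajectories))       # 9
print(shared_prefix_cost(trajectories))  # 6
```

The saving grows with the amount of overlap between branches. With no shared prefixes the two counts are equal, so any benefit depends on how often promising partial paths actually recur.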
Where Pith is reading between the lines
- The same prefix-sharing mechanism might improve efficiency in text-only or other single-modality reasoning settings where search spaces are also large.
- The recursive structure connects naturally to classic tree-search ideas that avoid redundant computation by caching common prefixes.
- Further increasing the allowed depth of nesting could be tested on longer multi-turn sequences to measure where new failure modes appear.
- The approach suggests that reward models focused on multi-step progress may be more effective than single-step rewards for training deliberative multimodal systems.
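The last point, on multi-step versus single-step rewards, can be made concrete with a toy contrast. The source does not describe the form of the multi-step reward model, so the blend below is purely an assumption: `multi_step_reward`, its `step_weight` parameter, and the per-step scores are hypothetical and stand in for whatever the paper actually uses.

```python
from typing import Sequence


def outcome_reward(final_correct: bool) -> float:
    """Single-step reward: only the final answer is scored."""
    return 1.0 if final_correct else 0.0


def multi_step_reward(
    step_scores: Sequence[float],
    final_correct: bool,
    step_weight: float = 0.5,  # hypothetical mixing weight
) -> float:
    """Blend the mean per-step score with the outcome reward."""
    if not step_scores:
        return outcome_reward(final_correct)
    process = sum(step_scores) / len(step_scores)
    return step_weight * process + (1.0 - step_weight) * outcome_reward(final_correct)


# Two hypothetical trajectories that both reach the right answer.
clean = [1.0, 1.0]    # every intermediate step judged sound
lucky = [0.0, 1.0]    # first step judged faulty, answer still correct

print(outcome_reward(True), outcome_reward(True))                     # 1.0 1.0
print(multi_step_reward(clean, True), multi_step_reward(lucky, True)) # 1.0 0.75
```

Under the outcome-only reward the two trajectories are indistinguishable, while the blended reward ranks the clean chain higher. Whether that distinction helps in practice is the empirical question the bullet raises.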
Load-bearing premise
The framework assumes that formulating reasoning as dynamic recursive search with shared prefixes inherently reduces compounding errors and improves exploration efficiency in omnimodal tasks, and that the recursive structure introduces no new failure modes of its own.
What would settle it
An ablation study in which the prefix-sharing mechanism is removed while retaining all other components, followed by re-evaluation on the same 11 benchmarks; if performance stays the same or improves, the central claim about shared prefixes would be falsified.
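A comparison of this kind could be summarized per benchmark with paired differences and a sign test. The sketch below is one hedged way to do that; the benchmark names and scores are placeholders invented for illustration, not results from the paper, and the sign test is a choice made here rather than an analysis the authors report.

```python
import math
from typing import Dict, Iterable, Tuple


def paired_differences(scores: Dict[str, Tuple[float, float]]) -> Dict[str, float]:
    """Map each benchmark to (full model score) minus (ablated model score)."""
    return {name: full - ablated for name, (full, ablated) in scores.items()}


def sign_test_p(diffs: Iterable[float]) -> float:
    """Two-sided exact sign test on paired differences, ignoring ties."""
    diffs = list(diffs)
    wins = sum(d > 0 for d in diffs)
    losses = sum(d < 0 for d in diffs)
    n = wins + losses
    if n == 0:
        return 1.0
    k = max(wins, losses)
    tail = sum(math.comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Placeholder numbers only: (with prefix sharing, without prefix sharing).
placeholder_scores = {
    "benchmark_A": (62.0, 60.0),
    "benchmark_B": (55.0, 55.5),
    "benchmark_C": (70.0, 67.0),
}

diffs = paired_differences(placeholder_scores)
mean_diff = sum(diffs.values()) / len(diffs)
print(mean_diff)                      # 1.5
print(sign_test_p(diffs.values()))    # 1.0
```

With 11 benchmarks, the smallest two-sided p-value this test can produce is 2 / 2^11 = 1/1024, reached only when one variant wins on every benchmark, so a sign test alone is a coarse instrument and per-benchmark error bars would still be needed.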
Original abstract
Omnimodal understanding entails a massive, highly redundant search space of cross-modal interactions, demanding focused and deliberative reasoning. Current reasoning paradigms rely on either sequential step-by-step generation or parallel sample-by-sample rollouts, leading to isolated reasoning trajectories. This inability to share promising intermediate paths severely limits exploration efficiency and causes compounding errors in complex audio-visual tasks. To break this bottleneck, we introduce Omni-o3, a novel framework driven by a deep nested deduction policy. By formulating reasoning as a dynamic recursive search, Omni-o3 inherently shares reasoning prefixes across branches, enabling the iterative execution of four atomic cognitive actions: expansion, selection, simulation, and backpropagation. To empower this framework, we propose a robust two-stage training paradigm: (1) cold-start supervised fine-tuning on 101K high-quality, long-chain trajectories distilled from 3.5M diverse omnimodal samples, enabling necessary recursive search patterns; and (2) nested group rollout-driven exploratory reinforcement learning on 18K complex multi-turn samples, explicitly guided by a novel multi-step reward model to stimulate deep nested reasoning. Extensive experiments demonstrate that Omni-o3 achieves competitive performance across 11 benchmarks, unlocking advanced capabilities in comprehensive audio-visual, visual-centric, and audio-centric reasoning tasks.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces Omni-o3, a framework for deliberative omnimodal reasoning that formulates the task as dynamic recursive search with shared reasoning prefixes. It defines four atomic cognitive actions (expansion, selection, simulation, backpropagation) and employs a two-stage training process: cold-start supervised fine-tuning on 101K long-chain trajectories distilled from 3.5M samples, followed by nested group rollout reinforcement learning on 18K complex multi-turn samples guided by a multi-step reward model. The central claim is that this approach achieves competitive performance across 11 benchmarks, enabling advanced audio-visual, visual-centric, and audio-centric reasoning.
Significance. If the empirical claims are substantiated, the recursive prefix-sharing mechanism could meaningfully improve exploration efficiency and reduce compounding errors relative to sequential or parallel rollout baselines in complex multimodal settings. The two-stage training paradigm and explicit multi-step reward model constitute a concrete contribution to deliberative reasoning architectures.
major comments (2)
- [Abstract and Experiments] Abstract and Experiments section: The manuscript asserts 'competitive performance across 11 benchmarks' and 'unlocking advanced capabilities' but supplies no quantitative results, baseline comparisons, error bars, ablation studies, or statistical details. This absence leaves the central empirical claim, which is load-bearing for the paper's contribution, impossible to evaluate.
- [Method] Method section (recursive search formulation): The claim that dynamic recursive search with shared prefixes 'inherently' reduces compounding errors and improves efficiency is presented without formal analysis, pseudocode, or discussion of potential new failure modes introduced by the recursive structure (e.g., prefix contamination or backpropagation instability). This assumption underpins the framework's novelty and requires explicit validation or counter-example analysis.
minor comments (1)
- [Abstract and Method] The abstract and method descriptions would benefit from a concise table summarizing the four atomic actions and their inputs/outputs to improve readability.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment below and will incorporate the suggested revisions to strengthen the manuscript.
Referee: [Abstract and Experiments] Abstract and Experiments section: The manuscript asserts 'competitive performance across 11 benchmarks' and 'unlocking advanced capabilities' but supplies no quantitative results, baseline comparisons, error bars, ablation studies, or statistical details. This absence leaves the central empirical claim, which is load-bearing for the paper's contribution, impossible to evaluate.
Authors: We agree that the current manuscript version does not present the specific quantitative results, baseline comparisons, error bars, ablation studies, or statistical details needed to fully evaluate the empirical claims. In the revised version, we will expand the Experiments section with detailed tables reporting performance metrics on all 11 benchmarks, direct comparisons against relevant baselines, error bars, ablation studies on components such as recursive prefix sharing and the two-stage training, and any available statistical analysis. This will substantiate the claims of competitive performance and make the contribution evaluable.
Revision: yes
Referee: [Method] Method section (recursive search formulation): The claim that dynamic recursive search with shared prefixes 'inherently' reduces compounding errors and improves efficiency is presented without formal analysis, pseudocode, or discussion of potential new failure modes introduced by the recursive structure (e.g., prefix contamination or backpropagation instability). This assumption underpins the framework's novelty and requires explicit validation or counter-example analysis.
Authors: We acknowledge that the current presentation relies on the design intuition without sufficient formal support. In the revision, we will add pseudocode for the dynamic recursive search procedure, include a formal analysis of how shared reasoning prefixes reduce compounding errors through joint exploration of promising paths, and explicitly discuss potential new failure modes such as prefix contamination and backpropagation instability along with mitigation strategies and supporting observations from our experiments and training trajectories. This will provide the requested validation.
Revision: yes
Circularity Check
No significant circularity detected
The paper introduces an empirical framework for omnimodal reasoning via recursive search with four atomic actions and a two-stage training process (SFT on distilled trajectories followed by RL with a reward model). No equations, derivations, or mathematical reductions are present in the provided text that could equate outputs to inputs by construction. Performance claims rest on experimental benchmarks rather than self-referential definitions, fitted predictions, or load-bearing self-citations. The derivation chain is self-contained as a descriptive architecture plus training recipe.