arxiv: 2605.22012 · v1 · pith:TQDOVCMHnew · submitted 2026-05-21 · 💻 cs.CL · cs.CV

LatentOmni: Rethinking Omni-Modal Understanding via Unified Audio-Visual Latent Reasoning

Pith reviewed 2026-05-22 06:31 UTC · model grok-4.3

classification 💻 cs.CL cs.CV
keywords omnimodal understanding · latent reasoning · audio-visual reasoning · chain-of-thought · multimodal large language models · feature supervision · position embedding · joint reasoning

The pith

LatentOmni shows that interleaving text with shared audio-visual latent states outperforms explicit text chain-of-thought for joint reasoning tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that text-based chain-of-thought weakens audio-visual reasoning by turning continuous signals into discrete language tokens that lose temporal detail and lean on language priors. It proposes instead that a unified latent space can keep dense sensory information intact while still supporting autoregressive generation. LatentOmni puts this into practice by alternating textual steps with audio-visual latent states, supervising those states at the feature level, and adding Omni-Sync Position Embedding to keep audio and visual latents temporally aligned. The approach is trained on a new set of 35K interleaved audio-visual reasoning trajectories. Benchmark results show it leading the evaluated open-source models and beating the text CoT baseline, which suggests that latent-space joint reasoning offers a viable route to stronger omnimodal models.

Core claim

LatentOmni is a cross-modal reasoning framework that interleaves textual reasoning with audio-visual latent states. Feature-level supervision aligns the latent reasoning states with task-relevant sensory features, while Omni-Sync Position Embedding maintains temporal consistency between the audio and visual latent sequences. Trained on the LatentOmni-Instruct-35K dataset of audio-visual interleaved trajectories, the model achieves the best performance among the evaluated open-source models and consistently outperforms the Explicit Text CoT baseline on multiple audio-visual reasoning benchmarks.

What carries the argument

Interleaved textual reasoning steps with aligned audio-visual latent states, using feature-level supervision and Omni-Sync Position Embedding (OSPE) to preserve sensory detail and temporal consistency.
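To make the interleaving concrete, here is a minimal sketch of how a training objective could combine token-level supervision on text steps with feature-level supervision on latent steps. The paper's actual loss is not given on this page, so everything here is an assumption for illustration: the cosine-distance alignment term, the `align_weight` coefficient, and the function names are hypothetical.

```python
import math


def cross_entropy(probs, target):
    """Negative log-likelihood of the target token under a probability vector."""
    return -math.log(probs[target])


def cosine_distance(u, v):
    """1 - cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def interleaved_loss(steps, align_weight=1.0):
    """Mean text loss plus weighted mean alignment loss (hypothetical form).

    Each step is one of:
      ("text", probs, target_id)            -> token-level cross-entropy
      ("latent", predicted, target_feature) -> feature-level alignment term
    """
    text_terms, latent_terms = [], []
    for kind, prediction, target in steps:
        if kind == "text":
            text_terms.append(cross_entropy(prediction, target))
        else:
            latent_terms.append(cosine_distance(prediction, target))
    # max(..., 1) avoids division by zero when one step type is absent.
    text_loss = sum(text_terms) / max(len(text_terms), 1)
    latent_loss = sum(latent_terms) / max(len(latent_terms), 1)
    return text_loss + align_weight * latent_loss
```

In this sketch a latent step that exactly matches its target feature contributes nothing to the loss, so only the text steps are penalized; other distance functions (for example mean squared error) would fit the same structure.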

If this is right

  • Feature-level supervision can directly tie intermediate reasoning states to raw sensory features instead of relying on language summaries.
  • Omni-Sync Position Embedding can enforce temporal alignment between separate audio and visual latent streams during generation.
  • Interleaved latent trajectories allow models to keep continuous timing information that text tokens normally discard.
  • Training on audio-visual interleaved reasoning data transfers to better performance on downstream joint reasoning benchmarks.
  • Latent-space reasoning remains compatible with standard autoregressive decoding while reducing compression loss.
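The temporal-alignment idea in the second bullet can be illustrated with a small sketch. The page does not give the OSPE formulation (the referee report notes the lack of an explicit equation), so this is not the authors' method. It assumes one plausible scheme in which audio and visual latents receive temporal position ids from a shared clock; `synced_position_ids` and `seconds_per_step` are hypothetical names.

```python
def synced_position_ids(start_pos, visual_times, audio_times, seconds_per_step=0.5):
    """Map timestamps (seconds) to integer temporal position ids on a shared clock.

    Assumption for illustration: latents from different modalities that fall
    in the same time bin receive the same id, regardless of each stream's rate.
    """
    def to_id(timestamp):
        return start_pos + int(timestamp // seconds_per_step)

    visual_ids = [to_id(t) for t in visual_times]
    audio_ids = [to_id(t) for t in audio_times]
    return visual_ids, audio_ids


# Video latents every 1.0 s, audio latents every 0.5 s, both starting at position 10.
visual_ids, audio_ids = synced_position_ids(
    10, [0.0, 1.0, 2.0], [0.0, 0.5, 1.0, 1.5, 2.0]
)
print(visual_ids)  # [10, 12, 14]
print(audio_ids)   # [10, 11, 12, 13, 14]
```

Under this assumption, the visual latent at 1.0 s and the audio latent at 1.0 s share id 12 even though the two streams have different lengths, which is the kind of cross-stream consistency the bullet describes.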

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same interleaving pattern could be tested on additional modalities such as depth or thermal data if comparable latent encoders exist.
  • Reducing dependence on language priors might improve robustness in domains where language cues are sparse or misleading.
  • Scaling the latent supervision to longer sequences could reveal whether the temporal consistency benefits persist at larger context lengths.

Load-bearing premise

A single unified latent space can preserve enough dense audio-visual information to support accurate joint reasoning while remaining compatible with autoregressive text generation.

What would settle it

An experiment in which an explicit text CoT baseline, given equal training data and compute, matches or exceeds LatentOmni's accuracy on the same benchmarks would indicate that the latent-space advantage does not hold.
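One way to judge whether two systems "match" on a shared benchmark is a paired bootstrap over per-example correctness. The sketch below is a generic statistical procedure, not something described in the paper; the function name, the 95% percentile interval, and the resample count are illustrative choices.

```python
import random


def paired_bootstrap(correct_a, correct_b, n_resamples=2000, seed=0):
    """Observed accuracy difference (A - B) and a 95% percentile bootstrap interval.

    correct_a and correct_b are 0/1 lists over the same examples in the same order.
    """
    assert len(correct_a) == len(correct_b) and correct_a
    n = len(correct_a)
    diffs = [a - b for a, b in zip(correct_a, correct_b)]
    observed = sum(diffs) / n

    rng = random.Random(seed)  # fixed seed keeps the interval reproducible
    samples = []
    for _ in range(n_resamples):
        resample = [diffs[rng.randrange(n)] for _ in range(n)]
        samples.append(sum(resample) / n)
    samples.sort()

    lower = samples[int(0.025 * n_resamples)]
    upper = samples[int(0.975 * n_resamples) - 1]
    return observed, (lower, upper)
```

If the interval for (latent minus text CoT) contains zero or lies below it under matched data and compute, that would count against the latent-space advantage; an interval clearly above zero would support it.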

Figures

Figures reproduced from arXiv: 2605.22012 by Bingyin Mei, Bohan Zeng, Bozhou Li, Chengzhuo Tong, Daili Hua, Fangcheng Fu, Hao Liang, Jialing Liu, Junbo Niu, Pengfei Wan, Tianyu Guo, Wentao Zhang, Xiaochen Ma, Yang Shi, Yifan Dai, Yiyan Ji, Yuanxing Zhang, Yue Ding, Yuran Wang, Yushuo Guan, Zhenhua Wu.

Figure 1: Comparison between LatentOmni and the Explicit Text CoT baseline (detailed in §4.1).
Figure 2: Overview of LatentOmni. Left: the model alternates between textual generation and latent …
Figure 3: Construction pipeline of LatentOmni-Instruct-35K.
Figure 4: Ablation studies on latent configurations across three benchmarks, specifically evaluating …
Figure 5: Prompt used to synthesize a single complex, open-ended AVQA pair from temporally …
Figure 6: Prompt used to synthesize a single complex, multiple-choice AVQA pair with grounded …
Figure 7: Prompt used to evaluate the intrinsic quality and modality dependency of synthesized …
Figure 8: Prompt used to synthesize a concise video caption focusing exclusively on observable …
Figure 9: Prompt used to synthesize a concise audio caption detailing identifiable sounds and speech …
Figure 10: Prompt used to fuse the segment-level video caption and audio caption.
Figure 11: Prompt used to refine fragmented segment captions by cross-referencing full captions to …
Figure 12: Prompt used to synthesize interleaved reasoning trajectories with explicit segment citations …
Figure 13: LatentOmni example: AV Event Alignment. LatentOmni accurately anchors task-relevant …
Figure 14: LatentOmni example: Inference. LatentOmni accurately anchors task-relevant audio …
Figure 15: LatentOmni example: Reasoning. LatentOmni accurately anchors task-relevant audio …
Original abstract

Joint audio-visual reasoning is essential for omnimodal understanding, yet current multimodal large language models (MLLMs) still struggle when reasoning requires fine-grained evidence from both modalities. A central limitation is that explicit text-based chain-of-thought (CoT) compresses continuous audio-visual signals into discrete tokens, weakening temporal grounding and shifting intermediate reasoning toward language priors. We argue that a unified latent space is a better medium for such reasoning because it preserves dense sensory information while remaining compatible with autoregressive generation. Based on this insight, we propose LatentOmni, a cross-modal reasoning framework that interleaves textual reasoning with audio-visual latent states. LatentOmni introduces feature-level supervision to align latent reasoning states with task-relevant sensory features and uses Omni-Sync Position Embedding (OSPE) to maintain temporal consistency between latent audio and visual states. We further construct LatentOmni-Instruct-35K, a dataset of audio-visual interleaved reasoning trajectories for supervising latent-space reasoning. Comprehensive evaluation across multiple audio-visual reasoning benchmarks demonstrates that LatentOmni achieves the best performance among the evaluated open-source models and consistently outperforms the Explicit Text CoT baseline, supporting latent-space joint reasoning as a promising path toward stronger omnimodal understanding.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes LatentOmni, a cross-modal reasoning framework for omnimodal understanding that interleaves textual reasoning steps with audio-visual latent states in a unified latent space. It introduces feature-level supervision to align latent states with sensory features, Omni-Sync Position Embedding (OSPE) for temporal consistency, and the LatentOmni-Instruct-35K dataset of interleaved reasoning trajectories. The central claim is that this latent-space approach outperforms explicit text-based chain-of-thought (CoT) by preserving dense sensory information, with comprehensive evaluations showing best-in-class results among open-source models on audio-visual reasoning benchmarks.

Significance. If the gains are shown to arise specifically from latent-space joint reasoning rather than dataset or supervision changes, the work would provide a concrete alternative to text CoT compression in MLLMs and support the hypothesis that unified latent spaces better maintain temporal and sensory grounding for tasks requiring fine-grained audio-visual evidence.

major comments (2)
  1. [Evaluation] Evaluation section: The claim that LatentOmni 'consistently outperforms the Explicit Text CoT baseline' and achieves best performance among open-source models is presented without component ablations that hold the LatentOmni-Instruct-35K dataset and auxiliary alignment loss fixed while varying only the reasoning representation (latent states vs. explicit text). This leaves open the possibility that observed improvements are driven by the new data and supervision rather than the proposed latent reasoning mechanism.
  2. [Abstract and §3] Abstract and §3 (framework description): The assertion that a unified latent space 'preserves dense sensory information while remaining compatible with autoregressive generation' is central to the motivation, yet the manuscript does not quantify information loss in the text CoT baseline (e.g., via reconstruction metrics or attention analysis) nor demonstrate that OSPE and feature-level supervision are the minimal additions needed to realize this preservation.
minor comments (2)
  1. [§3] Notation for Omni-Sync Position Embedding (OSPE) is introduced without an explicit equation or pseudocode showing how it synchronizes audio and visual latent positions relative to textual tokens.
  2. [Dataset section] The construction details of LatentOmni-Instruct-35K (trajectory generation process, quality filtering, and split statistics) are referenced but would benefit from a dedicated table or appendix for reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback, which highlights important ways to strengthen the isolation of our core contribution. We address each major comment below and outline the revisions we will make.

Point-by-point responses
  1. Referee: [Evaluation] Evaluation section: The claim that LatentOmni 'consistently outperforms the Explicit Text CoT baseline' and achieves best performance among open-source models is presented without component ablations that hold the LatentOmni-Instruct-35K dataset and auxiliary alignment loss fixed while varying only the reasoning representation (latent states vs. explicit text). This leaves open the possibility that observed improvements are driven by the new data and supervision rather than the proposed latent reasoning mechanism.

    Authors: We agree that a more tightly controlled ablation is needed to isolate the effect of latent versus explicit-text reasoning. In the current experiments the Explicit Text CoT baseline is trained on the same LatentOmni-Instruct-35K trajectories (converted to text) but without the auxiliary alignment loss applied to latent states. In the revision we will add a new ablation that trains a text-based model on the identical dataset while adapting the auxiliary alignment loss to operate on text embeddings, thereby holding both data and supervision fixed and varying only the reasoning representation. Results will be reported in Section 4. revision: yes

  2. Referee: [Abstract and §3] Abstract and §3 (framework description): The assertion that a unified latent space 'preserves dense sensory information while remaining compatible with autoregressive generation' is central to the motivation, yet the manuscript does not quantify information loss in the text CoT baseline (e.g., via reconstruction metrics or attention analysis) nor demonstrate that OSPE and feature-level supervision are the minimal additions needed to realize this preservation.

    Authors: We acknowledge that direct quantification of information preservation would strengthen the motivational claim. Although performance gains on tasks that require fine-grained audio-visual evidence provide indirect support, we will add explicit analysis in the revision: reconstruction error of sensory features from the intermediate reasoning states for both the latent and text-CoT paths, together with attention-map comparisons. Regarding minimality, our existing ablations already isolate the contribution of OSPE and feature-level supervision; we will expand the discussion in §3 to present these components as the key enablers identified through ablation rather than asserting they constitute the absolute minimal set. revision: partial

Circularity Check

0 steps flagged

No circularity: empirical results independent of mechanism

Full rationale

The paper advances an insight that unified latent space preserves dense sensory information better than explicit text CoT, then introduces a new framework (LatentOmni with feature-level supervision and OSPE), constructs a new dataset (LatentOmni-Instruct-35K), and reports benchmark performance gains over baselines. No equations, fitted parameters renamed as predictions, or self-citation chains are present in the provided text that would make the central performance claim reduce by construction to prior inputs. The evaluation is presented as external validation rather than tautological output of the model's own definitions or fits. This is the common case of an empirical proposal whose claims rest on new measurements rather than definitional equivalence.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 2 invented entities

The paper rests on the domain assumption that latent representations preserve more useful information than text tokens for cross-modal reasoning and introduces new technical components without external validation of those components.

axioms (1)
  • Domain assumption: A unified latent space preserves dense sensory information better than discrete text tokens for joint audio-visual reasoning.
    Explicitly stated as the central insight motivating the framework in the abstract.
invented entities (2)
  • Omni-Sync Position Embedding (OSPE) · no independent evidence
    purpose: Maintain temporal consistency between latent audio and visual states during interleaved reasoning.
    Newly proposed component described in the abstract as part of the framework.
  • LatentOmni-Instruct-35K · no independent evidence
    purpose: Provide audio-visual interleaved reasoning trajectories to supervise latent-space reasoning.
    Dataset constructed specifically for training the proposed method.

pith-pipeline@v0.9.0 · 5829 in / 1432 out tokens · 52133 ms · 2026-05-22T06:31:41.462254+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

76 extracted references · 76 canonical work pages · 23 internal anchors

  1. [1]

    WorldSense: Evaluating Real-world Omnimodal Understanding for Multimodal LLMs

    J. Hong, S. Yan, J. Cai, X. Jiang, Y . Hu, and W. Xie, “Worldsense: Evaluating real-world omnimodal understanding for multimodal llms,”arXiv preprint arXiv:2502.04326, 2025

  2. [2]

    Mlvu: Benchmarking multi-task long video understanding

    Z. Zhou, R. Wang, Z. Wu, and Y .-G. Jiang, “Daily-omni: Towards audio-visual reasoning with temporal alignment across modalities,”arXiv preprint arXiv:2505.17862, 2025

  3. [3]

    Deep audio-visual learning: A survey,

    H. Zhu, M.-D. Luo, R. Wang, A.-H. Zheng, and R. He, “Deep audio-visual learning: A survey,”Interna- tional Journal of Automation and Computing, vol. 18, no. 3, pp. 351–376, 2021

  4. [4]

    Multimodal machine learning: A survey and taxonomy,

    T. Baltrušaitis, C. Ahuja, and L.-P. Morency, “Multimodal machine learning: A survey and taxonomy,” IEEE transactions on pattern analysis and machine intelligence, vol. 41, no. 2, pp. 423–443, 2018

  5. [5]

    Debiasing multimodal large language models via penalization of language priors,

    Y . Zhang, Y . Shi, W. Yu, Q. Wen, X. Wang, W. Yang, Z. Zhang, L. Wang, and R. Jin, “Debiasing multimodal large language models via penalization of language priors,” inProceedings of the 33rd ACM International Conference on Multimedia, 2025, pp. 4232–4241

  6. [6]

    A survey on multimodal large language models,

    S. Yin, C. Fu, S. Zhao, K. Li, X. Sun, T. Xu, and E. Chen, “A survey on multimodal large language models,” National Science Review, vol. 11, no. 12, p. nwae403, 2024

  7. [7]

    Avocado: An audiovisual video captioner driven by temporal orchestration,

    X. Chen, Y . Ding, W. Lin, J. Hua, L. Yao, Y . Shi, B. Li, Y . Zhang, Q. Liu, P. Wanet al., “Avocado: An audiovisual video captioner driven by temporal orchestration,”arXiv preprint arXiv:2510.10395, 2025

  8. [8]

    Mm-rlhf: The next step forward in multimodal llm alignment,

    Y .-F. Zhang, T. Yu, H. Tian, C. Fu, P. Li, J. Zeng, W. Xie, Y . Shi, H. Zhang, J. Wuet al., “Mm-rlhf: The next step forward in multimodal llm alignment,”arXiv preprint arXiv:2502.10391, 2025

  9. [9]

    Diadem: Advancing dialogue descriptions in audiovisual video captioning for multimodal large language models,

    X. Chen, W. Lin, J. Hua, L. Yao, Y . Ding, B. Li, B. Zeng, Y . Shi, Q. Liu, Y . Zhanget al., “Diadem: Advancing dialogue descriptions in audiovisual video captioning for multimodal large language models,” arXiv preprint arXiv:2601.19267, 2026

  10. [10]

    Mme- videoocr: Evaluating ocr-based capabilities of multimodal llms in video scenarios,

    Y . Shi, H. Wang, W. Xie, H. Zhang, L. Zhao, Y .-F. Zhang, X. Li, C. Fu, Z. Wen, W. Liuet al., “Mme- videoocr: Evaluating ocr-based capabilities of multimodal llms in video scenarios,”arXiv preprint arXiv:2505.21333, 2025

  11. [11]

    OmniSIFT: Modality-Asymmetric Token Compression for Efficient Omni-modal Large Language Models

    Y . Ding, Y . Ji, J. Li, X. Liu, X. Chen, J. Wu, B. Li, B. Zeng, Y . Shi, Y . Guanet al., “Omnisift: Modality-asymmetric token compression for efficient omni-modal large language models,”arXiv preprint arXiv:2602.04804, 2026

  12. [12]

    Crosslmm: Decoupling long video sequences from lmms via dual cross-attention mechanisms,

    S. Yan, J. Han, J. Tsai, H. Xue, R. Fang, L. Hong, Z. Guo, and R. Zhang, “Crosslmm: Decoupling long video sequences from lmms via dual cross-attention mechanisms,”arXiv preprint arXiv:2505.17020, 2025

  13. [13]

    Omnivideobench: Towards audio-visual understanding evaluation for omni mllms,

    C. Li, Y . Chen, Y . Ji, J. Xu, Z. Cui, S. Li, Y . Zhang, W. Wang, Z. Song, D. Zhanget al., “Omnivideobench: Towards audio-visual understanding evaluation for omni mllms,”arXiv preprint arXiv:2510.10689, 2025

  14. [14]

    Echoink-r1: Exploring audio-visual reasoning in multimodal llms via reinforcement learning,

    Z. Xing, X. Hu, C.-W. Fu, W. Wang, J. Dai, and P.-A. Heng, “Echoink-r1: Exploring audio-visual reasoning in multimodal llms via reinforcement learning,”arXiv preprint arXiv:2505.04623, 2025

  15. [15]

    Chain-of-thought prompting elicits reasoning in large language models,

    J. Wei, X. Wang, D. Schuurmans, M. Bosma, F. Xia, E. Chi, Q. V . Le, D. Zhouet al., “Chain-of-thought prompting elicits reasoning in large language models,”Advances in neural information processing systems, vol. 35, pp. 24 824–24 837, 2022

  16. [16]

    Multimodal Chain-of-Thought Reasoning: A Comprehensive Survey

    Y . Wang, S. Wu, Y . Zhang, S. Yan, Z. Liu, J. Luo, and H. Fei, “Multimodal chain-of-thought reasoning: A comprehensive survey,”arXiv preprint arXiv:2503.12605, 2025

  17. [17]

    Visual cot: Advancing multi- modal language models with a comprehensive dataset and benchmark for chain-of-thought reasoning,

    H. Shao, S. Qian, H. Xiao, G. Song, Z. Zong, L. Wang, Y . Liu, and H. Li, “Visual cot: Advancing multi- modal language models with a comprehensive dataset and benchmark for chain-of-thought reasoning,” Advances in Neural Information Processing Systems, vol. 37, pp. 8612–8642, 2024

  18. [18]

    Multimodal Chain-of-Thought Reasoning in Language Models

    Z. Zhang, A. Zhang, M. Li, H. Zhao, G. Karypis, and A. Smola, “Multimodal chain-of-thought reasoning in language models,”arXiv preprint arXiv:2302.00923, 2023

  19. [19]

    Qwen2.5-Omni Technical Report

    J. Xu, Z. Guo, J. He, H. Hu, T. He, S. Bai, K. Chen, J. Wang, Y . Fan, K. Danget al., “Qwen2. 5-omni technical report,”arXiv preprint arXiv:2503.20215, 2025

  20. [20]

    Imagebind: One embedding space to bind them all,

    R. Girdhar, A. El-Nouby, Z. Liu, M. Singh, K. V . Alwala, A. Joulin, and I. Misra, “Imagebind: One embedding space to bind them all,” inProceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2023, pp. 15 180–15 190. 10

  21. [21]

    Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models,

    J. Li, D. Li, S. Savarese, and S. Hoi, “Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models,” inInternational conference on machine learning. PMLR, 2023, pp. 19 730–19 742

  22. [22]

    Mavors: Multi-granularity video representation for multimodal large language model,

    Y . Shi, J. Liu, Y . Guan, Z. Wu, Y . Zhang, Z. Wang, W. Lin, J. Hua, Z. Wang, X. Chenet al., “Mavors: Multi-granularity video representation for multimodal large language model,” inProceedings of the 33rd ACM International Conference on Multimedia, 2025, pp. 10 994–11 003

  23. [23]

    Scone: Bridging Composition and Distinction in Subject-Driven Image Generation via Unified Understanding-Generation Modeling

    Y . Wang, B. Zeng, C. Tong, W. Liu, Y . Shi, X. Ma, H. Liang, Y . Zhang, and W. Zhang, “Scone: Bridging composition and distinction in subject-driven image generation via unified understanding-generation modeling,”arXiv preprint arXiv:2512.12675, 2025

  24. [24]

    Audio-reasoner: Improving reasoning capability in large audio language models,

    Z. Xie, M. Lin, Z. Liu, P. Wu, S. Yan, and C. Miao, “Audio-reasoner: Improving reasoning capability in large audio language models,”arXiv preprint arXiv:2503.02318, 2025

  25. [25]

    Audio-cot: Exploring chain-of-thought reasoning in large audio language model,

    Z. Ma, Z. Chen, Y . Wang, E.-S. Chng, and X. Chen, “Audio-cot: Exploring chain-of-thought reasoning in large audio language model,” in2025 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU). IEEE, 2025, pp. 1–6

  26. [26]

    Cof-t2i: Video models as pure visual reasoners for text-to-image generation,

    C. Tong, M. Chang, S. Zhang, Y . Wang, C. Liang, Z. Zhao, R. An, B. Zeng, Y . Shi, Y . Daiet al., “Cof-t2i: Video models as pure visual reasoners for text-to-image generation,”arXiv preprint arXiv:2601.10061, 2026

  27. [27]

    Insight-v: Exploring long-chain visual reasoning with multimodal large language models,

    Y . Dong, Z. Liu, H.-L. Sun, J. Yang, W. Hu, Y . Rao, and Z. Liu, “Insight-v: Exploring long-chain visual reasoning with multimodal large language models,” inProceedings of the Computer Vision and Pattern Recognition Conference, 2025, pp. 9062–9072

  28. [28]

    Gemini: A Family of Highly Capable Multimodal Models

    G. Team, R. Anil, S. Borgeaud, J.-B. Alayrac, J. Yu, R. Soricut, J. Schalkwyk, A. M. Dai, A. Hauth, K. Mil- licanet al., “Gemini: a family of highly capable multimodal models,”arXiv preprint arXiv:2312.11805, 2023

  29. [29]

    Video-llama: An instruction-tuned audio-visual language model for video understanding,

    H. Zhang, X. Li, and L. Bing, “Video-llama: An instruction-tuned audio-visual language model for video understanding,” inProceedings of the 2023 conference on empirical methods in natural language processing: system demonstrations, 2023, pp. 543–553

  30. [30]

    Multimodal chain of continuous thought for latent-space reasoning in vision- language models,

    T.-H. Pham and C. Ngo, “Multimodal chain of continuous thought for latent-space reasoning in vision- language models,”arXiv preprint arXiv:2508.12587, 2025

  31. [31]

    When modalities conflict: How unimodal reasoning uncertainty governs preference dynamics in mllms,

    Z. Zhang, T. Wang, X. Gong, Y . Shi, H. Wang, D. Wang, and L. Hu, “When modalities conflict: How unimodal reasoning uncertainty governs preference dynamics in mllms,”arXiv preprint arXiv:2511.02243, 2025

  32. [32]

    Cognitive Pivot Points and Visual Anchoring: Unveiling and Rectifying Hallucinations in Multimodal Reasoning Models

    Z. Qian, Y . Ma, Z. Ouyang, Z. Wang, Z. Xu, F. Luo, X. Liu, Z. Ge, Y . Guo, and J. Han, “Cognitive pivot points and visual anchoring: Unveiling and rectifying hallucinations in multimodal reasoning models,” arXiv preprint arXiv:2604.10219, 2026

  33. [33]

    Seeing through the chain: Mitigate hallucination in multimodal reasoning models via cot compression and contrastive preference optimization,

    H. Fang, J. Li, J. Kong, T. Zhuang, K. Gao, B. Chen, S.-T. Xia, and Y . Wang, “Seeing through the chain: Mitigate hallucination in multimodal reasoning models via cot compression and contrastive preference optimization,”arXiv preprint arXiv:2602.03380, 2026

  34. [34]

    Thinking with sound: Audio chain-of-thought enables multimodal reasoning in large audio-language models,

    Z. Xiong, Y . Cai, Z. Li, J. Yuan, and Y . Wang, “Thinking with sound: Audio chain-of-thought enables multimodal reasoning in large audio-language models,”arXiv preprint arXiv:2509.21749, 2025

  35. [35]

    Thinking with Images for Multimodal Reasoning: Foundations, Methods, and Future Frontiers

    Z. Su, P. Xia, H. Guo, Z. Liu, Y . Ma, X. Qu, J. Liu, Y . Li, K. Zeng, Z. Yanget al., “Thinking with images for multimodal reasoning: Foundations, methods, and future frontiers,”arXiv preprint arXiv:2506.23918, 2025

  36. [36]

    Thinking with videos: Multimodal tool-augmented reinforcement learning for long video reasoning.arXiv preprint arXiv:2508.04416, 2025

    H. Zhang, X. Gu, J. Li, C. Ma, S. Bai, C. Zhang, B. Zhang, Z. Zhou, D. He, and Y . Tang, “Thinking with videos: Multimodal tool-augmented reinforcement learning for long video reasoning,”arXiv preprint arXiv:2508.04416, 2025

  37. [37]

    Act Wisely: Cultivating Meta-Cognitive Tool Use in Agentic Multimodal Models

    S. Yan, J. Tong, H. Xue, X. Tang, Y . Wang, K. Shi, G. Zhang, R. Li, and Y . Zou, “Act wisely: Cultivating meta-cognitive tool use in agentic multimodal models,”arXiv preprint arXiv:2604.08545, 2026

  38. [38]

    Chain-of-Thought Degrades Visual Spatial Reasoning Capabilities of Multimodal LLMs

    S. S. Kancheti, A. S. Kanade, V . N. Balasubramanian, and T. Ganu, “Chain-of-thought degrades visual spatial reasoning capabilities of multimodal llms,”arXiv preprint arXiv:2604.16060, 2026

  39. [39]

    Think before you speak: Training language models with pause tokens.arXiv preprint arXiv:2310.02226,

    S. Goyal, Z. Ji, A. S. Rawat, A. K. Menon, S. Kumar, and V . Nagarajan, “Think before you speak: Training language models with pause tokens,”arXiv preprint arXiv:2310.02226, 2023. 11

  40. [40]

    Training Large Language Models to Reason in a Continuous Latent Space

    S. Hao, S. Sukhbaatar, D. Su, X. Li, Z. Hu, J. Weston, and Y . Tian, “Training large language models to reason in a continuous latent space,”arXiv preprint arXiv:2412.06769, 2024

  41. [41]

    Quiet-STaR: Language Models Can Teach Themselves to Think Before Speaking

    E. Zelikman, G. Harik, Y . Shao, V . Jayasiri, N. Haber, and N. D. Goodman, “Quiet-star: Language models can teach themselves to think before speaking,”arXiv preprint arXiv:2403.09629, 2024

  42. [42]

    Revisiting Feature Prediction for Learning Visual Representations from Video

    A. Bardes, Q. Garrido, J. Ponce, X. Chen, M. Rabbat, Y . LeCun, M. Assran, and N. Ballas, “Revisiting feature prediction for learning visual representations from video,”arXiv preprint arXiv:2404.08471, 2024

  43. [43]

    Latent Visual Reasoning

    B. Li, X. Sun, J. Liu, Z. Wang, J. Wu, X. Yu, H. Chen, E. Barsoum, M. Chen, and Z. Liu, “Latent visual reasoning,”arXiv preprint arXiv:2509.24251, 2025

  44. [44]

    Monet: Reasoning in latent visual space beyond images and language,

    Q. Wang, Y . Shi, Y . Wang, Y . Zhang, P. Wan, K. Gai, X. Ying, and Y . Wang, “Monet: Reasoning in latent visual space beyond images and language,”arXiv preprint arXiv:2511.21395, 2025

  45. [45]

    Reasoning Within the Mind: Dynamic Multimodal Interleaving in Latent Space

    C. Liu, Y . Yang, Y . Fan, Q. Wei, S. Liu, and X. E. Wang, “Reasoning within the mind: Dynamic multimodal interleaving in latent space,”arXiv preprint arXiv:2512.12623, 2025

  46. [46]

    Latent implicit visual rea- soning.arXiv preprint arXiv:2512.21218, 2025

    K. Li, C. Shang, L. Karlinsky, R. Feris, T. Darrell, and R. Herzig, “Latent implicit visual reasoning,”arXiv preprint arXiv:2512.21218, 2025

  47. [47]

    Chain-of-visual-thought: Teaching vlms to see and think better with continuous visual tokens,

    Y . Qin, B. Wei, J. Ge, K. Kallidromitis, S. Fu, T. Darrell, and X. Wang, “Chain-of-visual-thought: Teaching vlms to see and think better with continuous visual tokens,”arXiv preprint arXiv:2511.19418, 2025

  48. [48]

    Towards universal video mllms with attribute-structured and quality-verified instructions,

    Y . Li, H. Zhang, M.-H. Guo, W. Gao, S. Jia, S. Jiao, Q. Hou, and M.-M. Cheng, “Towards universal video mllms with attribute-structured and quality-verified instructions,”arXiv preprint arXiv:2602.13013, 2026

  49. [49]

    Qwen3 Technical Report

    A. Yang, A. Li, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Gao, C. Huang, C. Lv et al., “Qwen3 technical report,” arXiv preprint arXiv:2505.09388, 2025

  50. [50]

    GLM-4.5: Agentic, Reasoning, and Coding (ARC) Foundation Models

    A. Zeng, X. Lv, Q. Zheng, Z. Hou, B. Chen, C. Xie, C. Wang, D. Yin, H. Zeng, J. Zhang et al., “Glm-4.5: Agentic, reasoning, and coding (arc) foundation models,” arXiv preprint arXiv:2508.06471, 2025

  51. [51]

    Lvomnibench: Pioneering long audio-video understanding evaluation for omnimodal llms

    K. Tao, Y. Zheng, J. Xu, W. Du, K. Shao, H. Wang, X. Chen, X. Jin, J. Zhu, B. Yu et al., “Lvomnibench: Pioneering long audio-video understanding evaluation for omnimodal llms,” arXiv preprint arXiv:2603.19217, 2026

  52. [52]

    VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs

    Z. Cheng, S. Leng, H. Zhang, Y. Xin, X. Li, G. Chen, Y. Zhu, W. Zhang, Z. Luo, D. Zhao et al., “Videollama 2: Advancing spatial-temporal modeling and audio understanding in video-llms,” arXiv preprint arXiv:2406.07476, 2024

  53. [53]

    MiniCPM-V: A GPT-4V Level MLLM on Your Phone

    Y. Yao, T. Yu, A. Zhang, C. Wang, J. Cui, H. Zhu, T. Cai, H. Li, W. Zhao, Z. He et al., “Minicpm-v: A gpt-4v level mllm on your phone,” arXiv preprint arXiv:2408.01800, 2024

  54. [54]

    VITA-1.5: Towards GPT-4o Level Real-Time Vision and Speech Interaction

    C. Fu, H. Lin, X. Wang, Y.-F. Zhang, Y. Shen, X. Liu, H. Cao, Z. Long, H. Gao, K. Li et al., “Vita-1.5: Towards gpt-4o level real-time vision and speech interaction,” arXiv preprint arXiv:2501.01957, 2025

  55. [55]

    Humanomniv2: From understanding to omni-modal reasoning with context

    Q. Yang, S. Yao, W. Chen, S. Fu, D. Bai, J. Zhao, B. Sun, B. Yin, X. Wei, and J. Zhou, “Humanomniv2: From understanding to omni-modal reasoning with context,” arXiv preprint arXiv:2506.21277, 2025

  56. [56]

    GPT-4o System Card

    A. Hurst, A. Lerer, A. P. Goucher, A. Perelman, A. Ramesh, A. Clark, A. Ostrow, A. Welihinda, A. Hayes, A. Radford et al., “Gpt-4o system card,” arXiv preprint arXiv:2410.21276, 2024

  57. [57]

    Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

    G. Comanici, E. Bieber, M. Schaekermann, I. Pasupat, N. Sachdeva, I. Dhillon, M. Blistein, O. Ram, D. Zhang, E. Rosen et al., “Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities,” arXiv preprint arXiv:2507.06261, 2025

  58. [58]

    A new era of intelligence with gemini 3

    S. Pichai, D. Hassabis, and K. Kavukcuoglu, “A new era of intelligence with gemini 3,” Google. URL: https://blog.google/products-and-platforms/products/gemini/gemini-3/, 2025.

    Appendices

    A Datasets Details

    A.1 Caption Database

    AvoCaDO is a newly curated dataset consisting of 107K high-quality, temporally-aligned audiovisual video captions. It emphasizes ...

  59. [60]

    Open-Ended Generation: Synthesize exactly one high-quality, open-ended question-answer pair that demands complex, multi-step cross-modal reasoning

  60. [61]

    "id": "OpenQA_01"

    Structured Output: Format the final result strictly as a JSON object containing the question, concise answer, and the specific reasoning type employed. Hard Constraints. • Cross-Modal Information Dependency: The question must strictly rely on the synthesis of both visual and audio information. It must be logically impossible to deduce the answer using only ...

  61. [62]

    Context Comprehension: Thoroughly analyze the provided audio and video captions to understand the synchronized multimodal events and their temporal correlations

  62. [63]

    Multiple-Choice Generation: Synthesize exactly one high-quality multiple-choice question (MCQ) that demands complex, multi-step cross-modal reasoning

  63. [64]

    "id": "MCQ_01"

    Structured Output: Format the final result strictly as a JSON object containing the question, four distinct options, the correct answer, and the specific reasoning type employed. Hard Constraints. • Cross-Modal Information Dependency: The question must strictly rely on the synthesis of both visual and audio information. It must be logically impossible to de...
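
    The structured-output requirement for the multiple-choice prompt can be checked mechanically. The following is a minimal sketch, not the authors' pipeline: the key names (`question`, `options`, `answer`, `reasoning_type`) and the function `validate_mcq` are hypothetical, since the prompt text above only states that the JSON object holds a question, four distinct options, the correct answer, and the reasoning type.

```python
import json

# Hypothetical key names: the prompt only says the object contains a question,
# four distinct options, the correct answer, and the reasoning type.
REQUIRED_KEYS = ("question", "options", "answer", "reasoning_type")


def validate_mcq(raw: str) -> list[str]:
    """Return a list of problems found in a generated MCQ JSON string."""
    try:
        item = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(item, dict):
        return ["top-level value is not a JSON object"]

    # Report every required key that is absent.
    problems = [f"missing key: {key}" for key in REQUIRED_KEYS if key not in item]

    # The prompt asks for exactly four distinct options.
    options = item.get("options")
    if options is not None:
        if not isinstance(options, list) or len(options) != 4:
            problems.append("expected exactly four options")
        elif len(set(map(str, options))) != 4:
            problems.append("options are not distinct")
    return problems
```

    An empty list means the item passes these structural checks. The cross-modal dependency constraint itself is semantic and is not something a check of this kind can verify.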

  64. [65]

    Objectively evaluate the quality, rigor, and grounding of the provided Question-Answer pair

  65. [66]

    Input Data

    Classify the user’s question into one specific AVQA category AND determine its primary modality dependency. Input Data. • [Standard AV Caption]: {AV_caption} • [Question]: {question} • [Ground Truth Answer]: {answer} Task 1: QA Quality Evaluation (1-5 Scale). Evaluate the provided QA pair across the following 6 dimensions:

  66. [67]

    Context Utilization & Relevance (1-5): Does the question effectively target the provided modality context? (5 = strictly relies on necessary AV information; 1 = ignores context or relies on external general knowledge)

  67. [68]

    Question Difficulty (1-5): How inherently difficult is the question? (5 = highly complex, multi-step reasoning or nuanced integration; 1 = simple, shallow factual lookup)

  68. [69]

    evaluation

    Deductive Requirement (1-5): Does answering the question require genuine logical deduction from the observations? (5 = requires deep step-by-step inference; 1 = pure parroting or trivial text matching). Task 2: Question Classification & Modality. First, classify the question into EXACTLY ONE of the following 10 categories: (1) Audio-Visual Joint Perception...

  69. [70]

    Preserve every single sentence from both the visual caption and audio caption exactly as they appear

  70. [71]

    Do NOT omit or delete any sentence in any way

  71. [72]

    You may reorder the sentences (from both captions) to create a logical and temporally accurate sequence that reflects the video’s events

  72. [73]

    watching

    Ensure the integrated narrative flows naturally in time with the video, aligning visual actions with corresponding sounds or spoken content. Verify before responding: Did I include every sentence from both captions? Figure 10: Prompt used to fuse the segment-level video caption and audio caption. Prompt 7: Segment Caption Refinement Task: Refine a fragment...

  73. [74]

    Construct a step-by-step reasoning chain that actively decides when source segments must be revisited

  74. [75]

    When a segment is needed, cite it using the format [Segment n]

  75. [76]

    Cite at least one segment and at most three segments in total, favoring the earliest or most decisive evidence

  76. [77]

    number two, best soup in Southeast Asia, and also number two,

    Continue the reasoning without explicitly referring back to segment numbers in the narrative text. Final Answer Requirement. End the response with a concise boxed answer in the format \boxed{your final answer here}, with no extra commentary afterward. Reference Output Pattern. Reason -> [Segment n] -> Reason -> ... -> \boxed{final answer} Figure 12: Promp...
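
    The output rules in this prompt (cite evidence as [Segment n], use between one and three segments, end with a \boxed{...} answer and nothing after it) are simple enough to check with regular expressions. The sketch below is illustrative only: `check_trajectory` is a hypothetical helper, and it reads "at most three segments in total" as three distinct segment indices, which is one possible interpretation of the prompt.

```python
import re

# Citation format taken from the prompt: [Segment n]
SEGMENT_RE = re.compile(r"\[Segment (\d+)\]")
# The boxed answer must close the response; only trailing whitespace may follow.
BOXED_RE = re.compile(r"\\boxed\{(.*)\}\s*$", re.DOTALL)


def check_trajectory(text: str) -> dict:
    """Extract segment citations and the final boxed answer from a reasoning chain."""
    cited = [int(n) for n in SEGMENT_RE.findall(text)]
    boxed = BOXED_RE.search(text)
    return {
        # Every citation in order of appearance, repeats included.
        "segments": cited,
        # Assumption: the 1-to-3 limit applies to distinct segment indices.
        "segment_count_ok": 1 <= len(set(cited)) <= 3,
        # None when no boxed answer ends the response.
        "answer": boxed.group(1).strip() if boxed else None,
    }
```

    A response that places commentary after the boxed answer yields `answer` of `None`, which flags a violation of the "no extra commentary afterward" rule.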