arxiv: 2604.16617 · v1 · submitted 2026-04-17 · 💻 cs.CV · cs.MM· cs.SD

Recognition: unknown

AVRT: Audio-Visual Reasoning Transfer through Single-Modality Teachers

Edson Araujo , Saurabhchand Bhati , M. Jehanzeb Mirza , Brian Kingsbury , Samuel Thomas , Rogerio Feris , James R. Glass , Hilde Kuehne

Authors on Pith no claims yet

Pith reviewed 2026-05-10 07:59 UTC · model grok-4.3

classification 💻 cs.CV cs.MMcs.SD

keywords audio-visual reasoningsingle-modality teachersreasoning trace generationmultimodal transfer learningsupervised fine-tuningreinforcement learningaudio-visual benchmarks

0 comments

The pith

Single-modality teacher models can generate effective audio-visual reasoning traces for training multimodal models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Limited high-quality data hinders transferring reasoning to audio-visual domains. AVRT addresses this by having specialized vision and audio models generate separate reasoning traces, which an LLM then merges into multimodal traces. These traces support a supervised fine-tuning stage followed by reinforcement learning on the target model. The resulting 3B and 7B models set new performance levels on audio-visual benchmarks and also improve on audio-only tasks, indicating benefits from cross-modal training.

Core claim

The central discovery is that merging independent reasoning traces from single-modality teacher models produces high-quality audio-visual reasoning data that, when used in a two-stage SFT and RL training process, enables small models to achieve state-of-the-art results on audio-visual and audio reasoning benchmarks while demonstrating transfer of capabilities to single-modality tasks.

What carries the argument

The LLM-based merger that combines separate vision-reasoning and audio-reasoning traces into unified audio-visual reasoning traces for supervised fine-tuning and reinforcement learning.

If this is right

3B and 7B parameter models reach state-of-the-art performance on OmniBench, DailyOmni for audio-visual reasoning and MMAR for audio-only reasoning.
Cross-modal training with merged traces transfers to improved performance on single-modality tasks.
A training pipeline using only single-modality data sources can produce effective multimodal reasoning models.
Supervised fine-tuning on merged traces followed by reinforcement learning is an effective two-stage adaptation method.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar merging approaches might apply to other modality pairs such as text and video where direct multimodal reasoning data is scarce.
The success suggests that reasoning traces contain modality-independent components that can be recombined.
Testing the method on larger models or different benchmark suites could reveal scalability limits or broader applicability.

Load-bearing premise

The reasoning traces created by merging outputs from separate single-modality teacher models are of high enough quality to drive effective supervised fine-tuning and reinforcement learning in the target model.

What would settle it

A direct comparison experiment where one model is trained on human-annotated audio-visual reasoning data and another on AVRT-generated traces, then evaluated on the same set of audio-visual and audio benchmarks, would determine if the merged traces are comparably effective.

Figures

Figures reproduced from arXiv: 2604.16617 by Brian Kingsbury, Edson Araujo, Hilde Kuehne, James R. Glass, M. Jehanzeb Mirza, Rogerio Feris, Samuel Thomas, Saurabhchand Bhati.

**Figure 1.** Figure 1: Overview of the AVRT pipeline: We first generate reasoning chains from single-modality teacher models that are prompted in the format they were optimized for and, second, leverage an LLM merger as an interface between the teacher models and the resulting reasoning chain to aggregate the information and put it into the target format. The resulting audio-visual traces are then used to train a student model i… view at source ↗

**Figure 2.** Figure 2: Qualitative results of the AVRT-trained model on OmniBench: It shows that the model trained on the respective AVRT-20K data is able to retrieve audio and visual information to answer the question, to combine the two sources of information, and to generate high-quality reasoning chains based on different cues in both modalities. while reducing IFE by only 0.7 points. This demonstrates that performance gains… view at source ↗

read the original abstract

Recent advances in reasoning models have shown remarkable progress in text-based domains, but transferring those capabilities to multimodal settings, e.g., to allow reasoning over audio-visual data, still remains a challenge, in part because of the limited availability of high-quality reasoning data in targeted multimodal combinations. To address this problem, we introduce AVRT, a novel framework that generates high-quality audio-visual reasoning traces from single-modality teacher models. We generate independent vision- and audio-reasoning traces via models specialized to reason over their respective modalities and merge the resulting traces with an LLM merger model. The resulting multimodal traces are used in a supervised fine-tuning (SFT) cold start to adapt the target model to audio-visual reasoning traces first, before training it in a second reinforcement learning stage on larger-scale data. Evaluated on seven audio-visual and audio benchmarks, our 3B and 7B parameter models achieve state-of-the-art results among models of comparable size including OmniBench and DailyOmni for audio-visual and MMAR for audio-only reasoning, showing that cross-modal training also transfers to single-modality tasks and establishing a new training pipeline for multimodal reasoning models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

AVRT gets small models to SOTA on audio-visual reasoning benchmarks through single-modality trace generation and LLM merging, but the quality of those merged traces is not directly validated.

read the letter

The paper's main contribution is a framework called AVRT that creates audio-visual reasoning traces by first running specialized vision and audio teacher models separately, then merging their outputs with an LLM. These merged traces go into a two-stage process: supervised fine-tuning on the target model, followed by reinforcement learning on larger data. Their 3B and 7B models achieve state-of-the-art results on OmniBench and DailyOmni for audio-visual tasks and MMAR for audio-only, among models of similar size.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces AVRT, a framework to address limited high-quality audio-visual reasoning data. It generates independent vision-reasoning and audio-reasoning traces from single-modality teacher models, merges them via an LLM merger model to produce audio-visual traces, performs supervised fine-tuning (SFT) as a cold start on these traces, and follows with reinforcement learning (RL) on larger-scale data. The 3B and 7B target models are evaluated on seven audio-visual and audio benchmarks and claimed to achieve state-of-the-art results among comparable-size models on OmniBench, DailyOmni (audio-visual) and MMAR (audio-only), with the additional observation that cross-modal training transfers to single-modality tasks.

Significance. If the central empirical claims hold after validation, the work would be significant for multimodal reasoning research: it offers a practical pipeline for synthesizing reasoning traces without native multimodal teachers, demonstrates positive transfer from audio-visual training to audio-only tasks, and scales to small models (3B/7B). The two-stage SFT-then-RL protocol is a standard but well-motivated design choice here. However, the significance is currently limited by the absence of any quantitative assessment of the merged traces themselves.

major comments (3)

[Method (trace generation and merging)] Method (trace generation and merging paragraph): No quantitative metrics, human ratings, coherence scores, or ablation studies are reported for the quality, logical consistency, or cross-modal alignment of the LLM-merged audio-visual reasoning traces. This is load-bearing for the central claim, because all SOTA results on OmniBench, DailyOmni, and MMAR are attributed to the AVRT pipeline; without evidence that the merger step produces high-quality, non-contradictory traces, performance gains cannot be distinguished from base-model scale or RL data volume.
[Experiments] Experiments section: The manuscript supplies no details on baselines, error bars, statistical significance tests, data exclusion criteria, or exact prompting and training hyperparameters for the seven benchmarks. This prevents verification of the “state-of-the-art among models of comparable size” assertion and makes the transfer-to-single-modality claim difficult to evaluate.
[Results] Results (OmniBench / DailyOmni / MMAR tables): No comparison is presented between models trained on merged traces versus models trained on native audio-visual reasoning traces (or versus an unmerged concatenation baseline). Such an ablation would be required to isolate the contribution of the LLM merger step.

minor comments (2)

[Abstract] Abstract: The phrase “seven audio-visual and audio benchmarks” is used without listing the full set; an explicit enumeration would improve readability.
[Throughout] Notation: The terms “AV traces,” “multimodal traces,” and “reasoning traces” are used interchangeably; consistent terminology throughout would reduce ambiguity.

Simulated Author's Rebuttal

3 responses · 1 unresolved

We thank the referee for their constructive and detailed review. We have carefully considered each major comment and will revise the manuscript to improve clarity, add necessary validations, and enhance experimental details. Our point-by-point responses are provided below.

read point-by-point responses

Referee: Method (trace generation and merging paragraph): No quantitative metrics, human ratings, coherence scores, or ablation studies are reported for the quality, logical consistency, or cross-modal alignment of the LLM-merged audio-visual reasoning traces. This is load-bearing for the central claim, because all SOTA results on OmniBench, DailyOmni, and MMAR are attributed to the AVRT pipeline; without evidence that the merger step produces high-quality, non-contradictory traces, performance gains cannot be distinguished from base-model scale or RL data volume.

Authors: We agree that direct assessment of the merged traces would strengthen the central claim. The manuscript currently uses downstream benchmark performance as the primary indicator of trace utility. In the revised version, we will add quantitative validation including: (i) LLM-as-judge coherence and alignment scores on a held-out set of 200 merged traces, (ii) human ratings (on a 1-5 scale for logical consistency and cross-modal alignment) from three annotators on a random sample of 100 traces, and (iii) an ablation comparing the full pipeline against an unmerged concatenation baseline. These additions will help isolate the merger step's contribution while acknowledging that end-to-end gains remain the strongest practical evidence. revision: yes
Referee: Experiments section: The manuscript supplies no details on baselines, error bars, statistical significance tests, data exclusion criteria, or exact prompting and training hyperparameters for the seven benchmarks. This prevents verification of the “state-of-the-art among models of comparable size” assertion and makes the transfer-to-single-modality claim difficult to evaluate.

Authors: We apologize for the insufficient detail in the main text. While some hyperparameter information appears in the appendix, we acknowledge that key elements for reproducibility and verification were not sufficiently highlighted. In the revision, we will expand the Experiments section with: a comprehensive baselines table (including model sizes and training regimes), error bars from at least three random seeds for main results, paired statistical significance tests (e.g., t-tests) for SOTA comparisons, explicit data exclusion criteria, and full prompting templates plus training hyperparameters for all seven benchmarks. This will allow independent verification of the reported results and transfer claims. revision: yes
Referee: Results (OmniBench / DailyOmni / MMAR tables): No comparison is presented between models trained on merged traces versus models trained on native audio-visual reasoning traces (or versus an unmerged concatenation baseline). Such an ablation would be required to isolate the contribution of the LLM merger step.

Authors: We note that the core motivation of AVRT is the scarcity of high-quality native audio-visual reasoning traces at scale; no such large-scale native dataset exists for direct comparison, which is why the single-modality teacher + merger approach was developed. We will therefore add an ablation against an unmerged concatenation baseline (vision and audio traces simply concatenated without LLM merging) to isolate the merger's benefit. We will also expand the discussion section to explicitly state the limitation regarding native traces and why such a comparison is not currently feasible. revision: partial

standing simulated objections not resolved

Direct comparison to models trained on native audio-visual reasoning traces, as no such large-scale high-quality data is available

Circularity Check

0 steps flagged

No circularity: purely empirical framework with external benchmarks

full rationale

The paper describes a training pipeline that generates separate vision and audio reasoning traces from single-modality teacher models, merges them with an LLM, and applies the resulting traces first for SFT then for RL. All reported results consist of direct performance numbers on external benchmarks (OmniBench, DailyOmni, MMAR and others). No equations, fitted parameters, or derivations appear; no self-citation is invoked to justify a uniqueness theorem or to define the core output; and no step renames a known result or smuggles an ansatz. The central claims therefore rest on observable outcomes against independent test sets rather than on any internal reduction to the method's own inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides no explicit free parameters, axioms, or invented entities; the approach relies on pre-existing single-modality teacher models and an LLM merger whose quality assumptions are unstated.

pith-pipeline@v0.9.0 · 5537 in / 1323 out tokens · 62091 ms · 2026-05-10T07:59:08.332691+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

41 extracted references · 21 canonical work pages · 10 internal anchors

[1]

In: CVPR (2026)

Bousselham, W., Kuehne, H., Schmid, C.: Vold: Reasoning transfer from llms to vision-language models via on-policy distillation. In: CVPR (2026)

2026
[2]

In: ICML (2025)

Busbridge, D., Shidani, A., Weers, F., Ramapuram, J., Littwin, E., Webb, R.: Distillation scaling laws. In: ICML (2025)

2025
[3]

VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs

Cheng, Z., Leng, S., Zhang, H., Xin, Y., Li, X., Chen, G., Zhu, Y., Zhang, W., Luo, Z., Zhao, D., Bing, L.: Videollama 2: Advancing spatial-temporal modeling and audio understanding in video-llms. arXiv preprint arXiv:2406.07476 (2024)

work page internal anchor Pith review arXiv 2024
[4]

In: ICCV (2025)

Chowdhury, S., Gani, H., Anand, N., Nag, S., Gao, R., Elhoseiny, M., Khan, S., Manocha, D.: Aurelia: Test-time reasoning distillation in audio-visual llms. In: ICCV (2025)

2025
[5]

In: European Conference on Computer Vision (ECCV) (2024)

Chowdhury, S., Nag, S., Dasgupta, S., Chen, J., Elhoseiny, M., Gao, R., Manocha, D.: Meerkat: Audio-visual large language model for grounding in space and time. In: European Conference on Computer Vision (ECCV) (2024)

2024
[6]

Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

Comanici, G., Bieber, E., Schaekermann, M., Pasupat, I., Sachdeva, N., Dhillon, I., Blistein, M., Ram, O., Zhang, D., Rosen, E., et al.: Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. arXiv preprint arXiv:2507.06261 (2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025
[7]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

DeepSeek-AI, Guo, D., Yang, D., Zhang, H., Song, J., Zhang, R., Xu, R., Zhu, Q., et al.: Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948 (2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025
[8]

In: CVPR (2025)

Dong,Y.,Liu,Z.,Sun,H.L.,Yang,J.,Hu,W.,Rao,Y.,Liu,Z.:Insight-v:Exploring long-chain visual reasoning with multimodal large language models. In: CVPR (2025)

2025
[9]

In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2025) 16 Araujo et al

Du, H., Li, G., Zhou, C., Zhang, C., Zhao, A., Hu, D.: Crab: A unified audio-visual scene understanding model with explicit cooperation. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2025) 16 Araujo et al

2025
[10]

In: CVPR (2025)

Fu, C., Dai, Y., Luo, Y., Li, L., Ren, S., Zhang, R., Wang, Z., Zhou, C., Shen, Y., Zhang, M., et al.: Video-MME: The first-ever comprehensive evaluation benchmark of multi-modal LLMs in video analysis. In: CVPR (2025)

2025
[11]

In: NeurIPS (2025)

Ghosh, S., Goel, A., Kim, J., Kumar, S., Kong, Z., gil Lee, S., Yang, C.H.H., Du- raiswami, R., Manocha, D., Valle, R., Catanzaro, B.: Audio flamingo 3: Advancing audio intelligence with fully open large audio language models. In: NeurIPS (2025)

2025
[12]

Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models

Huang, W., Jia, B., Zhai, Z., Cao, S., Ye, Z., Zhao, F., Xu, Z., Hu, Y., Lin, S.: Vision-r1: Incentivizing reasoning capability in multimodal large language models. arXiv preprint arXiv:2503.06749 (2025)

work page internal anchor Pith review arXiv 2025
[13]

OpenAI o1 System Card

Jaech, A., Kalai, A., Lerer, A., Richardson, A., El-Kishky, A., Low, A., Helyar, A., Madry, A., Beutel, A., Carney, A., et al.: Openai o1 system card. arXiv preprint arXiv:2412.16720 (2024)

work page internal anchor Pith review Pith/arXiv arXiv 2024
[14]

IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2026)

Kulkarni, Y., Fazli, P.: Avatar: Reinforcement learning to see, hear, and reason over video. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2026)

2026
[15]

CVPR (2022)

Li, G., Wei, Y., Tian, Y., Xu, C., Wen, J.R., Hu, D.: Learning to answer questions in dynamic audio-visual scenarios. CVPR (2022)

2022
[16]

Omnibench: Towards the future of universal omni-language models, 2025

Li, Y., Zhang, G., Ma, Y., Yuan, R., Zhu, K., Guo, H., Liang, Y., Liu, J., Yang, J., Wu, S., Qu, X., Shi, J., Zhang, X., Yang, Z., Wang, X., Zhang, Z., Liu, Z., Benetos, E., Huang, W., Lin, C.: Omnibench: Towards the future of universal omni-language models. arXiv preprint arXiv:2409.15272 (2024)

work page arXiv 2024
[17]

In: Proceedings of the Computer Vision and Pattern Recognition Conference (2025)

Liu, Z., Li, Y., Nguyen, K.D., Zhong, Y., Li, Y.: Pave: Patching and adapting video large language models. In: Proceedings of the Computer Vision and Pattern Recognition Conference (2025)

2025
[18]

Ola: Pushing the frontiers of omni-modal language model.arXiv preprint arXiv:2502.04328, 2025

Liu, Z., Dong, Y., Wang, J., Liu, Z., Hu, W., Lu, J., Rao, Y.: Ola: Pushing the fron- tiers of omni-modal language model with progressive modality alignment. arXiv preprint arXiv:2502.04328 (2025)

work page arXiv 2025
[19]

Av-reasoner: Improving and benchmarking clue-grounded audio-visual counting for mllms, 2025

Lu, L., Chen, G., Li, Z., Liu, Y., Lu, T.: Av-reasoner: Improving and benchmarking clue-grounded audio-visual counting for mllms. arXiv preprint arXiv:2506.05328 (2025)

work page arXiv 2025
[20]

Ma, Z., Ma, Y., Zhu, Y., Yang, C., Chao, Y.W., Xu, R., et al.: Mmar: A challeng- ing benchmark for deep reasoning in speech, audio, music, and their mix. Proc. NeurIPS (2025)

2025
[21]

OpenAI: Hello gpt-4o (2024),https://openai.com/index/hello-gpt-4o

2024
[22]

MMAU: A Massive Multi-Task Audio Understanding and Reasoning Benchmark

Sakshi, S., Tyagi, U., Kumar, S., Seth, A., Selvakumar, R., Nieto, O., Duraiswami, R., Ghosh, S., Manocha, D.: MMAU: A massive multi-task audio understanding and reasoning benchmark. arXiv preprint arXiv:2410.19168 (2024)

work page internal anchor Pith review arXiv 2024
[23]

DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

Shao, Z., Wang, P., Zhu, Q., Xu, R., Song, J., Bi, X., Zhang, H., Zhang, M., Li, Y., Wu, Y., et al.: Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300 (2024)

work page internal anchor Pith review Pith/arXiv arXiv 2024
[24]

In: Pro- ceedings of the International Conference on Machine Learning (ICML) (2025)

Sun, G., Yang, Y., Zhuang, J., Tang, C., Li, Y., Li, W., Ma, Z., Zhang, C.: video- SALMONN-o1: Reasoning-enhanced audio-visual large language model. In: Pro- ceedings of the International Conference on Machine Learning (ICML) (2025)

2025
[25]

Tang, Y., Shimada, D., Bi, J., Feng, M., Hua, H., Xu, C.: Empowering llms with pseudo-untrimmedvideosforaudio-visualtemporalunderstanding.In:Proceedings of the AAAI Conference on Artificial Intelligence. vol. 39 (2025)

2025
[26]

Kimi-VL Technical Report

Team, K., Du, A., Yin, B., Xing, B., Qu, B., Wang, B., Chen, C., Zhang, C., Du, C., Wei, C., Wang, C., Zhang, D., Du, D., Wang, D., Yuan, E., Lu, E., Li, F., Sung, F., Wei, G., Lai, G., et al.: Kimi-VL technical report. arXiv preprint arXiv:2504.07491 (2025) Audio-Visual Reasoning Transfer 17

work page internal anchor Pith review arXiv 2025
[27]

Advances in Neural Information Processing Systems36(2023)

Turpin, M., Michael, J., Perez, E., Bowman, S.: Language models don’t always say what they think: Unfaithful explanations in chain-of-thought prompting. Advances in Neural Information Processing Systems36(2023)

2023
[28]

In: Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing (2025)

Wang, Z., Yoon, J., Yu, S., Islam, M.M., Bertasius, G., Bansal, M.: Video-rts: Rethinking reinforcement learning and test-time scaling for efficient and enhanced video reasoning. In: Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing (2025)

2025
[29]

arXiv preprint arXiv:2504.15900 (2025)

Wen, C., Guo, T., Zhao, S., Zou, W., Li, X.: Sari: Structured audio reasoning via curriculum-guided reinforcement learning. arXiv preprint arXiv:2504.15900 (2025)

work page arXiv 2025
[30]

Audio-reasoner: Improving reasoning capability in large audio language models.arXiv preprint arXiv:2503.02318, 2025

Xie, Z., Lin, M., Liu, Z., Wu, P., Yan, S., Miao, C.: Audio-reasoner: Improving rea- soning capability in large audio language models. arXiv preprint arXiv:2503.02318 (2025)

work page arXiv 2025
[31]

arXiv preprint arXiv:2408.05093 (2024)

Xie,Z.:Ordermattersinhallucination:Reasoningorderasbenchmarkandreflexive prompting for large-language-models. arXiv preprint arXiv:2408.05093 (2024)

work page arXiv 2024
[32]

Echoink-r1: Exploring audio-visual reasoning in multimodal llms via reinforcement learning.arXiv preprint arXiv:2505.04623,

Xing, Z., Hu, X., Fu, C.W., Wang, W., Dai, J., Heng, P.A.: Echoink-r1: Explor- ing audio-visual reasoning in multimodal llms via reinforcement learning. arXiv preprint arXiv:2505.04623 (2025)

work page arXiv 2025
[33]

Qwen2.5-Omni Technical Report

Xu, J., Guo, Z., He, J., Hu, H., He, T., Bai, S., Chen, K., Wang, J., Fan, Y., Dang, K., Zhang, B., Wang, X., Chu, Y., Lin, J.: Qwen2.5-omni technical report. arXiv preprint arXiv:2503.20215 (2025)

work page internal anchor Pith review arXiv 2025
[34]

Qwen3-Omni Technical Report

Xu, J., Guo, Z., Hu, H., Chu, Y., Wang, X., He, J., Wang, Y., Shi, X., He, T., Zhu, X., et al.: Qwen3-omni technical report. arXiv preprint arXiv:2509.17765 (2025)

work page internal anchor Pith review arXiv 2025
[35]

In: Proceedings of the 30th ACM International Conference on Multimedia (2022)

Yang, P., Wang, X., Duan, X., Chen, H., Hou, R., Jin, C., Zhu, W.: Avqa: A dataset for audio-visual question answering on videos. In: Proceedings of the 30th ACM International Conference on Multimedia (2022)

2022
[36]

Humanomniv2: From understanding to omni-modal reasoning with context, 2025

Yang, Q., Yao, S., Chen, W., Fu, S., Bai, D., Zhao, J., Sun, B., Yin, B., Wei, X., Zhou, J.: Humanomniv2: From understanding to omni-modal reasoning with context. arXiv preprint arXiv:2506.21277 (2025)

work page arXiv 2025
[37]

arXiv preprint arXiv:2501.15111 , year=

Zhao, J., Yang, Q., Peng, Y., Bai, D., Yao, S., Sun, B., Chen, X., Fu, S., Wei, X., Bo, L., et al.: Humanomni: A large vision-speech language model for human-centric video understanding. arXiv preprint arXiv:2501.15111 (2025)

work page arXiv 2025
[38]

Omni-r1: Reinforcement learning for omnimodal reasoning via two-system collaboration.arXiv preprint arXiv:2505.20256,

Zhong, H., Zhu, M., Du, Z., Huang, Z., Zhao, C., Liu, M., Wang, W., Chen, H., Shen, C.: Omni-r1: Reinforcement learning for omnimodal reasoning via two- system collaboration. arXiv preprint arXiv:2505.20256 (2025)

work page arXiv 2025
[39]

Daily-omni: Towards audio-visual reasoning with temporal alignment across modalities, 2025

Zhou, Z., Wang, R., Wu, Z.: Daily-omni: Towards audio-visual reasoning with temporal alignment across modalities. arXiv preprint arXiv:2505.17862 (2025) Audio-Visual Reasoning Transfer 1 Supplementary Material This supplementary material is organized as follows. Section A presents ad- ditional analyses that further investigate the source and nature of AVR...

work page arXiv 2025
[40]

A multiple-choice question about the video 4

A separate audio file extracted from the same video 3. A multiple-choice question about the video 4. The answer choices
[41]

winner” (“A

The ground-truth correct answer 6. Two reasoning traces (Trace A and Trace B) generated by different AI systems Your task is to carefully watch the video, listen to the audio, read both reasoning traces, and determine which trace is better based on these criteria: Factual Accuracy –- Are the observations about the video and audio content correct? Complete...