TraceAV-Bench: Benchmarking Multi-Hop Trajectory Reasoning over Long Audio-Visual Videos
Pith reviewed 2026-05-11 01:49 UTC · model grok-4.3
The pith
TraceAV-Bench shows that top AI models still struggle to chain evidence across long audio-visual videos.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
TraceAV-Bench is the first benchmark to jointly evaluate multi-hop reasoning over long audio-visual trajectories and multimodal hallucination robustness. It comprises 2,200 rigorously validated multiple-choice questions over 578 long videos totaling 339.5 hours, with each question grounded in an explicit reasoning chain averaging 3.68 hops across a 15.1-minute temporal span. The dataset was built via a three-step semi-automated pipeline followed by strict quality assurance. Evaluation across representative OmniLLMs reveals a persistent challenge, with Gemini 3.1 Pro reaching 68.29 percent on general tasks and Ming-Flash-Omni-2.0 reaching 51.70 percent, leaving substantial headroom, while also showing that robustness to multimodal hallucination is largely decoupled from general multimodal reasoning performance.
What carries the argument
The benchmark dataset itself, built from long videos and explicit multi-hop reasoning chains that span both visual and auditory streams across 4 dimensions and 15 sub-tasks.
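To make the "explicit reasoning chain" concrete, here is a minimal sketch of what one benchmark item of this shape could look like, assuming a simple record with per-hop modality and timestamps. The class and field names (EvidenceHop, TraceItem, start_s, answer_index) are illustrative assumptions, not the authors' released schema.

```python
from dataclasses import dataclass
from typing import List, Literal

# Hypothetical schema for one TraceAV-Bench-style item; field names are
# illustrative, not the paper's released format.

@dataclass
class EvidenceHop:
    modality: Literal["visual", "audio"]  # which stream carries this piece of evidence
    start_s: float                        # evidence onset, seconds into the video
    end_s: float                          # evidence offset
    description: str                      # what must be perceived at this hop

@dataclass
class TraceItem:
    video_id: str
    question: str
    options: List[str]                    # multiple-choice options
    answer_index: int                     # index of the correct option
    chain: List[EvidenceHop]              # explicit reasoning chain (avg. ~3.68 hops)

    @property
    def temporal_span_min(self) -> float:
        """Span between earliest and latest evidence hop, in minutes (avg. ~15.1)."""
        return (max(h.end_s for h in self.chain) - min(h.start_s for h in self.chain)) / 60.0
```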
If this is right
- Current OmniLLMs cannot reliably chain temporally dispersed evidence across audio and visual streams in videos longer than a few minutes.
- Robustness to multimodal hallucination must be measured separately from general multimodal reasoning performance.
- Progress on long-form audio-visual understanding will require new training or architectural approaches that handle sparse, cross-modal trajectories.
- Benchmarks limited to short clips or isolated modalities will continue to overestimate real-world capability.
Where Pith is reading between the lines
- Applications such as video surveillance or educational content analysis will remain limited until models close the gap shown here.
- The observed decoupling between hallucination resistance and reasoning performance suggests separate fine-tuning objectives could be effective.
- Similar multi-hop benchmarks may be needed for other long-form multimodal domains such as audio-only or text-video mixtures.
Load-bearing premise
The created questions genuinely demand multi-hop cross-modal evidence chaining and contain no solvable shortcuts or validation errors.
What would settle it
A concrete falsifier would be finding that state-of-the-art models achieve over 90 percent accuracy on TraceAV-Bench while still failing on independent long-video tasks that require the same chaining ability.
read the original abstract
Real-world audio-visual understanding requires chaining evidence that is sparse, temporally dispersed, and split across the visual and auditory streams, whereas existing benchmarks largely fail to evaluate this capability. They restrict videos to short clips, isolate modalities, or reduce questions to one-hop perception. We introduce TraceAV-Bench, the first benchmark to jointly evaluate multi-hop reasoning over long audio-visual trajectories and multimodal hallucination robustness. TraceAV-Bench comprises 2,200 rigorously validated multiple-choice questions over 578 long videos, totaling 339.5 hours, spanning 4 evaluation dimensions and 15 sub-tasks. Each question is grounded in an explicit reasoning chain that averages 3.68 hops across a 15.1-minute temporal span. The dataset is built by a three-step semi-automated pipeline followed by a strict quality assurance process. Evaluation of multiple representative OmniLLMs on TraceAV-Bench reveals that the benchmark poses a persistent challenge across all models, with the strongest closed-source model (Gemini 3.1 Pro) reaching only 68.29% on general tasks, and the best open-source model (Ming-Flash-Omni-2.0) reaching 51.70%, leaving substantial headroom. Moreover, we find that robustness to multimodal hallucination is largely decoupled from general multimodal reasoning performance. We anticipate that TraceAV-Bench will stimulate further research toward OmniLLMs that can reason coherently and faithfully over long-form audio-visual content.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces TraceAV-Bench, a benchmark of 2,200 multiple-choice questions over 578 long audio-visual videos (339.5 hours total). It targets multi-hop trajectory reasoning across audio-visual streams in long temporal contexts (average 15.1 min span, 3.68-hop explicit chains), plus multimodal hallucination robustness. The dataset spans 4 dimensions and 15 sub-tasks, constructed via a three-step semi-automated pipeline with strict quality assurance. Evaluations on OmniLLMs show persistent challenges, with Gemini 3.1 Pro at 68.29% on general tasks and the best open-source model at 51.70%, and reveal that hallucination robustness is largely decoupled from general reasoning performance.
Significance. If the questions genuinely isolate multi-hop cross-modal chaining, TraceAV-Bench would fill an important gap left by short-clip or single-hop benchmarks. The scale, explicit hop counts, long durations, and concrete model gaps (including the decoupling finding) are strengths that could drive progress in coherent long-form audio-visual reasoning. The semi-automated pipeline with QA is a positive methodological contribution.
major comments (3)
- [§3] §3 (Construction Pipeline): The manuscript describes the three-step semi-automated pipeline and states that each question is tied to an explicit 3.68-hop chain, but reports no hop-ablation, modality-ablation, or shortcut-detection experiments (e.g., providing only partial evidence or single-modality input). This is load-bearing for the central claim that the benchmark evaluates multi-hop trajectory reasoning rather than solvable shortcuts.
- [§3.3] §3.3 (Quality Assurance): The strict QA process is mentioned, yet no quantitative validation statistics—rejection rates, inter-annotator agreement, or error analysis—are provided. This weakens support for benchmark quality and the absence of data leaks or validation errors, directly affecting interpretation of the reported model scores.
- [§4] §4 (Experiments): While concrete scores are given (e.g., Gemini 3.1 Pro at 68.29%), the evaluation section does not analyze whether accuracy correlates with hop count or temporal span, leaving the multi-hop difficulty claim without direct empirical support.
minor comments (2)
- [Table 2] Table 2 (model results): Adding per-sub-task breakdowns or confidence intervals would improve clarity of the performance gaps.
- [§2] §2 (Related Work): A few additional citations to recent long-video QA benchmarks would strengthen the novelty positioning.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We agree that additional analyses are needed to more rigorously substantiate the multi-hop and cross-modal claims, and we will incorporate the suggested revisions to strengthen the manuscript.
read point-by-point responses
-
Referee: §3 (Construction Pipeline): The manuscript describes the three-step semi-automated pipeline and states that each question is tied to an explicit 3.68-hop chain, but reports no hop-ablation, modality-ablation, or shortcut-detection experiments (e.g., providing only partial evidence or single-modality input). This is load-bearing for the central claim that the benchmark evaluates multi-hop trajectory reasoning rather than solvable shortcuts.
Authors: We agree that ablation experiments are necessary to empirically confirm that the benchmark requires multi-hop cross-modal chaining rather than permitting shortcuts. In the revised manuscript, we will add modality-ablation results (audio-only and visual-only inputs) and partial-evidence ablations (providing only subsets of the hop chain) to Section 4. These will demonstrate performance degradation without full chains. While the pipeline explicitly constructs and validates the 3.68-hop chains with long temporal spans, the new experiments will directly address this concern. revision: yes
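A minimal sketch of what such a modality-ablation protocol could look like, assuming a standard multiple-choice accuracy metric; the model.answer interface and the item attributes (video_frames, audio_track, options, answer_index) are hypothetical stand-ins, not the paper's evaluation harness.

```python
# Hypothetical ablation harness; all interfaces here are assumed, not the authors' code.

def run_modality_ablations(model, items):
    conditions = {
        "audio+visual": lambda it: (it.video_frames, it.audio_track),
        "visual_only":  lambda it: (it.video_frames, None),
        "audio_only":   lambda it: (None, it.audio_track),
        # A partial-evidence condition would additionally blank everything outside a
        # subset of the annotated hop spans before calling the model.
    }
    scores = {}
    for name, prepare in conditions.items():
        correct = 0
        for it in items:
            frames, audio = prepare(it)
            pred = model.answer(it.question, it.options, frames=frames, audio=audio)
            correct += int(pred == it.answer_index)
        scores[name] = correct / len(items)
    return scores
```

Sharp accuracy drops in the visual-only, audio-only, and partial-evidence conditions, relative to the full audio-visual condition, would be the kind of evidence the rebuttal promises.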
-
Referee: §3.3 (Quality Assurance): The strict QA process is mentioned, yet no quantitative validation statistics—rejection rates, inter-annotator agreement, or error analysis—are provided. This weakens support for benchmark quality and the absence of data leaks or validation errors, directly affecting interpretation of the reported model scores.
Authors: We concur that quantitative QA metrics are essential for validating benchmark quality. We will revise §3.3 to include rejection rates during the QA process, inter-annotator agreement statistics (e.g., Fleiss' kappa), and a summary of error analysis categories. These metrics were collected during dataset construction but not reported in the original submission; they will now be added to provide stronger evidence against data leaks or annotation errors. revision: yes
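As a reference point for the promised agreement statistic, here is a standard Fleiss' kappa computation over an (items x categories) matrix of annotator counts; the toy rating matrix is an invented illustration, not the paper's QA data.

```python
import numpy as np

def fleiss_kappa(ratings: np.ndarray) -> float:
    """Fleiss' kappa; ratings[i, j] = number of annotators assigning category j to item i.
    Every row must sum to the same number of raters."""
    n_items, _ = ratings.shape
    n_raters = ratings.sum(axis=1)[0]
    p_j = ratings.sum(axis=0) / (n_items * n_raters)                       # category proportions
    p_i = (np.square(ratings).sum(axis=1) - n_raters) / (n_raters * (n_raters - 1))
    p_bar, p_e = p_i.mean(), np.square(p_j).sum()
    return (p_bar - p_e) / (1 - p_e)

# Invented toy example: 3 annotators labelling 4 QA checks as keep / revise / reject.
toy = np.array([[3, 0, 0],
                [2, 1, 0],
                [0, 3, 0],
                [1, 1, 1]])
print(round(fleiss_kappa(toy), 3))
```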
-
Referee: §4 (Experiments): While concrete scores are given (e.g., Gemini 3.1 Pro at 68.29%), the evaluation section does not analyze whether accuracy correlates with hop count or temporal span, leaving the multi-hop difficulty claim without direct empirical support.
Authors: We thank the referee for highlighting this gap. To provide direct empirical support for the multi-hop difficulty claim, we will expand §4 with new analyses: model accuracy broken down by hop count (e.g., 2-hop vs. 4-hop) and by temporal span bins, along with correlation coefficients between accuracy and these factors. This will be presented in tables and figures to show the expected increase in difficulty with more hops and longer spans. revision: yes
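A minimal sketch of the promised difficulty analysis, assuming per-question result records that expose hop count, temporal span, and correctness; the field names are hypothetical. Negative correlations (accuracy falling as hops or span grow) would directly support the multi-hop difficulty claim.

```python
from collections import defaultdict
from statistics import correlation, mean  # Python 3.10+

def difficulty_analysis(results):
    """results: iterable of dicts with 'hop_count', 'span_min', 'correct' (0/1)."""
    by_hops = defaultdict(list)
    for r in results:
        by_hops[r["hop_count"]].append(r["correct"])
    acc_by_hops = {h: mean(v) for h, v in sorted(by_hops.items())}

    # Pearson correlation between per-question difficulty factors and correctness.
    hops = [r["hop_count"] for r in results]
    spans = [r["span_min"] for r in results]
    correct = [r["correct"] for r in results]
    return {
        "accuracy_by_hop_count": acc_by_hops,
        "corr_hops_vs_correct": correlation(hops, correct),
        "corr_span_vs_correct": correlation(spans, correct),
    }
```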
Circularity Check
No circularity in benchmark construction or claims
full rationale
The paper constructs TraceAV-Bench through an external-video-based three-step semi-automated pipeline plus independent QA, with questions tied to explicitly authored reasoning chains. No equations, fitted parameters, self-citations, or uniqueness theorems are invoked to derive the core claims about multi-hop requirements or model headroom. Evaluations use separate models on the resulting dataset, keeping creation and measurement distinct. No step reduces a claimed result to its own inputs by construction.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: The semi-automated three-step pipeline followed by human QA produces questions that genuinely require and test multi-hop cross-modal reasoning chains.