Baseline reference

Av-odyssey bench: Can your multimodal llms really understand audio-visual information?

10 Kaixiong Gong, Kaituo Feng, Bohao Li, Yibing Wang, Mofan Cheng, Shijia Yang, Jiaming Han, Benyou Wang, Yutong Bai, Zhuoran Yang, et al · 2024 · arXiv 2412.02611

Baseline reference. 60% of citing Pith papers use this work as a benchmark or comparison.

9 Pith papers citing it

Baseline 60% of classified citations

read on arXiv browse 9 citing papers

citation-role summary

background 2 dataset 2 baseline 1

citation-polarity summary

background 2 use dataset 2 baseline 1

representative citing papers

Omni-DeepSearch: A Benchmark for Audio-Driven Omni-Modal Deep Search

cs.SD · 2026-05-09 · unverdicted · novelty 8.0

Omni-DeepSearch is a 640-sample benchmark for audio-driven omni-modal search where the best model reaches only 43.44% accuracy, exposing bottlenecks in audio inference, tool use, and cross-modal reasoning.

TraceAV-Bench: Benchmarking Multi-Hop Trajectory Reasoning over Long Audio-Visual Videos

cs.CV · 2026-05-08 · unverdicted · novelty 8.0

TraceAV-Bench is the first benchmark for multi-hop trajectory reasoning over long audio-visual videos, showing top models reach only 51-68% accuracy with substantial room for improvement.

VideoOdyssey: A Benchmark for Ultra-Long-Context and Omni-Modal Video Understanding

cs.CV · 2026-05-21 · unverdicted · novelty 7.0

VideoOdyssey is a new benchmark featuring ultra-long videos (avg. 109 min) across 11 domains with multi-level continuous certificates (avg. 16 min for visual, 12.8 min for audio-visual) to diagnose MLLM limitations in continuous reasoning and omni-modal perception.

Boosting Omni-Modal Language Models: Staged Post-Training with Visually Debiased Evaluation

cs.MM · 2026-05-12 · unverdicted · novelty 7.0 · 2 refs

Visual debiasing of omni-modal benchmarks combined with staged post-training lets a 3B model match or exceed a 30B model without a stronger teacher.

Chain of Modality: From Static Fusion to Dynamic Orchestration in Omni-MLLMs

cs.CV · 2026-04-16 · unverdicted · novelty 7.0

Chain of Modality dynamically orchestrates multimodal input topologies and bifurcates cognitive execution to overcome static fusion biases in Omni-MLLMs.

XModBench: Benchmarking Cross-Modal Capabilities and Consistency in Omni-Language Models

cs.CV · 2025-10-16 · conditional · novelty 7.0

XModBench is a tri-modal benchmark that systematically measures cross-modal consistency, modality disparities, and directional imbalances in omni-language models across five task families and all modality combinations.

WorldSense: Evaluating Real-world Omnimodal Understanding for Multimodal LLMs

cs.CV · 2025-02-06 · unverdicted · novelty 7.0

WorldSense provides the first benchmark requiring synergistic audio-video-text understanding on 1,662 real-world videos and 3,172 QA pairs, where the best current multimodal LLM reaches only 65.1% accuracy.

Qwen2.5-Omni Technical Report

cs.CL · 2025-03-26 · conditional · novelty 5.0

Qwen2.5-Omni presents a multimodal model with block-wise encoders, TMRoPE position embeddings, and a Thinker-Talker architecture that enables simultaneous text and streaming speech generation while matching text performance on reasoning benchmarks.

Multimodal Chain-of-Thought Reasoning: A Comprehensive Survey

cs.CV · 2025-03-16 · unverdicted · novelty 2.0

The paper provides the first comprehensive survey of multimodal chain-of-thought reasoning, including foundational concepts, a taxonomy of methodologies, application analyses, challenges, and future directions.

citing papers explorer

Showing 9 of 9 citing papers.

Omni-DeepSearch: A Benchmark for Audio-Driven Omni-Modal Deep Search cs.SD · 2026-05-09 · unverdicted · none · ref 7
Omni-DeepSearch is a 640-sample benchmark for audio-driven omni-modal search where the best model reaches only 43.44% accuracy, exposing bottlenecks in audio inference, tool use, and cross-modal reasoning.
TraceAV-Bench: Benchmarking Multi-Hop Trajectory Reasoning over Long Audio-Visual Videos cs.CV · 2026-05-08 · unverdicted · none · ref 31
TraceAV-Bench is the first benchmark for multi-hop trajectory reasoning over long audio-visual videos, showing top models reach only 51-68% accuracy with substantial room for improvement.
VideoOdyssey: A Benchmark for Ultra-Long-Context and Omni-Modal Video Understanding cs.CV · 2026-05-21 · unverdicted · none · ref 10
VideoOdyssey is a new benchmark featuring ultra-long videos (avg. 109 min) across 11 domains with multi-level continuous certificates (avg. 16 min for visual, 12.8 min for audio-visual) to diagnose MLLM limitations in continuous reasoning and omni-modal perception.
Boosting Omni-Modal Language Models: Staged Post-Training with Visually Debiased Evaluation cs.MM · 2026-05-12 · unverdicted · none · ref 29 · 2 links
Visual debiasing of omni-modal benchmarks combined with staged post-training lets a 3B model match or exceed a 30B model without a stronger teacher.
Chain of Modality: From Static Fusion to Dynamic Orchestration in Omni-MLLMs cs.CV · 2026-04-16 · unverdicted · none · ref 8
Chain of Modality dynamically orchestrates multimodal input topologies and bifurcates cognitive execution to overcome static fusion biases in Omni-MLLMs.
XModBench: Benchmarking Cross-Modal Capabilities and Consistency in Omni-Language Models cs.CV · 2025-10-16 · conditional · none · ref 8
XModBench is a tri-modal benchmark that systematically measures cross-modal consistency, modality disparities, and directional imbalances in omni-language models across five task families and all modality combinations.
WorldSense: Evaluating Real-world Omnimodal Understanding for Multimodal LLMs cs.CV · 2025-02-06 · unverdicted · none · ref 21
WorldSense provides the first benchmark requiring synergistic audio-video-text understanding on 1,662 real-world videos and 3,172 QA pairs, where the best current multimodal LLM reaches only 65.1% accuracy.
Qwen2.5-Omni Technical Report cs.CL · 2025-03-26 · conditional · none · ref 17
Qwen2.5-Omni presents a multimodal model with block-wise encoders, TMRoPE position embeddings, and a Thinker-Talker architecture that enables simultaneous text and streaming speech generation while matching text performance on reasoning benchmarks.
Multimodal Chain-of-Thought Reasoning: A Comprehensive Survey cs.CV · 2025-03-16 · unverdicted · none · ref 260
The paper provides the first comprehensive survey of multimodal chain-of-thought reasoning, including foundational concepts, a taxonomy of methodologies, application analyses, challenges, and future directions.

Av-odyssey bench: Can your multimodal llms really understand audio-visual information?

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer