MMOU: A Massive Multi-Task Omni Understanding and Reasoning Benchmark for Long and Complex Real-World Videos

Abhinav Shrivastava; Andrew Tao; Arushi Goel; Bryan Catanzaro; Dinesh Manocha; James Case; Kaousheik Jayakumar; Karan Sapra; Katie Lyons; Kevin J. Shih

arxiv: 2603.14145 · v2 · pith:6KGN5CWDnew · submitted 2026-03-14 · 💻 cs.CL · cs.CV

Arushi Goel, Sreyan Ghosh, Vatsal Agarwal, Nishit Anand, Kaousheik Jayakumar, Lasha Koroshinadze, Yao Xu, Katie Lyons, James Case, Karan Sapra, Kevin J. Shih, Siddharth Gururani, Abhinav Shrivastava, Ramani Duraiswami, Dinesh Manocha, Andrew Tao, Bryan Catanzaro, Mohammad Shoeybi, Wei Ping

keywords: mmou, models, understanding, videos, benchmark, long, multimodal, reasoning

Abstract

Multimodal Large Language Models (MLLMs) have shown strong performance in visual and audio understanding when evaluated in isolation. However, their ability to jointly reason over omni-modal (visual, audio, and textual) signals in long and complex videos remains largely unexplored. We introduce MMOU, a new benchmark designed to systematically evaluate multimodal understanding and reasoning under these challenging, real-world conditions. MMOU consists of 20,000 carefully curated questions paired with 11,877 web-collected videos of varying length, spanning diverse domains and exhibiting rich, tightly coupled audio-visual content. The benchmark covers 13 fundamental skill categories, all of which require integrating evidence across modalities and time. All questions are manually annotated across multiple turns by professional annotators, ensuring high quality and reasoning fidelity. We evaluate 20+ state-of-the-art open-source and proprietary multimodal models on MMOU. The results expose substantial performance gaps: the best closed-source model achieves only 64.2% accuracy, while the strongest open-source model reaches just 46.8%. Our results highlight the challenges of long-form omni-modal understanding, revealing that current models frequently fail to apply even fundamental skills in long videos. Through detailed analysis, we further identify systematic failure modes and provide insights into where and why current models break.

Forward citations

Cited by 4 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Omni-DeepSearch: A Benchmark for Audio-Driven Omni-Modal Deep Search
cs.SD · 2026-05 · unverdicted · novelty 8.0

Omni-DeepSearch is a 640-sample benchmark for audio-driven omni-modal search where the best model reaches only 43.44% accuracy, exposing bottlenecks in audio inference, tool use, and cross-modal reasoning.

Sandboxed Coding Agents are Competitive Omni-modal Task Solvers
cs.CL · 2026-05 · unverdicted · novelty 7.0

Sandboxed coding agents with text+image access match or outperform native omnimodal models on audio-video benchmarks by converting tasks into code-driven retrieval and processing.

Omni-DuplexEval: Evaluating Real-time Duplex Omni-modal Interaction
cs.CV · 2026-05 · conditional · novelty 7.0

Omni-DuplexEval creates a new benchmark and LLM-as-a-Judge framework for real-time duplex omni-modal interaction, revealing that current models score below 40% overall and struggle especially with proactive responses.

OmniHalluc-L: Counterfactual Benchmarking and Modality-Perturbation Reliability Calibration for Long-Form Omni Hallucination
cs.MM · 2026-06 · unverdicted · novelty 6.0

OmniHalluc-L benchmark shows open-weight omni models at 32-41% strict-pair accuracy on long-form hallucination, raised to 36-51% by Modality-Perturbation Reliability Calibration that fuses audio-negative probe shifts ...