{"paper":{"title":"WorldSense: Evaluating Real-world Omnimodal Understanding for Multimodal LLMs","license":"http://creativecommons.org/licenses/by/4.0/","headline":"The WorldSense benchmark shows that current multimodal models reach at most 65.1 percent accuracy on tasks requiring tight audio-visual synergy in real-world videos.","cross_cats":["cs.AI"],"primary_cat":"cs.CV","authors_text":"Jack Hong, Jiayin Cai, Shilin Yan, Weidi Xie, Xiaolong Jiang, Yao Hu","submitted_at":"2025-02-06T18:59:40Z","abstract_excerpt":"We introduce WorldSense, the first benchmark to assess the multi-modal video understanding, that simultaneously encompasses visual, audio, and text inputs. In contrast to existing benchmarks, our WorldSense has several features: (i)collaboration of omni-modality, we design the evaluation tasks to feature a strong coupling of audio and video, requiring models to effectively utilize the synergistic perception of omni-modality; (ii)diversity of videos and tasks, WorldSense encompasses a diverse collection of 1,662 audio-visual synchronised videos, systematically categorized into 8 primary domains"},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"The experimental results indicate that existing models face significant challenges in understanding real-world scenarios (65.1% best accuracy).","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"That the manually annotated QA pairs and the chosen 26 tasks accurately capture the requirements of real-world omnimodal understanding without introducing annotation bias or task selection that favors certain model architectures.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"WorldSense provides the first benchmark requiring synergistic audio-video-text understanding on 1,662 real-world videos and 3,172 QA pairs, where the best current multimodal LLM reaches only 65.1% accuracy.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"The WorldSense benchmark shows that current multimodal models reach at most 65.1 percent accuracy on tasks requiring tight audio-visual synergy in real-world videos.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"50adba971c83e578b0db0f61842488817d112d8f728ff1225605f288b2be8517"},"source":{"id":"2502.04326","kind":"arxiv","version":3},"verdict":{"id":"d0ba289f-76ac-403e-85e1-c767a36b905d","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-17T05:48:39.250838Z","strongest_claim":"The experimental results indicate that existing models face significant challenges in understanding real-world scenarios (65.1% best accuracy).","one_line_summary":"WorldSense provides the first benchmark requiring synergistic audio-video-text understanding on 1,662 real-world videos and 3,172 QA pairs, where the best current multimodal LLM reaches only 65.1% accuracy.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"That the manually annotated QA pairs and the chosen 26 tasks accurately capture the requirements of real-world omnimodal understanding without introducing annotation bias or task selection that favors certain model architectures.","pith_extraction_headline":"The WorldSense benchmark shows that current multimodal models reach at most 65.1 percent accuracy on tasks requiring tight audio-visual synergy in real-world videos."},"references":{"count":88,"sample":[{"doi":"","year":2022,"title":"Flamingo: a visual language model for few-shot learning.Advances in neural information processing systems, 35:23716–23736","work_id":"4710bba6-9cf4-4a0c-92c0-404f40a04621","ref_index":1,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2024,"title":"Introducing the next generation of Claude","work_id":"1cbb4b5a-f4fe-41dd-8d13-af2db291d5b4","ref_index":2,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2024,"title":"Hourvideo: 1-hour video-language understanding","work_id":"8a5f3847-96f0-4e8c-aa50-0efe6ccf9dab","ref_index":3,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2024,"title":"Driving with llms: Fusing object-level vector modality for explainable autonomous driving","work_id":"2ee4fb29-d694-421f-9467-3ec2cf9736aa","ref_index":4,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2024,"title":"Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling","work_id":"ee70bdc8-4656-4849-ada7-ce42a2278d70","ref_index":5,"cited_arxiv_id":"2412.05271","is_internal_anchor":true}],"resolved_work":88,"snapshot_sha256":"f558c84e3e56ba3ed7577460fc9ca4468d4814064c932ef23b7df9bc6bf693de","internal_anchors":33},"formal_canon":{"evidence_count":2,"snapshot_sha256":"171201a3c31ca9854d1611ed2bbf6311253676993b22fe2fbb7f8c1a8f300d3d"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"}