Pith Number

pith:B67AW5CH

pith:2025:B67AW5CHREYTXE5IWIWTEKXGAE
not attested · not anchored · not stored · refs resolved

WorldSense: Evaluating Real-world Omnimodal Understanding for Multimodal LLMs

Jack Hong, Jiayin Cai, Shilin Yan, Weidi Xie, Xiaolong Jiang, Yao Hu

The WorldSense benchmark shows that current multimodal models reach at most 65.1 percent accuracy on tasks requiring tight audio-visual synergy in real-world videos.

arxiv:2502.04326 v3 · 2025-02-06 · cs.CV · cs.AI

Add to your LaTeX paper
\usepackage{pith}
\pithnumber{B67AW5CHREYTXE5IWIWTEKXGAE}

Prints a linked badge after your title and injects PDF metadata. Compiles on arXiv.

Record completeness

1 Bitcoin timestamp
2 Internet Archive
3 Author claim open
4 Citations open
5 Replications open
Portable graph bundle live
The bundle contains the canonical record plus signed events. A mirror can host it anywhere and recompute the same current state with the deterministic merge algorithm.

Claims

C1 strongest claim

The experimental results indicate that existing models face significant challenges in understanding real-world scenarios (65.1% best accuracy).

C2 weakest assumption

That the manually annotated QA pairs and the chosen 26 tasks accurately capture the requirements of real-world omnimodal understanding without introducing annotation bias or task selection that favors certain model architectures.

C3 one line summary

WorldSense provides the first benchmark requiring synergistic audio-video-text understanding on 1,662 real-world videos and 3,172 QA pairs, where the best current multimodal LLM reaches only 65.1% accuracy.

References

88 extracted · 88 resolved · 33 Pith anchors


Formal links

2 machine-checked theorem links

Cited by

28 papers in Pith

Receipt and verification
First computed: 2026-05-17T23:38:14.896721Z
Builder: pith-number-builder-2026-05-17-v1
Signature: Pith Ed25519 (pith-v1-2026-05)
Schema: pith-number/v1.0

Canonical hash

0fbe0b744789313b93a8b22d322ae6012ded139e34f5e623b3294fc1413289ad

Aliases

arxiv: 2502.04326 · arxiv_version: 2502.04326v3 · doi: 10.48550/arxiv.2502.04326 · pith_short_12: B67AW5CHREYT · pith_short_16: B67AW5CHREYTXE5I · pith_short_8: B67AW5CH
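The identifiers listed for this record appear to be related in a simple way: Base32-encoding the first 16 bytes of the canonical hash reproduces the 26-character Pith Number, and the short aliases are its 8-, 12- and 16-character prefixes. The sketch below checks that relationship. It is an observation drawn from the values on this page, not a documented Pith specification, and the function names are hypothetical.

```python
import base64

# Canonical hash as shown on this page.
CANONICAL_HASH = "0fbe0b744789313b93a8b22d322ae6012ded139e34f5e623b3294fc1413289ad"


def pith_number_from_hash(hex_digest: str) -> str:
    """Assumed derivation: unpadded RFC 4648 Base32 of the first 16 digest bytes."""
    raw = bytes.fromhex(hex_digest)[:16]
    return base64.b32encode(raw).decode("ascii").rstrip("=")


def short_aliases(pith_number: str) -> dict:
    """Assumed derivation: each short alias is a fixed-length prefix."""
    return {f"pith_short_{n}": pith_number[:n] for n in (8, 12, 16)}


number = pith_number_from_hash(CANONICAL_HASH)
print(number)
print(short_aliases(number))
```

For this record the derived string equals the Pith Number and the three `pith_short_*` aliases listed above, so a short alias can be treated as a prefix lookup into the full identifier.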
Agent API
Verify this Pith Number yourself
curl -sH 'Accept: application/ld+json' https://pith.science/pith/B67AW5CHREYTXE5IWIWTEKXGAE \
  | jq -c '.canonical_record' \
  | python3 -c "import sys,json,hashlib; b=json.dumps(json.loads(sys.stdin.read()), sort_keys=True, separators=(',',':'), ensure_ascii=False).encode(); print(hashlib.sha256(b).hexdigest())"
# expect: 0fbe0b744789313b93a8b22d322ae6012ded139e34f5e623b3294fc1413289ad
Canonical record JSON
{
  "metadata": {
    "abstract_canon_sha256": "e6e693f0a6a7a9ed5e057f30399701ce9984f76b3b747e47a28b490fd1a84d67",
    "cross_cats_sorted": [
      "cs.AI"
    ],
    "license": "http://creativecommons.org/licenses/by/4.0/",
    "primary_cat": "cs.CV",
    "submitted_at": "2025-02-06T18:59:40Z",
    "title_canon_sha256": "0ed6a53c364bbeb8baa039d1b28da43b1abbf6d28b95c5ed889d1e5a8e88afe4"
  },
  "schema_version": "1.0",
  "source": {
    "id": "2502.04326",
    "kind": "arxiv",
    "version": 3
  }
}
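The shell pipeline under "Verify this Pith Number yourself" can also be written as a short Python script. This sketch applies the same canonicalization (sorted keys, compact separators, UTF-8 without ASCII escaping) to the record shown above and hashes it with SHA-256. It assumes the JSON printed on this page is exactly the `canonical_record` returned by the API; the helper names are hypothetical.

```python
import hashlib
import json

# Expected canonical hash as shown on this page.
EXPECTED = "0fbe0b744789313b93a8b22d322ae6012ded139e34f5e623b3294fc1413289ad"

# Canonical record copied from this page (assumed identical to the API response).
RECORD = {
    "metadata": {
        "abstract_canon_sha256": "e6e693f0a6a7a9ed5e057f30399701ce9984f76b3b747e47a28b490fd1a84d67",
        "cross_cats_sorted": ["cs.AI"],
        "license": "http://creativecommons.org/licenses/by/4.0/",
        "primary_cat": "cs.CV",
        "submitted_at": "2025-02-06T18:59:40Z",
        "title_canon_sha256": "0ed6a53c364bbeb8baa039d1b28da43b1abbf6d28b95c5ed889d1e5a8e88afe4",
    },
    "schema_version": "1.0",
    "source": {"id": "2502.04326", "kind": "arxiv", "version": 3},
}


def canonical_bytes(record: dict) -> bytes:
    """Serialize with sorted keys and no whitespace, as in the shell one-liner."""
    text = json.dumps(record, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
    return text.encode("utf-8")


def canonical_hash(record: dict) -> str:
    """Return the hex SHA-256 digest of the canonical serialization."""
    return hashlib.sha256(canonical_bytes(record)).hexdigest()


digest = canonical_hash(RECORD)
print(digest)
print("matches expected:", digest == EXPECTED)
```

Because keys are sorted before hashing, the digest does not depend on the order in which a mirror stores or transmits the fields.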