Pith Number

pith:B67AW5CH

pith:2025:B67AW5CHREYTXE5IWIWTEKXGAE

not attested · not anchored · not stored · refs resolved

WorldSense: Evaluating Real-world Omnimodal Understanding for Multimodal LLMs

Jack Hong, Jiayin Cai, Shilin Yan, Weidi Xie, Xiaolong Jiang, Yao Hu

The WorldSense benchmark shows that current multimodal models reach at most 65.1 percent accuracy on tasks requiring tight audio-visual synergy in real-world videos.

arxiv:2502.04326 v3 · 2025-02-06 · cs.CV · cs.AI


Add to your LaTeX paper

\usepackage{pith}
\pithnumber{B67AW5CHREYTXE5IWIWTEKXGAE}

Prints a linked badge after your title and injects PDF metadata. Compiles on arXiv.

Record completeness

1 Bitcoin timestamp

2 Internet Archive

3 Author claim open

4 Citations open

5 Replications open

✓ Portable graph bundle live

The bundle contains the canonical record plus signed events. A mirror can host it anywhere and recompute the same current state with the deterministic merge algorithm.
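As an illustration of why a deterministic merge lets any mirror recompute the same state, here is a minimal sketch. It is not Pith's actual algorithm, which this page does not describe. The event fields (`id`, `ts`, `field`, `value`), the sample events, and the last-write-wins rule are all assumptions made for the example.

```python
def merge_events(events):
    """Fold events into one state, independent of arrival order.

    Hypothetical event shape: {"id", "ts", "field", "value"}.
    """
    # Drop duplicate deliveries of the same event id.
    unique = {event["id"]: event for event in events}.values()
    # Impose a total order: timestamp first, event id as the tie-breaker.
    ordered = sorted(unique, key=lambda e: (e["ts"], e["id"]))
    state = {}
    for event in ordered:
        state[event["field"]] = event["value"]  # last write wins per field
    return state


# Hypothetical events, invented for illustration only.
events = [
    {"id": "e1", "ts": "2026-05-17T23:38:14Z", "field": "citations", "value": "open"},
    {"id": "e2", "ts": "2026-05-18T08:00:00Z", "field": "citations", "value": "closed"},
    {"id": "e3", "ts": "2026-05-18T08:00:00Z", "field": "replications", "value": "open"},
]

state_forward = merge_events(events)
state_reversed = merge_events(list(reversed(events)) + [events[0]])
print(state_forward == state_reversed)
```

Because the fold runs over a sorted, de-duplicated sequence, two mirrors that hold the same set of events reach the same state regardless of the order in which the events arrived.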

Claims

C1 · strongest claim

The experimental results indicate that existing models face significant challenges in understanding real-world scenarios (65.1% best accuracy).

C2 · weakest assumption

That the manually annotated QA pairs and the chosen 26 tasks accurately capture the requirements of real-world omnimodal understanding without introducing annotation bias or task selection that favors certain model architectures.

C3 · one-line summary

WorldSense provides the first benchmark requiring synergistic audio-video-text understanding on 1,662 real-world videos and 3,172 QA pairs, where the best current multimodal LLM reaches only 65.1% accuracy.

References

88 extracted · 88 resolved · 33 Pith anchors

[1] Flamingo: a visual language model for few-shot learning. Advances in Neural Information Processing Systems, 35:23716–23736, 2022

[2] Introducing the next generation of Claude 2024

[3] HourVideo: 1-hour video-language understanding 2024

[4] Driving with LLMs: Fusing object-level vector modality for explainable autonomous driving 2024

[5] Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling 2024 · arXiv:2412.05271

Formal links

2 machine-checked theorem links

Cited by

28 papers in Pith

VISD: Enhancing Video Reasoning via Structured Self-Distillation

VideoOdyssey: A Benchmark for Ultra-Long-Context and Omni-Modal Video Understanding

LatentOmni: Rethinking Omni-Modal Understanding via Unified Audio-Visual Latent Reasoning

When Vision Speaks for Sound

Omni-DuplexEval: Evaluating Real-time Duplex Omni-modal Interaction

Receipt and verification

First computed	2026-05-17T23:38:14.896721Z
Builder	pith-number-builder-2026-05-17-v1
Signature	Pith Ed25519 (`pith-v1-2026-05`) · public key
Schema	pith-number/v1.0

Canonical hash

0fbe0b744789313b93a8b22d322ae6012ded139e34f5e623b3294fc1413289ad

Aliases

arxiv: 2502.04326 · arxiv_version: 2502.04326v3 · doi: 10.48550/arxiv.2502.04326 · pith_short_12: B67AW5CHREYT · pith_short_16: B67AW5CHREYTXE5I · pith_short_8: B67AW5CH
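The identifiers on this page appear to be related in a simple way: Base32-encoding the first 16 bytes of the canonical hash (RFC 4648 alphabet, padding stripped) yields the 26-character Pith Number, and the short aliases are its prefixes. This is an observation that holds for the values shown here, not a documented rule of the scheme. The function name below is hypothetical.

```python
import base64

# Canonical hash as printed on this page.
CANONICAL_HASH = "0fbe0b744789313b93a8b22d322ae6012ded139e34f5e623b3294fc1413289ad"


def pith_number_from_hash(hex_digest: str) -> str:
    """Base32-encode the first 128 bits of a SHA-256 hex digest, without padding."""
    first_16_bytes = bytes.fromhex(hex_digest)[:16]
    return base64.b32encode(first_16_bytes).decode("ascii").rstrip("=")


full = pith_number_from_hash(CANONICAL_HASH)
print(full)                            # B67AW5CHREYTXE5IWIWTEKXGAE
print(full[:8], full[:12], full[:16])  # B67AW5CH B67AW5CHREYT B67AW5CHREYTXE5I
```

If the relationship holds in general, the short aliases give 40, 60, and 80 bits of the hash respectively, so shorter forms are easier to quote but more likely to collide.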

Agent API


Verify this Pith Number yourself

curl -sH 'Accept: application/ld+json' https://pith.science/pith/B67AW5CHREYTXE5IWIWTEKXGAE \
  | jq -c '.canonical_record' \
  | python3 -c "import sys,json,hashlib; b=json.dumps(json.loads(sys.stdin.read()), sort_keys=True, separators=(',',':'), ensure_ascii=False).encode(); print(hashlib.sha256(b).hexdigest())"
# expect: 0fbe0b744789313b93a8b22d322ae6012ded139e34f5e623b3294fc1413289ad

Canonical record JSON

{
  "metadata": {
    "abstract_canon_sha256": "e6e693f0a6a7a9ed5e057f30399701ce9984f76b3b747e47a28b490fd1a84d67",
    "cross_cats_sorted": [
      "cs.AI"
    ],
    "license": "http://creativecommons.org/licenses/by/4.0/",
    "primary_cat": "cs.CV",
    "submitted_at": "2025-02-06T18:59:40Z",
    "title_canon_sha256": "0ed6a53c364bbeb8baa039d1b28da43b1abbf6d28b95c5ed889d1e5a8e88afe4"
  },
  "schema_version": "1.0",
  "source": {
    "id": "2502.04326",
    "kind": "arxiv",
    "version": 3
  }
}
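The canonical record above can also be hashed offline, without the network call. The sketch below applies the same canonicalization as the one-liner in the verification command (sorted keys, compact separators, `ensure_ascii=False`, UTF-8 encoding) to a copy of the record pasted from this page. Whether the digest equals the published hash depends on the pasted JSON being value-identical to the record the server returns, so the comparison is printed rather than assumed. The function name is hypothetical.

```python
import hashlib
import json

# Canonical record copied from this page.
RECORD_JSON = """
{
  "metadata": {
    "abstract_canon_sha256": "e6e693f0a6a7a9ed5e057f30399701ce9984f76b3b747e47a28b490fd1a84d67",
    "cross_cats_sorted": ["cs.AI"],
    "license": "http://creativecommons.org/licenses/by/4.0/",
    "primary_cat": "cs.CV",
    "submitted_at": "2025-02-06T18:59:40Z",
    "title_canon_sha256": "0ed6a53c364bbeb8baa039d1b28da43b1abbf6d28b95c5ed889d1e5a8e88afe4"
  },
  "schema_version": "1.0",
  "source": {"id": "2502.04326", "kind": "arxiv", "version": 3}
}
"""

EXPECTED = "0fbe0b744789313b93a8b22d322ae6012ded139e34f5e623b3294fc1413289ad"


def canonical_sha256(record: dict) -> str:
    """Serialize with sorted keys and no whitespace, then SHA-256 the UTF-8 bytes."""
    payload = json.dumps(
        record, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    ).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


record = json.loads(RECORD_JSON)
digest = canonical_sha256(record)
print(digest)
print("matches published hash:", digest == EXPECTED)
```

Sorting keys and fixing the separators is what makes the digest independent of how the JSON was formatted or in what order its keys were written.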