Pith Number

pith:YEK2EB3U

pith:2025:YEK2EB3UZILCNPFUT75SKOASPQ
not attested · not anchored · not stored · refs resolved

Video-MMMU: Evaluating Knowledge Acquisition from Multi-Discipline Professional Videos

Bo Li, Fanyi Pu, Kairui Hu, Penghao Wu, Wang Xiao, Xiang Yue, Yuanhan Zhang, Ziwei Liu

Video-MMMU benchmark shows large multimodal models decline sharply in performance as video tasks require more cognitive adaptation.

arxiv:2501.13826 v1 · 2025-01-23 · cs.CV · cs.CL

Add to your LaTeX paper
\usepackage{pith}
\pithnumber{YEK2EB3UZILCNPFUT75SKOASPQ}

Prints a linked badge after your title and injects PDF metadata. Compiles on arXiv.

Record completeness

1 Bitcoin timestamp
2 Internet Archive
3 Author claim open
4 Citations open
5 Replications open
Portable graph bundle live · download bundle · merged state
The bundle contains the canonical record plus signed events. A mirror can host it anywhere and recompute the same current state with the deterministic merge algorithm.
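
The page does not specify the merge algorithm, so the following is only a minimal sketch of one way a merge can be made deterministic: deduplicate events, sort them by a total order, and fold them into a state. The event fields (`id`, `timestamp`, `field`, `value`) and the last-writer-wins rule are hypothetical and are not taken from the Pith bundle format.

```python
from typing import Iterable


def merge_events(events: Iterable[dict]) -> dict:
    """Fold events into a state that does not depend on arrival order.

    Hypothetical event shape: {"id", "timestamp", "field", "value"}.
    Assumes events sharing an id are identical copies of each other.
    """
    # Drop duplicate copies of the same event (mirrors may overlap).
    unique = {event["id"]: event for event in events}
    # Sort by a total order: timestamp first, id as the tie-breaker.
    ordered = sorted(unique.values(), key=lambda e: (e["timestamp"], e["id"]))
    state: dict = {}
    for event in ordered:
        # Last writer wins per field, under the total order above.
        state[event["field"]] = event["value"]
    return state


a = {"id": "e1", "timestamp": "2026-05-18T03:19:23Z", "field": "citations", "value": 41}
b = {"id": "e2", "timestamp": "2026-05-19T00:00:00Z", "field": "citations", "value": 42}
c = {"id": "e3", "timestamp": "2026-05-19T00:00:00Z", "field": "claimed", "value": False}

# Two mirrors that received the events in different orders, one with a duplicate.
state_one = merge_events([a, b, c])
state_two = merge_events([c, a, b, a])
print(state_one == state_two)
```

Because the fold runs over a sorted, deduplicated list, any mirror holding the same set of events arrives at the same state regardless of the order in which it received them.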

Claims

C1 · strongest claim

Evaluation of LMMs reveals a steep decline in performance as cognitive demands increase and highlights a significant gap between human and model knowledge acquisition, underscoring the need for methods to enhance LMMs' capability to learn and adapt from videos.

C2 · weakest assumption

That the 300 videos and 900 human-annotated questions capture the three cognitive stages of knowledge acquisition accurately and without bias, with no selection or annotation artifacts affecting the measured gaps.

C3 · one line summary

Video-MMMU benchmark shows large multimodal models exhibit steep performance drops on higher cognitive tasks when learning from professional videos and lag significantly behind humans in knowledge acquisition.

References

62 extracted · 62 resolved · 14 Pith anchors


Formal links

1 machine-checked theorem link

Cited by

42 papers in Pith

Receipt and verification
First computed: 2026-05-18T03:19:23.485360Z
Builder: pith-number-builder-2026-05-17-v1
Signature: Pith Ed25519 (pith-v1-2026-05)
Schema: pith-number/v1.0

Canonical hash

c115a20774ca1626bcb49ffb2538127c01322595c20ebd0fdb3baf0c12ded52d

Aliases

arxiv: 2501.13826 · arxiv_version: 2501.13826v1 · doi: 10.48550/arxiv.2501.13826 · pith_short_12: YEK2EB3UZILC · pith_short_16: YEK2EB3UZILCNPFU · pith_short_8: YEK2EB3U
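
The short aliases listed here are prefixes of the full Pith Number. For this record, the full identifier also coincides with the unpadded Base32 encoding of the first 16 bytes of the canonical hash. That relationship is an observation from the values on this page, not a documented rule, so the sketch below treats it as an assumption; the function names are illustrative.

```python
import base64

# Canonical hash as printed on this page.
CANONICAL_HASH = "c115a20774ca1626bcb49ffb2538127c01322595c20ebd0fdb3baf0c12ded52d"


def pith_number_from_hash(hex_digest: str) -> str:
    """Base32-encode the first 16 bytes of a SHA-256 hex digest, without padding.

    Assumption inferred from this single record, not from Pith documentation.
    """
    raw = bytes.fromhex(hex_digest)[:16]
    return base64.b32encode(raw).decode("ascii").rstrip("=")


def short_aliases(pith_number: str) -> dict:
    """Return the 8-, 12- and 16-character prefixes used as short aliases."""
    return {f"pith_short_{n}": pith_number[:n] for n in (8, 12, 16)}


full = pith_number_from_hash(CANONICAL_HASH)
print(full)  # YEK2EB3UZILCNPFUT75SKOASPQ
print(short_aliases(full))
```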
Agent API
Verify this Pith Number yourself
curl -sH 'Accept: application/ld+json' https://pith.science/pith/YEK2EB3UZILCNPFUT75SKOASPQ \
  | jq -c '.canonical_record' \
  | python3 -c "import sys,json,hashlib; b=json.dumps(json.loads(sys.stdin.read()), sort_keys=True, separators=(',',':'), ensure_ascii=False).encode(); print(hashlib.sha256(b).hexdigest())"
# expect: c115a20774ca1626bcb49ffb2538127c01322595c20ebd0fdb3baf0c12ded52d
Canonical record JSON
{
  "metadata": {
    "abstract_canon_sha256": "e26b739eaab3a336fdbef524f8c1175a46f3aac3cfe83c64a58b41e5c4e16c7a",
    "cross_cats_sorted": [
      "cs.CL"
    ],
    "license": "http://arxiv.org/licenses/nonexclusive-distrib/1.0/",
    "primary_cat": "cs.CV",
    "submitted_at": "2025-01-23T16:51:47Z",
    "title_canon_sha256": "d04a3d1b3579b429fd81cd2b06c42dc5c53e786c742b441ef725394c06e527a3"
  },
  "schema_version": "1.0",
  "source": {
    "id": "2501.13826",
    "kind": "arxiv",
    "version": 1
  }
}
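
The shell one-liner above can also be written as a small offline script. This is a sketch that applies the same serialization (sorted keys, compact separators, no ASCII escaping) to the record exactly as printed on this page. Whether the digest equals the published hash depends on the served record matching the printed one, so the script only reports the comparison.

```python
import hashlib
import json

EXPECTED = "c115a20774ca1626bcb49ffb2538127c01322595c20ebd0fdb3baf0c12ded52d"

# Canonical record copied from this page.
record = {
    "metadata": {
        "abstract_canon_sha256": "e26b739eaab3a336fdbef524f8c1175a46f3aac3cfe83c64a58b41e5c4e16c7a",
        "cross_cats_sorted": ["cs.CL"],
        "license": "http://arxiv.org/licenses/nonexclusive-distrib/1.0/",
        "primary_cat": "cs.CV",
        "submitted_at": "2025-01-23T16:51:47Z",
        "title_canon_sha256": "d04a3d1b3579b429fd81cd2b06c42dc5c53e786c742b441ef725394c06e527a3",
    },
    "schema_version": "1.0",
    "source": {"id": "2501.13826", "kind": "arxiv", "version": 1},
}


def canonical_sha256(obj: dict) -> str:
    """Serialize with sorted keys and compact separators, then SHA-256 the UTF-8 bytes."""
    payload = json.dumps(
        obj, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    ).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


digest = canonical_sha256(record)
print(digest)
print("matches published hash:", digest == EXPECTED)
```

Sorting keys before hashing means the digest does not depend on the order in which a client happens to store the fields.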