pith. machine review for the scientific record.
Pith Number

pith:JO2LRJNG

pith:2023:JO2LRJNGHJ24KLDBJPAXPJVIWT
not attested · not anchored · not stored · refs resolved

MVBench: A Comprehensive Multi-modal Video Understanding Benchmark

Guo Chen, Jilan Xu, Kunchang Li, Limin Wang, Ping Luo, Yali Wang, Yi Liu, Yinan He, Yi Wang, Yizhuo Li, Yu Qiao, Zun Wang

Most multi-modal large language models struggle with temporal understanding in video. MVBench measures that gap across 20 tasks, and the authors' VideoChat2 model surpasses leading models by more than 15% on it.

arxiv:2311.17005 v4 · 2023-11-28 · cs.CV

Record completeness

1 Bitcoin timestamp
2 Internet Archive
3 Author claim open
4 Citations open
5 Replications open
Portable graph bundle live · download bundle · merged state
The bundle contains the canonical record plus signed events. A mirror can host it anywhere and recompute the same current state with the deterministic merge algorithm.

Claims

C1 strongest claim

the existing MLLMs are far from satisfactory in temporal understanding, while our VideoChat2 largely surpasses these leading models by over 15% on MVBench.

C2 weakest assumption

That automatically converting public video annotations into multiple-choice QA pairs accurately measures the intended temporal skills without introducing annotation biases or allowing single-frame shortcuts.

C3 one line summary

MVBench is a benchmark of 20 temporal video understanding tasks built by transforming static tasks into dynamic ones, with VideoChat2 outperforming prior MLLMs by over 15%.

References

104 extracted · 104 resolved · 24 Pith anchors

[1] Flamingo: a Visual Language Model for Few-Shot Learning 2022 · arXiv:2204.14198
[2] Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond 2023 · arXiv:2308.12966
[3] Frozen in time: A joint video and image encoder for end-to-end retrieval 2021
[4] Ali Furkan Biten, Rubèn Pérez Tito, Andrés Mafla, Lluís Gómez, Marçal Rusiñol, Ernest Valveny, C. V. Jawahar, and Dimosthenis Karatzas. Scene text visual question answering. In ICCV, 2019
[5] Language models are few-shot learners 2020

Formal links

2 machine-checked theorem links

Cited by

16 papers in Pith

Receipt and verification
First computed 2026-05-17T23:38:13.189427Z
Builder pith-number-builder-2026-05-17-v1
Signature Pith Ed25519 (pith-v1-2026-05) · public key
Schema pith-number/v1.0

Canonical hash

4bb4b8a5a63a75c52c614bc177a6a8b4f2ef75e3d32702851aad90c82f4dce44

Aliases

arxiv: 2311.17005 · arxiv_version: 2311.17005v4 · doi: 10.48550/arxiv.2311.17005 · pith_short_12: JO2LRJNGHJ24 · pith_short_16: JO2LRJNGHJ24KLDB · pith_short_8: JO2LRJNG
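The short aliases above are prefixes of the full 26-character identifier, and that identifier matches the start of the RFC 4648 Base32 encoding of the canonical hash. This relationship is an observation from the values on this page, not a documented rule of the Pith schema, so the Python sketch below is only a consistency check under that assumption. The helper name `pith_id_from_hash` and its `length` parameter are hypothetical.

```python
import base64

# Values copied from this record.
CANONICAL_HASH = "4bb4b8a5a63a75c52c614bc177a6a8b4f2ef75e3d32702851aad90c82f4dce44"
FULL_ID = "JO2LRJNGHJ24KLDBJPAXPJVIWT"


def pith_id_from_hash(hex_digest: str, length: int = 26) -> str:
    """Hypothetical helper: Base32-encode the digest bytes and keep a prefix.

    Assumption: the identifier is a prefix of the standard Base32 encoding
    of the SHA-256 digest. This is inferred from the page, not specified by it.
    """
    encoded = base64.b32encode(bytes.fromhex(hex_digest)).decode("ascii")
    return encoded[:length]


derived = pith_id_from_hash(CANONICAL_HASH)

# The short aliases are taken here as plain prefixes of the derived identifier.
aliases = {f"pith_short_{n}": derived[:n] for n in (8, 12, 16)}

print(derived == FULL_ID)
print(aliases)
```

If the assumption holds for other records, the same helper would let a mirror sanity-check an identifier against its hash without contacting the service.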
Agent API
Verify this Pith Number yourself
curl -sH 'Accept: application/ld+json' https://pith.science/pith/JO2LRJNGHJ24KLDBJPAXPJVIWT \
  | jq -c '.canonical_record' \
  | python3 -c "import sys,json,hashlib; b=json.dumps(json.loads(sys.stdin.read()), sort_keys=True, separators=(',',':'), ensure_ascii=False).encode(); print(hashlib.sha256(b).hexdigest())"
# expect: 4bb4b8a5a63a75c52c614bc177a6a8b4f2ef75e3d32702851aad90c82f4dce44
Canonical record JSON
{
  "metadata": {
    "abstract_canon_sha256": "daea546185557c858111e16434b39fef7760fcb87d5ec7c76147f5c613d518f3",
    "cross_cats_sorted": [],
    "license": "http://creativecommons.org/licenses/by/4.0/",
    "primary_cat": "cs.CV",
    "submitted_at": "2023-11-28T17:59:04Z",
    "title_canon_sha256": "e9370fc3ea60975756dc5270470f9babbffcc54e5543ad4c496f5afa89d98774"
  },
  "schema_version": "1.0",
  "source": {
    "id": "2311.17005",
    "kind": "arxiv",
    "version": 4
  }
}
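The curl pipeline above canonicalizes the record with sorted keys and compact separators before hashing. A minimal offline sketch of the same step in Python follows, using the record JSON as printed on this page. It assumes the printed record is byte-for-byte what the API returns as `canonical_record`; that has not been checked here, so the digest should be compared against the canonical hash listed above rather than taken on trust. The function name `canonical_sha256` is hypothetical.

```python
import hashlib
import json

# Record JSON as printed on this page (assumed to equal the API's canonical_record).
RECORD_JSON = """
{
  "metadata": {
    "abstract_canon_sha256": "daea546185557c858111e16434b39fef7760fcb87d5ec7c76147f5c613d518f3",
    "cross_cats_sorted": [],
    "license": "http://creativecommons.org/licenses/by/4.0/",
    "primary_cat": "cs.CV",
    "submitted_at": "2023-11-28T17:59:04Z",
    "title_canon_sha256": "e9370fc3ea60975756dc5270470f9babbffcc54e5543ad4c496f5afa89d98774"
  },
  "schema_version": "1.0",
  "source": {
    "id": "2311.17005",
    "kind": "arxiv",
    "version": 4
  }
}
"""


def canonical_sha256(record: dict) -> str:
    """Serialize with the same json.dumps options as the shell one-liner, then hash."""
    canonical = json.dumps(
        record, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    )
    return hashlib.sha256(canonical.encode()).hexdigest()


record = json.loads(RECORD_JSON)
digest = canonical_sha256(record)
print(digest)  # compare with the canonical hash shown on this page
```

Because keys are sorted before serialization, the digest does not depend on the order in which a mirror happens to store the fields.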