Pith Number

pith:JO2LRJNG

pith:2023:JO2LRJNGHJ24KLDBJPAXPJVIWT

not attested · not anchored · not stored · refs resolved

MVBench: A Comprehensive Multi-modal Video Understanding Benchmark

Guo Chen, Jilan Xu, Kunchang Li, Limin Wang, Ping Luo, Yali Wang, Yi Liu, Yinan He, Yi Wang, Yizhuo Li, Yu Qiao, Zun Wang

Most multi-modal large language models (MLLMs) handle temporal understanding in videos poorly. MVBench measures that gap across 20 tasks, and the accompanying VideoChat2 model surpasses prior leading models by more than 15% on the benchmark.

arxiv:2311.17005 v4 · 2023-11-28 · cs.CV

Add to your LaTeX paper

\usepackage{pith}
\pithnumber{JO2LRJNGHJ24KLDBJPAXPJVIWT}

Prints a linked badge after your title and injects PDF metadata. Compiles on arXiv.

Record completeness

1 Bitcoin timestamp

2 Internet Archive

3 Author claim open

4 Citations open

5 Replications open

✓ Portable graph bundle live

The bundle contains the canonical record plus signed events. A mirror can host it anywhere and recompute the same current state with the deterministic merge algorithm.

Claims

C1 strongest claim

the existing MLLMs are far from satisfactory in temporal understanding, while our VideoChat2 largely surpasses these leading models by over 15% on MVBench.

C2 weakest assumption

That automatically converting public video annotations into multiple-choice QA pairs accurately measures the intended temporal skills without introducing annotation biases or allowing single-frame shortcuts.

C3 one-line summary

MVBench is a benchmark of 20 temporal video understanding tasks built by transforming static tasks into dynamic ones, with VideoChat2 outperforming prior MLLMs by over 15%.

References

104 extracted · 104 resolved · 24 Pith anchors

[1] Flamingo: a Visual Language Model for Few-Shot Learning 2022 · arXiv:2204.14198

[2] Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond 2023 · arXiv:2308.12966

[3] Frozen in time: A joint video and image encoder for end-to-end retrieval 2021

[4] Ali Furkan Biten, Rubèn Pérez Tito, Andrés Mafla, Lluís Gómez, Marçal Rusiñol, Ernest Valveny, C. V. Jawahar, and Dimosthenis Karatzas. Scene text visual question answering. In ICCV, 2019

[5] Language models are few-shot learners 2020

Formal links

2 machine-checked theorem links

Cited by

32 papers in Pith

When Does a Video-Language Model Stop Watching? Reward Strength Controls the Formation and Reversal of Visual Shortcuts in Multimodal RLVR

HPP: Hierarchical Programmatic Probing for Long Video Understanding by Decoupling Perception and Reasoning

NEST: Narrative Event Structures in Time for Long Video Understanding

Zone of Proximal Policy Optimization: Teacher in Prompts, Not Gradients

VTI-CoT: Visual-Textual Interleaved Chain of Thought for Video Reasoning

Receipt and verification

First computed	2026-05-17T23:38:13.189427Z
Builder	pith-number-builder-2026-05-17-v1
Signature	Pith Ed25519 (`pith-v1-2026-05`) · public key
Schema	pith-number/v1.0

Canonical hash

4bb4b8a5a63a75c52c614bc177a6a8b4f2ef75e3d32702851aad90c82f4dce44

Aliases

arxiv: 2311.17005 · arxiv_version: 2311.17005v4 · doi: 10.48550/arxiv.2311.17005 · pith_short_12: JO2LRJNGHJ24 · pith_short_16: JO2LRJNGHJ24KLDB · pith_short_8: JO2LRJNG
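
The aliases above follow a visible pattern, sketched below in Python. The short forms are plain prefixes of the full Pith Number, and the arXiv aliases are built from the source id and version. The full number also matches the first 26 characters of the RFC 4648 base32 encoding of the canonical hash shown on this page. That last relationship is an observation from the values here, not a documented rule of the scheme, and the function names `pith_from_hash` and `build_aliases` are hypothetical.

```python
import base64

# Values copied from this record.
CANONICAL_HASH = "4bb4b8a5a63a75c52c614bc177a6a8b4f2ef75e3d32702851aad90c82f4dce44"
PITH_NUMBER = "JO2LRJNGHJ24KLDBJPAXPJVIWT"


def pith_from_hash(hex_digest: str, length: int = 26) -> str:
    """Base32-encode the digest bytes and keep the first `length` characters.

    Assumption: inferred from this one record, not from a published spec.
    """
    encoded = base64.b32encode(bytes.fromhex(hex_digest)).decode("ascii")
    return encoded[:length]


def build_aliases(pith: str, arxiv_id: str, version: int) -> dict[str, str]:
    """Rebuild the alias set listed on this page from its three inputs."""
    return {
        "arxiv": arxiv_id,
        "arxiv_version": f"{arxiv_id}v{version}",
        "doi": f"10.48550/arxiv.{arxiv_id}",
        "pith_short_8": pith[:8],
        "pith_short_12": pith[:12],
        "pith_short_16": pith[:16],
    }


aliases = build_aliases(PITH_NUMBER, "2311.17005", 4)
for key, value in sorted(aliases.items()):
    print(f"{key}: {value}")
```

Because the short forms are prefixes, a resolver that accepts them has to handle the case where two records share a prefix. How this service does so is not stated on the page.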

Agent API

Resolver JSON · Graph JSON · Events JSON · Schema · Signing key

Verify this Pith Number yourself

curl -sH 'Accept: application/ld+json' https://pith.science/pith/JO2LRJNGHJ24KLDBJPAXPJVIWT \
  | jq -c '.canonical_record' \
  | python3 -c "import sys,json,hashlib; b=json.dumps(json.loads(sys.stdin.read()), sort_keys=True, separators=(',',':'), ensure_ascii=False).encode(); print(hashlib.sha256(b).hexdigest())"
# expect: 4bb4b8a5a63a75c52c614bc177a6a8b4f2ef75e3d32702851aad90c82f4dce44
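
The same check can be written as a small offline Python helper, which may be easier to reuse than the one-liner. This is a minimal sketch that mirrors the serialization settings in the command above (sorted keys, compact separators, non-ASCII characters kept as-is, UTF-8 bytes). The function names are hypothetical, and the sample dictionaries are made-up data used only to show that key order does not affect the result.

```python
import hashlib
import json


def canonical_json(record: dict) -> str:
    """Serialize with sorted keys and no whitespace, as in the one-liner above."""
    return json.dumps(record, sort_keys=True, separators=(",", ":"), ensure_ascii=False)


def canonical_sha256(record: dict) -> str:
    """SHA-256 hex digest of the UTF-8 bytes of the canonical serialization."""
    return hashlib.sha256(canonical_json(record).encode("utf-8")).hexdigest()


def verify(record: dict, expected_hex: str) -> bool:
    """Compare the recomputed digest with a published one, ignoring hex case."""
    return canonical_sha256(record) == expected_hex.lower()


# Made-up records: same content, different key order.
sample = {"b": 1, "a": {"d": "é", "c": []}}
reordered = {"a": {"c": [], "d": "é"}, "b": 1}

print(canonical_json(sample))
print(canonical_sha256(sample) == canonical_sha256(reordered))
```

To check this record, pass the parsed `canonical_record` object from the resolver response to `verify` together with the canonical hash listed above.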

Canonical record JSON

{
  "metadata": {
    "abstract_canon_sha256": "daea546185557c858111e16434b39fef7760fcb87d5ec7c76147f5c613d518f3",
    "cross_cats_sorted": [],
    "license": "http://creativecommons.org/licenses/by/4.0/",
    "primary_cat": "cs.CV",
    "submitted_at": "2023-11-28T17:59:04Z",
    "title_canon_sha256": "e9370fc3ea60975756dc5270470f9babbffcc54e5543ad4c496f5afa89d98774"
  },
  "schema_version": "1.0",
  "source": {
    "id": "2311.17005",
    "kind": "arxiv",
    "version": 4
  }
}