Pith Number

pith:7L2LCIDI

pith:2023:7L2LCIDIL22PSJFQG4OCMYKWBU
not attested · not anchored · not stored · refs resolved

InternVid: A Large-scale Video-Text Dataset for Multimodal Understanding and Generation

Conghui He, Guo Chen, Jiashuo Yu, Kunchang Li, Limin Wang, Ping Luo, Xinhao Li, Xin Ma, Xinyuan Chen, Yali Wang, Yaohui Wang, Yinan He, Yi Wang, Yizhuo Li, Yu Qiao, Ziwei Liu

A scalable LLM-based captioning method produces a 7-million-video dataset; models trained on it reach leading zero-shot action recognition.

arxiv:2307.06942 v2 · 2023-07-13 · cs.CV

Add to your LaTeX paper
\usepackage{pith}
\pithnumber{7L2LCIDIL22PSJFQG4OCMYKWBU}

Prints a linked badge after your title and injects PDF metadata. Compiles on arXiv.

Record completeness

1. Bitcoin timestamp
2. Internet Archive
3. Author claim: open
4. Citations: open
5. Replications: open
Portable graph bundle: live
The bundle contains the canonical record plus signed events. A mirror can host it anywhere and recompute the same current state with the deterministic merge algorithm.
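The order-independence described above can be illustrated with a small sketch. The record does not publish the actual merge algorithm, so everything here is an assumption: the event fields (`id`, `ts`, `field`, `value`) and the function name `merge_events` are hypothetical, and signature checking is left out. The only point is that sorting events into a fixed total order before applying them lets every mirror reach the same state.

```python
from typing import Any


def merge_events(canonical: dict[str, Any], events: list[dict[str, Any]]) -> dict[str, Any]:
    """Fold events into a copy of the canonical record in a fixed total order.

    Hypothetical event shape: {"id", "ts", "field", "value"}.
    Events sharing an id are assumed to be identical copies.
    """
    state = dict(canonical)
    # Drop duplicate deliveries of the same event.
    unique = {event["id"]: event for event in events}
    # Sort by (timestamp, id) so arrival order cannot change the outcome.
    for event in sorted(unique.values(), key=lambda e: (e["ts"], e["id"])):
        state[event["field"]] = event["value"]
    return state


events = [
    {"id": "e1", "ts": "2026-05-17T23:38:53Z", "field": "citations", "value": "open"},
    {"id": "e2", "ts": "2026-05-18T08:00:00Z", "field": "citations", "value": "closed"},
]
same = merge_events({}, events) == merge_events({}, list(reversed(events)))
```

Two mirrors that receive the same events in different orders end up with `same` equal to `True` under these assumptions.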

Claims

C1 strongest claim

Trained on InternVid via contrastive learning, ViCLIP demonstrates leading zero-shot action recognition and competitive video retrieval performance.

C2 weakest assumption

The multi-scale LLM-generated descriptions are sufficiently accurate and diverse to produce transferable video-text representations without introducing systematic biases or hallucinations that degrade downstream performance.

C3 one line summary

InternVid supplies 7M videos and LLM captions to train ViCLIP, which reaches leading zero-shot action recognition and competitive retrieval performance.
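The contrastive learning and zero-shot recognition named in the claims can be sketched generically. This is not ViCLIP's training code: it is a plain symmetric InfoNCE loss over paired video and text embeddings, plus nearest-label classification by cosine similarity. The function names, the toy 2-dimensional embeddings, and the temperature of 0.07 are all assumptions made for illustration.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def diagonal_cross_entropy(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        peak = max(row)  # subtract the max for numerical stability
        log_z = peak + math.log(sum(math.exp(x - peak) for x in row))
        total += log_z - row[i]
    return total / len(logits)


def symmetric_info_nce(video_emb, text_emb, temperature=0.07):
    """Average of video-to-text and text-to-video InfoNCE losses."""
    videos = [l2_normalize(v) for v in video_emb]
    texts = [l2_normalize(t) for t in text_emb]
    logits = [
        [sum(a * b for a, b in zip(v, t)) / temperature for t in texts]
        for v in videos
    ]
    transposed = [list(col) for col in zip(*logits)]
    return 0.5 * (diagonal_cross_entropy(logits) + diagonal_cross_entropy(transposed))


def zero_shot_label(video_vec, label_embs):
    """Return the label whose text embedding is closest to the video embedding."""
    video = l2_normalize(video_vec)
    scores = {
        name: sum(a * b for a, b in zip(video, l2_normalize(emb)))
        for name, emb in label_embs.items()
    }
    return max(scores, key=scores.get)


paired = [[1.0, 0.0], [0.0, 1.0]]
loss_matched = symmetric_info_nce(paired, paired)
loss_swapped = symmetric_info_nce(paired, paired[::-1])
label = zero_shot_label([0.9, 0.1], {"running": [1.0, 0.0], "swimming": [0.0, 1.0]})
```

Correctly paired embeddings give a much lower loss than swapped ones, and the toy video vector is assigned to "running", the closest label embedding.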

References

82 extracted · 82 resolved · 13 Pith anchors

[1] Language models are few-shot learners 2020
[2] HowTo100M: Learning a text-video embedding by watching hundred million narrated video clips 2019
[3] Advancing high-resolution video-language representation with large-scale video transcriptions 2022
[4] MERLOT: Multimodal neural script knowledge models 2021
[5] MERLOT Reserve: Neural script knowledge through vision and language and sound 2022

Formal links

2 machine-checked theorem links

Cited by

37 papers in Pith

Receipt and verification
First computed: 2026-05-17T23:38:53.259393Z
Builder: pith-number-builder-2026-05-17-v1
Signature: Pith Ed25519 (pith-v1-2026-05)
Schema: pith-number/v1.0

Canonical hash

faf4b120685eb4f924b0371c2661560d14b994c3b8f88f9e5423bea254dd3710
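For this record, the 26-character Pith Number coincides with the unpadded Base32 encoding of the first 16 bytes of the canonical hash. The page does not state this as the rule, so treat the derivation below as an observation about this one record rather than a documented specification; the function name `number_from_hash` is hypothetical.

```python
import base64

CANONICAL_HASH = "faf4b120685eb4f924b0371c2661560d14b994c3b8f88f9e5423bea254dd3710"
PITH_NUMBER = "7L2LCIDIL22PSJFQG4OCMYKWBU"


def number_from_hash(hex_digest: str) -> str:
    """Base32-encode the first 16 bytes of a hex digest and drop '=' padding."""
    prefix = bytes.fromhex(hex_digest)[:16]
    return base64.b32encode(prefix).decode("ascii").rstrip("=")


matches = number_from_hash(CANONICAL_HASH) == PITH_NUMBER
```

Here `matches` is `True`. Whether other Pith Numbers follow the same construction is not confirmed by this page.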

Aliases

arxiv: 2307.06942 · arxiv_version: 2307.06942v2 · doi: 10.48550/arxiv.2307.06942 · pith_short_12: 7L2LCIDIL22P · pith_short_16: 7L2LCIDIL22PSJFQ · pith_short_8: 7L2LCIDI
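The three `pith_short_*` aliases listed above are the leading 8, 12, and 16 characters of the full Pith Number. A minimal sketch of that relationship follows; the helper name `short_aliases` is hypothetical, and the page does not say how a short alias is resolved if two records share a prefix.

```python
PITH_NUMBER = "7L2LCIDIL22PSJFQG4OCMYKWBU"


def short_aliases(number: str, lengths=(8, 12, 16)) -> dict[str, str]:
    """Build the short aliases as fixed-length prefixes of the full number."""
    return {f"pith_short_{n}": number[:n] for n in lengths}


aliases = short_aliases(PITH_NUMBER)
```

The resulting values match the aliases shown on this record.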
Agent API
Verify this Pith Number yourself
curl -sH 'Accept: application/ld+json' https://pith.science/pith/7L2LCIDIL22PSJFQG4OCMYKWBU \
  | jq -c '.canonical_record' \
  | python3 -c "import sys,json,hashlib; b=json.dumps(json.loads(sys.stdin.read()), sort_keys=True, separators=(',',':'), ensure_ascii=False).encode(); print(hashlib.sha256(b).hexdigest())"
# expect: faf4b120685eb4f924b0371c2661560d14b994c3b8f88f9e5423bea254dd3710
Canonical record JSON
{
  "metadata": {
    "abstract_canon_sha256": "66fa1d6696b6cee9acb169649d8eae60a2ccf732088357b099f5377e1f284a88",
    "cross_cats_sorted": [],
    "license": "http://creativecommons.org/licenses/by-nc-sa/4.0/",
    "primary_cat": "cs.CV",
    "submitted_at": "2023-07-13T17:58:32Z",
    "title_canon_sha256": "ed56d583d0a3dc7471844fffe8f1ec2c462996e58f23d6c646e929ebea61ff5e"
  },
  "schema_version": "1.0",
  "source": {
    "id": "2307.06942",
    "kind": "arxiv",
    "version": 2
  }
}
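The same check can be run offline against the JSON above, using the serialization the verification command applies: sorted keys, compact separators, UTF-8 without ASCII escaping, then SHA-256. This is a sketch, and the function name `canonical_sha256` is hypothetical. It has not been confirmed here that the displayed JSON is byte-for-byte what the API returns, so compare the result with the canonical hash yourself instead of assuming a match.

```python
import hashlib
import json


def canonical_sha256(record: dict) -> str:
    """Serialize with sorted keys and compact separators, then hash as UTF-8."""
    payload = json.dumps(
        record, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    ).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


record = {
    "metadata": {
        "abstract_canon_sha256": "66fa1d6696b6cee9acb169649d8eae60a2ccf732088357b099f5377e1f284a88",
        "cross_cats_sorted": [],
        "license": "http://creativecommons.org/licenses/by-nc-sa/4.0/",
        "primary_cat": "cs.CV",
        "submitted_at": "2023-07-13T17:58:32Z",
        "title_canon_sha256": "ed56d583d0a3dc7471844fffe8f1ec2c462996e58f23d6c646e929ebea61ff5e",
    },
    "schema_version": "1.0",
    "source": {"id": "2307.06942", "kind": "arxiv", "version": 2},
}

digest = canonical_sha256(record)
print(digest)  # compare with the canonical hash listed on this record
```

Because keys are sorted before hashing, the digest does not depend on the order in which the fields are written.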