Pith Number

pith:7L2LCIDI

pith:2023:7L2LCIDIL22PSJFQG4OCMYKWBU

not attested · not anchored · not stored · refs resolved

InternVid: A Large-scale Video-Text Dataset for Multimodal Understanding and Generation

Conghui He, Guo Chen, Jiashuo Yu, Kunchang Li, Limin Wang, Ping Luo, Xinhao Li, Xin Ma, Xinyuan Chen, Yali Wang, Yaohui Wang, Yinan He, Yi Wang, Yizhuo Li, Yu Qiao, Ziwei Liu

A scalable LLM-based captioning method produces a 7-million-video dataset; models trained on it reach leading zero-shot action recognition.

arxiv:2307.06942 v2 · 2023-07-13 · cs.CV


Add to your LaTeX paper

\usepackage{pith}
\pithnumber{7L2LCIDIL22PSJFQG4OCMYKWBU}

Prints a linked badge after your title and injects PDF metadata. Compiles on arXiv.

Record completeness

1 Bitcoin timestamp

2 Internet Archive

3 Author claim open

4 Citations open

5 Replications open

✓ Portable graph bundle live

The bundle contains the canonical record plus signed events. A mirror can host it anywhere and recompute the same current state with the deterministic merge algorithm.
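The record does not spell out the merge algorithm, so the following is only a minimal sketch of what "deterministic" can mean here: the resulting state depends on the set of events, not on the order in which a mirror received them. The `Event` fields, the `merge` function, the last-writer-wins rule and the sample events are all hypothetical and are not taken from the Pith schema.

```python
from dataclasses import dataclass


# Hypothetical event shape; the real bundle schema is not shown in this record.
@dataclass(frozen=True)
class Event:
    event_id: str   # assumed to be unique per event
    timestamp: str  # assumed ISO-8601 UTC string, so it sorts lexicographically
    field: str      # assumed: the part of the state this event sets
    value: str


def merge(canonical_record: dict, events: list[Event]) -> dict:
    """Fold events into a state without depending on arrival order."""
    # Drop duplicate deliveries (assumes equal ids imply equal content).
    unique = {e.event_id: e for e in events}
    # Total order: timestamp first, event id as a tie-breaker.
    ordered = sorted(unique.values(), key=lambda e: (e.timestamp, e.event_id))
    state = {"record": canonical_record, "fields": {}}
    for e in ordered:
        state["fields"][e.field] = e.value  # last writer wins (an assumption)
    return state


record = {"source": {"id": "2307.06942", "kind": "arxiv", "version": 2}}
# Made-up events, for illustration only.
events = [
    Event("e1", "2026-05-17T23:38:53Z", "author_claim", "open"),
    Event("e2", "2026-05-18T00:00:00Z", "author_claim", "claimed"),
]

mirror_a = merge(record, events)
# A second mirror sees the same events reversed, with one duplicate delivery.
mirror_b = merge(record, list(reversed(events)) + events[:1])
print(mirror_a == mirror_b)
```

Sorting on a total order before folding is one simple way to get this property; a real implementation would also need to verify event signatures, which this sketch leaves out.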

Claims

C1 · strongest claim

Trained on InternVid via contrastive learning, the ViCLIP model demonstrates leading zero-shot action recognition and competitive video retrieval performance.

C2 · weakest assumption

The multi-scale LLM-generated descriptions are sufficiently accurate and diverse to produce transferable video-text representations without introducing systematic biases or hallucinations that degrade downstream performance.

C3 · one-line summary

InternVid supplies 7M videos and LLM captions to train ViCLIP, which reaches leading zero-shot action recognition and competitive retrieval performance.

References

82 extracted · 82 resolved · 13 Pith anchors

[1] Language models are few-shot learners (2020)

[2] HowTo100M: Learning a text-video embedding by watching hundred million narrated video clips (2019)

[3] Advancing high-resolution video-language representation with large-scale video transcriptions (2022)

[4] MERLOT: Multimodal neural script knowledge models (2021)

[5] MERLOT Reserve: Neural script knowledge through vision and language and sound (2022)

Formal links

2 machine-checked theorem links

Cited by

37 papers in Pith

EM-Vid: Training-Free Entity-Centric Memory for Efficient and Consistent Multi-Shot Video Generation

DOLLAR: Few-Step Video Generation via Distillation and Latent Reward Optimization

VideoPhy: Evaluating Physical Commonsense for Video Generation

HAVEN: Hierarchically Aligned Multimodal Benchmark for Unified Video Understanding

Aero-World: Action-Conditioned Aerial Video Generation from Inertial Controls

Receipt and verification

First computed	2026-05-17T23:38:53.259393Z
Builder	pith-number-builder-2026-05-17-v1
Signature	Pith Ed25519 (`pith-v1-2026-05`)
Schema	pith-number/v1.0

Canonical hash

faf4b120685eb4f924b0371c2661560d14b994c3b8f88f9e5423bea254dd3710
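For this record, the 26-character identifier coincides with the unpadded base32 encoding (RFC 4648 alphabet) of the first 16 bytes of the canonical hash. Whether that is how the scheme defines the identifier is an assumption drawn from this single record; the helper name `base32_prefix` is illustrative.

```python
import base64

# Both values are copied from this record.
canonical_hash = "faf4b120685eb4f924b0371c2661560d14b994c3b8f88f9e5423bea254dd3710"
pith_number = "7L2LCIDIL22PSJFQG4OCMYKWBU"


def base32_prefix(hex_digest: str, n_bytes: int = 16) -> str:
    """Base32-encode the first n_bytes of a hex digest, without '=' padding."""
    raw = bytes.fromhex(hex_digest)[:n_bytes]
    return base64.b32encode(raw).decode("ascii").rstrip("=")


derived = base32_prefix(canonical_hash)
print(derived)
print(derived == pith_number)  # True for this record
```

If the relationship holds in general, the identifier can be checked against the hash without any network access.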

Aliases

arxiv: 2307.06942 · arxiv_version: 2307.06942v2 · doi: 10.48550/arxiv.2307.06942 · pith_short_12: 7L2LCIDIL22P · pith_short_16: 7L2LCIDIL22PSJFQ · pith_short_8: 7L2LCIDI
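The aliases above look mechanically derivable: each `pith_short_N` is the first N characters of the full identifier, and the arXiv-based aliases follow from the source id and version. A minimal sketch under that reading; the function name `build_aliases` is hypothetical and the derivation rules are inferred from this one record, not from a published specification.

```python
def build_aliases(pith_number: str, arxiv_id: str, version: int) -> dict:
    """Rebuild the alias set from the full identifier and the arXiv source."""
    aliases = {
        "arxiv": arxiv_id,
        "arxiv_version": f"{arxiv_id}v{version}",
        "doi": f"10.48550/arxiv.{arxiv_id}",
    }
    # Short forms are plain prefixes of the full identifier.
    for length in (8, 12, 16):
        aliases[f"pith_short_{length}"] = pith_number[:length]
    return aliases


# Alias values as listed in this record.
listed = {
    "arxiv": "2307.06942",
    "arxiv_version": "2307.06942v2",
    "doi": "10.48550/arxiv.2307.06942",
    "pith_short_12": "7L2LCIDIL22P",
    "pith_short_16": "7L2LCIDIL22PSJFQ",
    "pith_short_8": "7L2LCIDI",
}

rebuilt = build_aliases("7L2LCIDIL22PSJFQG4OCMYKWBU", "2307.06942", 2)
print(rebuilt == listed)  # True for this record
```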

Agent API

Resolver JSON · Graph JSON · Events JSON · Schema · Signing key

Verify this Pith Number yourself

curl -sH 'Accept: application/ld+json' https://pith.science/pith/7L2LCIDIL22PSJFQG4OCMYKWBU \
  | jq -c '.canonical_record' \
  | python3 -c "import sys,json,hashlib; b=json.dumps(json.loads(sys.stdin.read()), sort_keys=True, separators=(',',':'), ensure_ascii=False).encode(); print(hashlib.sha256(b).hexdigest())"
# expect: faf4b120685eb4f924b0371c2661560d14b994c3b8f88f9e5423bea254dd3710

Canonical record JSON

{
  "metadata": {
    "abstract_canon_sha256": "66fa1d6696b6cee9acb169649d8eae60a2ccf732088357b099f5377e1f284a88",
    "cross_cats_sorted": [],
    "license": "http://creativecommons.org/licenses/by-nc-sa/4.0/",
    "primary_cat": "cs.CV",
    "submitted_at": "2023-07-13T17:58:32Z",
    "title_canon_sha256": "ed56d583d0a3dc7471844fffe8f1ec2c462996e58f23d6c646e929ebea61ff5e"
  },
  "schema_version": "1.0",
  "source": {
    "id": "2307.06942",
    "kind": "arxiv",
    "version": 2
  }
}
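The same canonicalization as the one-liner above can be run offline against the record as printed here. This is a sketch: the function name `canonical_sha256` is illustrative, and a match is only expected if the text above is the same JSON that the resolver serves.

```python
import hashlib
import json


def canonical_sha256(record: dict) -> str:
    """Serialize with sorted keys and compact separators, then hash as UTF-8."""
    payload = json.dumps(
        record, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    ).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


# Canonical record copied from this page.
record = {
    "metadata": {
        "abstract_canon_sha256": "66fa1d6696b6cee9acb169649d8eae60a2ccf732088357b099f5377e1f284a88",
        "cross_cats_sorted": [],
        "license": "http://creativecommons.org/licenses/by-nc-sa/4.0/",
        "primary_cat": "cs.CV",
        "submitted_at": "2023-07-13T17:58:32Z",
        "title_canon_sha256": "ed56d583d0a3dc7471844fffe8f1ec2c462996e58f23d6c646e929ebea61ff5e",
    },
    "schema_version": "1.0",
    "source": {"id": "2307.06942", "kind": "arxiv", "version": 2},
}

expected = "faf4b120685eb4f924b0371c2661560d14b994c3b8f88f9e5423bea254dd3710"
digest = canonical_sha256(record)
print(digest)
print("match" if digest == expected else "mismatch")
```

Because the keys are sorted before hashing, the digest does not depend on the order in which a client happens to store or print the fields.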