Pith Number

pith:M44AZR7K

pith:2024:M44AZR7KUISSMYFGAT2PSYYPBM
not attested · not anchored · not stored · refs resolved

OpenVid-1M: A Large-Scale High-Quality Dataset for Text-to-video Generation

Jian Yang, Kepan Nan, Penghao Zhou, Rui Xie, Tiehan Fan, Xiang Li, Ying Tai, Zhenheng Yang, Zhijie Chen

OpenVid-1M supplies over a million precise text-video pairs with expressive captions to improve text-to-video generation.

arxiv:2407.02371 v3 · 2024-07-02 · cs.CV

Add to your LaTeX paper
\usepackage{pith}
\pithnumber{M44AZR7KUISSMYFGAT2PSYYPBM}

Prints a linked badge after your title and injects PDF metadata. Compiles on arXiv.

Record completeness

1 Bitcoin timestamp
2 Internet Archive
3 Author claim open
4 Citations open
5 Replications open
Portable graph bundle live
The bundle contains the canonical record plus signed events. A mirror can host it anywhere and recompute the same current state with the deterministic merge algorithm.

Claims

C1 · strongest claim

we introduce OpenVid-1M, a precise high-quality dataset with expressive captions. This open-scenario dataset contains over 1 million text-video pairs, facilitating research on T2V generation. Furthermore, we curate 433K 1080p videos from OpenVid-1M to create OpenVidHD-0.4M... Additionally, we propose a novel Multi-modal Video Diffusion Transformer (MVDiT) capable of mining both structure information from visual tokens and semantic information from text tokens.

C2 · weakest assumption

That the newly collected videos and captions are verifiably higher quality and more precise than prior datasets such as WebVid-10M and Panda-70M, and that the MVDiT architecture delivers measurable gains attributable to its joint structure-semantic processing rather than other training factors.

C3 · one-line summary

OpenVid-1M supplies over 1 million high-quality text-video pairs and introduces MVDiT, which improves text-to-video generation by drawing on both structure information from visual tokens and semantic information from text tokens.

References

16 extracted · 16 resolved · 7 Pith anchors

[1] Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets · arXiv:2311.15127
[2] VideoCrafter1: Open Diffusion Models for High-Quality Video Generation · arXiv:2310.19512
[3] Adam: A Method for Stochastic Optimization · arXiv:1412.6980
[4] arXiv preprint (2023) · arXiv:2310.11440
[5] Latte: Latent Diffusion Transformer for Video Generation · arXiv:2401.03048

Formal links

2 machine-checked theorem links

Cited by

35 papers in Pith

Receipt and verification
First computed 2026-05-17T23:39:21.816981Z
Builder pith-number-builder-2026-05-17-v1
Signature Pith Ed25519 (pith-v1-2026-05) · public key
Schema pith-number/v1.0

Canonical hash

67380cc7eaa2252660a604f4f9630f0b3c355564591318ef7194b0b8e63d550c

Aliases

arxiv: 2407.02371 · arxiv_version: 2407.02371v3 · doi: 10.48550/arxiv.2407.02371 · pith_short_12: M44AZR7KUISS · pith_short_16: M44AZR7KUISSMYFG · pith_short_8: M44AZR7K
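The identifiers on this page appear to be related by simple encoding and truncation. The sketch below is an inference from the values shown here, not a published specification: it assumes the 26-character Pith Number is the unpadded Base32 encoding of the first 16 bytes of the canonical SHA-256 hash, and that each `pith_short_N` alias is the first N characters of that number. The function names are hypothetical.

```python
import base64

def pith_number_from_hash(canonical_hash_hex: str) -> str:
    """Assumed derivation: Base32 of the first 16 digest bytes, padding stripped."""
    digest = bytes.fromhex(canonical_hash_hex)
    return base64.b32encode(digest[:16]).decode("ascii").rstrip("=")

def short_aliases(pith_number: str, lengths=(8, 12, 16)) -> dict:
    """Assumed derivation: each short alias is a prefix of the full number."""
    return {f"pith_short_{n}": pith_number[:n] for n in lengths}

# Canonical hash as listed on this page.
CANONICAL_HASH = "67380cc7eaa2252660a604f4f9630f0b3c355564591318ef7194b0b8e63d550c"

pith_number = pith_number_from_hash(CANONICAL_HASH)
aliases = short_aliases(pith_number)

print(pith_number)
for name, value in aliases.items():
    print(name, value)
```

Under these assumptions the output reproduces the number and the three short aliases listed above, which suggests a short alias can be checked against a canonical hash without contacting the service.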
Agent API
Verify this Pith Number yourself
curl -sH 'Accept: application/ld+json' https://pith.science/pith/M44AZR7KUISSMYFGAT2PSYYPBM \
  | jq -c '.canonical_record' \
  | python3 -c "import sys,json,hashlib; b=json.dumps(json.loads(sys.stdin.read()), sort_keys=True, separators=(',',':'), ensure_ascii=False).encode(); print(hashlib.sha256(b).hexdigest())"
# expect: 67380cc7eaa2252660a604f4f9630f0b3c355564591318ef7194b0b8e63d550c
Canonical record JSON
{
  "metadata": {
    "abstract_canon_sha256": "230a2191ea85b2201b99bf2b8f086ab595e36b5159b23b660a63a7a64b90a4e2",
    "cross_cats_sorted": [],
    "license": "http://arxiv.org/licenses/nonexclusive-distrib/1.0/",
    "primary_cat": "cs.CV",
    "submitted_at": "2024-07-02T15:40:29Z",
    "title_canon_sha256": "6674247ef4e27bb49c2ec829d0b8e94091ddb7195d8285ce828c482c1465f25f"
  },
  "schema_version": "1.0",
  "source": {
    "id": "2407.02371",
    "kind": "arxiv",
    "version": 3
  }
}
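The verification one-liner in the Agent API section can be unpacked into a small offline script. This is a minimal sketch that applies the same serialization the one-liner uses (sorted keys, compact separators, non-ASCII characters left unescaped, UTF-8 bytes) to the canonical record printed above. The function names are hypothetical, and whether the digest matches depends on the record being transcribed byte-for-byte, so the script reports the comparison rather than assuming it.

```python
import hashlib
import json

def canonical_bytes(record: dict) -> bytes:
    """Serialize exactly as the page's one-liner does before hashing."""
    text = json.dumps(record, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
    return text.encode("utf-8")

def canonical_sha256(record: dict) -> str:
    """Hex SHA-256 digest of the canonical serialization."""
    return hashlib.sha256(canonical_bytes(record)).hexdigest()

# Canonical record copied from this page.
RECORD = {
    "metadata": {
        "abstract_canon_sha256": "230a2191ea85b2201b99bf2b8f086ab595e36b5159b23b660a63a7a64b90a4e2",
        "cross_cats_sorted": [],
        "license": "http://arxiv.org/licenses/nonexclusive-distrib/1.0/",
        "primary_cat": "cs.CV",
        "submitted_at": "2024-07-02T15:40:29Z",
        "title_canon_sha256": "6674247ef4e27bb49c2ec829d0b8e94091ddb7195d8285ce828c482c1465f25f",
    },
    "schema_version": "1.0",
    "source": {"id": "2407.02371", "kind": "arxiv", "version": 3},
}

# Digest the page says to expect.
EXPECTED = "67380cc7eaa2252660a604f4f9630f0b3c355564591318ef7194b0b8e63d550c"

digest = canonical_sha256(RECORD)
print("computed:", digest)
print("expected:", EXPECTED)
print("match:", digest == EXPECTED)
```

Sorting keys and fixing the separators makes the byte string independent of the key order in which the JSON was received, which is what lets any mirror recompute the same hash.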