{"record_type":"pith_number_record","schema_url":"https://pith.science/schemas/pith-number/v1.json","pith_number":"pith:2024:M44AZR7KUISSMYFGAT2PSYYPBM","short_pith_number":"pith:M44AZR7K","schema_version":"1.0","canonical_sha256":"67380cc7eaa2252660a604f4f9630f0b3c355564591318ef7194b0b8e63d550c","source":{"kind":"arxiv","id":"2407.02371","version":3},"attestation_state":"computed","paper":{"title":"OpenVid-1M: A Large-Scale High-Quality Dataset for Text-to-video Generation","license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","headline":"OpenVid-1M supplies over a million precise text-video pairs with expressive captions to improve text-to-video generation.","cross_cats":[],"primary_cat":"cs.CV","authors_text":"Jian Yang, Kepan Nan, Penghao Zhou, Rui Xie, Tiehan Fan, Xiang Li, Ying Tai, Zhenheng Yang, Zhijie Chen","submitted_at":"2024-07-02T15:40:29Z","abstract_excerpt":"Text-to-video (T2V) generation has recently garnered significant attention thanks to the large multi-modality model Sora. However, T2V generation still faces two important challenges: 1) Lacking a precise open sourced high-quality dataset. The previous popular video datasets, e.g. WebVid-10M and Panda-70M, are either with low quality or too large for most research institutions. Therefore, it is challenging but crucial to collect a precise high-quality text-video pairs for T2V generation. 2) Ignoring to fully utilize textual information. Recent T2V methods have focused on vision transformers, u"},"verification_status":{"content_addressed":true,"pith_receipt":true,"author_attested":false,"weak_author_claims":0,"strong_author_claims":0,"externally_anchored":false,"storage_verified":false,"citation_signatures":0,"replication_records":0,"graph_snapshot":true,"references_resolved":true,"formal_links_present":true},"canonical_record":{"source":{"id":"2407.02371","kind":"arxiv","version":3},"metadata":{"license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","primary_cat":"cs.CV","submitted_at":"2024-07-02T15:40:29Z","cross_cats_sorted":[],"title_canon_sha256":"6674247ef4e27bb49c2ec829d0b8e94091ddb7195d8285ce828c482c1465f25f","abstract_canon_sha256":"230a2191ea85b2201b99bf2b8f086ab595e36b5159b23b660a63a7a64b90a4e2"},"schema_version":"1.0"},"receipt":{"kind":"pith_receipt","key_id":"pith-v1-2026-05","algorithm":"ed25519","signed_at":"2026-05-17T23:39:21.817581Z","signature_b64":"CJgUeDV3hrFCFCcPD4TK2s84nlSjO0SHjDgwC3ocEP6MEztH/iDXRBhqAJsoi4Sk21+w1KWbXakcIUM/U2m3Bw==","signed_message":"canonical_sha256_bytes","builder_version":"pith-number-builder-2026-05-17-v1","receipt_version":"0.3","canonical_sha256":"67380cc7eaa2252660a604f4f9630f0b3c355564591318ef7194b0b8e63d550c","last_reissued_at":"2026-05-17T23:39:21.816981Z","signature_status":"signed_v1","first_computed_at":"2026-05-17T23:39:21.816981Z","public_key_fingerprint":"8d4b5ee74e4693bcd1df2446408b0d54"},"graph_snapshot":{"paper":{"title":"OpenVid-1M: A Large-Scale High-Quality Dataset for Text-to-video Generation","license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","headline":"OpenVid-1M supplies over a million precise text-video pairs with expressive captions to improve text-to-video generation.","cross_cats":[],"primary_cat":"cs.CV","authors_text":"Jian Yang, Kepan Nan, Penghao Zhou, Rui Xie, Tiehan Fan, Xiang Li, Ying Tai, Zhenheng Yang, Zhijie Chen","submitted_at":"2024-07-02T15:40:29Z","abstract_excerpt":"Text-to-video (T2V) generation has recently garnered significant attention thanks to the large multi-modality model Sora. However, T2V generation still faces two important challenges: 1) Lacking a precise open sourced high-quality dataset. The previous popular video datasets, e.g. WebVid-10M and Panda-70M, are either with low quality or too large for most research institutions. Therefore, it is challenging but crucial to collect a precise high-quality text-video pairs for T2V generation. 2) Ignoring to fully utilize textual information. Recent T2V methods have focused on vision transformers, u"},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"we introduce OpenVid-1M, a precise high-quality dataset with expressive captions. This open-scenario dataset contains over 1 million text-video pairs, facilitating research on T2V generation. Furthermore, we curate 433K 1080p videos from OpenVid-1M to create OpenVidHD-0.4M... Additionally, we propose a novel Multi-modal Video Diffusion Transformer (MVDiT) capable of mining both structure information from visual tokens and semantic information from text tokens.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"That the newly collected videos and captions are verifiably higher quality and more precise than prior datasets such as WebVid-10M and Panda-70M, and that the MVDiT architecture delivers measurable gains attributable to its joint structure-semantic processing rather than other training factors.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"OpenVid-1M supplies 1 million high-quality text-video pairs and introduces MVDiT to improve text-to-video generation by better using both visual structure and text semantics.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"OpenVid-1M supplies over a million precise text-video pairs with expressive captions to improve text-to-video generation.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"d1e5a278a927f2f0a14fb25bf3a2e912f56c98493e2af206dcb0108d0b413303"},"source":{"id":"2407.02371","kind":"arxiv","version":3},"verdict":{"id":"7a7d6e26-b04b-4512-b9c3-177d8561f47c","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-14T20:30:33.458442Z","strongest_claim":"we introduce OpenVid-1M, a precise high-quality dataset with expressive captions. This open-scenario dataset contains over 1 million text-video pairs, facilitating research on T2V generation. Furthermore, we curate 433K 1080p videos from OpenVid-1M to create OpenVidHD-0.4M... Additionally, we propose a novel Multi-modal Video Diffusion Transformer (MVDiT) capable of mining both structure information from visual tokens and semantic information from text tokens.","one_line_summary":"OpenVid-1M supplies 1 million high-quality text-video pairs and introduces MVDiT to improve text-to-video generation by better using both visual structure and text semantics.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"That the newly collected videos and captions are verifiably higher quality and more precise than prior datasets such as WebVid-10M and Panda-70M, and that the MVDiT architecture delivers measurable gains attributable to its joint structure-semantic processing rather than other training factors.","pith_extraction_headline":"OpenVid-1M supplies over a million precise text-video pairs with expressive captions to improve text-to-video generation."},"references":{"count":16,"sample":[{"doi":"","year":null,"title":"Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets","work_id":"4f68eada-27e3-437a-a2fe-6e4ca524d0d3","ref_index":1,"cited_arxiv_id":"2311.15127","is_internal_anchor":true},{"doi":"","year":null,"title":"VideoCrafter1: Open Diffusion Models for High-Quality Video Generation","work_id":"4d4486c5-6317-4d8d-bb5b-3b100d732a83","ref_index":2,"cited_arxiv_id":"2310.19512","is_internal_anchor":true},{"doi":"","year":null,"title":"Adam: A Method for Stochastic Optimization","work_id":"1910796d-9b52-4683-bf5c-de9632c1028b","ref_index":3,"cited_arxiv_id":"1412.6980","is_internal_anchor":true},{"doi":"","year":2024,"title":"arXiv preprint arXiv:2310.11440 (2023) 2, 4","work_id":"7e755299-7217-4af5-a711-0d29a31768fb","ref_index":4,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":null,"title":"Latte: Latent Diffusion Transformer for Video Generation","work_id":"5328e907-7278-4781-a2bb-c5ef40dc87fb","ref_index":5,"cited_arxiv_id":"2401.03048","is_internal_anchor":true}],"resolved_work":16,"snapshot_sha256":"97813cc35b2ce4c32c795a3b2aba207cca435a803e2588e68f6a894280c6688f","internal_anchors":7},"formal_canon":{"evidence_count":2,"snapshot_sha256":"8592a80c8d210d691955b92044057aa2dcc2f948421406dcfe90d9009d355ee8"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"},"aliases":[{"alias_kind":"arxiv","alias_value":"2407.02371","created_at":"2026-05-17T23:39:21.817076+00:00"},{"alias_kind":"arxiv_version","alias_value":"2407.02371v3","created_at":"2026-05-17T23:39:21.817076+00:00"},{"alias_kind":"doi","alias_value":"10.48550/arxiv.2407.02371","created_at":"2026-05-17T23:39:21.817076+00:00"},{"alias_kind":"pith_short_12","alias_value":"M44AZR7KUISS","created_at":"2026-05-18T12:33:37.589309+00:00"},{"alias_kind":"pith_short_16","alias_value":"M44AZR7KUISSMYFG","created_at":"2026-05-18T12:33:37.589309+00:00"},{"alias_kind":"pith_short_8","alias_value":"M44AZR7K","created_at":"2026-05-18T12:33:37.589309+00:00"}],"events":[],"event_summary":{},"paper_claims":[],"inbound_citations":{"count":35,"internal_anchor_count":35,"sample":[{"citing_arxiv_id":"2605.23878","citing_title":"LaMo: Self-Supervised Latent Motion Priors for Physical Realism in Video Generation","ref_index":16,"is_internal_anchor":true},{"citing_arxiv_id":"2605.23518","citing_title":"VINS-120K: Ultra High-Resolution Image Editing with A Large-Scale Dataset","ref_index":32,"is_internal_anchor":true},{"citing_arxiv_id":"2605.20659","citing_title":"RoPeSLR: 3D RoPE-driven Sparse-LowRank Attention for Efficient Diffusion Transformers","ref_index":15,"is_internal_anchor":true},{"citing_arxiv_id":"2605.17248","citing_title":"Image-to-Video Diffusion: From Foundations to Open Frontiers","ref_index":100,"is_internal_anchor":true},{"citing_arxiv_id":"2605.17260","citing_title":"LiteFrame: Efficient Vision Encoders Unlock Frame Scaling in Video LLMs","ref_index":6,"is_internal_anchor":true},{"citing_arxiv_id":"2605.17543","citing_title":"HL-OutPaint: Coarse-to-Fine Video Outpainting for High-Resolution Long-Range Videos","ref_index":71,"is_internal_anchor":true},{"citing_arxiv_id":"2605.18365","citing_title":"GeoFlow: Enforcing Implicit Geometric Consistency in Video Generation","ref_index":49,"is_internal_anchor":true},{"citing_arxiv_id":"2605.19242","citing_title":"PhyWorld: Physics-Faithful World Model for Video Generation","ref_index":56,"is_internal_anchor":true},{"citing_arxiv_id":"2512.00336","citing_title":"MVAD: A Benchmark Dataset for Multimodal AI-Generated Video-Audio Detection","ref_index":35,"is_internal_anchor":true},{"citing_arxiv_id":"2409.04429","citing_title":"VILA-U: a Unified Foundation Model Integrating Visual Understanding and Generation","ref_index":15,"is_internal_anchor":true},{"citing_arxiv_id":"2509.18154","citing_title":"MiniCPM-V 4.5: Cooking Efficient MLLMs via Architecture, Data, and Training Recipe","ref_index":25,"is_internal_anchor":true},{"citing_arxiv_id":"2603.09283","citing_title":"From Ideal to Real: Stable Video Object Removal under Imperfect Conditions","ref_index":24,"is_internal_anchor":true},{"citing_arxiv_id":"2605.12119","citing_title":"MoCam: Unified Novel View Synthesis via Structured Denoising Dynamics","ref_index":28,"is_internal_anchor":true},{"citing_arxiv_id":"2605.12957","citing_title":"GTA: Advancing Image-to-3D World Generation via Geometry Then Appearance Video Diffusion","ref_index":23,"is_internal_anchor":true},{"citing_arxiv_id":"2605.12119","citing_title":"MoCam: Unified Novel View Synthesis via Structured Denoising Dynamics","ref_index":28,"is_internal_anchor":true},{"citing_arxiv_id":"2506.15564","citing_title":"Show-o2: Improved Native Unified Multimodal Models","ref_index":80,"is_internal_anchor":true},{"citing_arxiv_id":"2605.06356","citing_title":"SwiftI2V: Efficient High-Resolution Image-to-Video Generation via Conditional Segment-wise Generation","ref_index":14,"is_internal_anchor":true},{"citing_arxiv_id":"2604.23789","citing_title":"MuSS: A Large-Scale Dataset and Cinematic Narrative Benchmark for Multi-Shot Subject-to-Video Generation","ref_index":28,"is_internal_anchor":true},{"citing_arxiv_id":"2604.24762","citing_title":"OmniShotCut: Holistic Relational Shot Boundary Detection with Shot-Query Transformer","ref_index":25,"is_internal_anchor":true},{"citing_arxiv_id":"2604.23789","citing_title":"MuSS: A Large-Scale Dataset and Cinematic Narrative Benchmark for Multi-Shot Subject-to-Video Generation","ref_index":28,"is_internal_anchor":true},{"citing_arxiv_id":"2605.06356","citing_title":"SwiftI2V: Efficient High-Resolution Image-to-Video Generation via Conditional Segment-wise Generation","ref_index":14,"is_internal_anchor":true},{"citing_arxiv_id":"2604.21931","citing_title":"Seeing Fast and Slow: Learning the Flow of Time in Videos","ref_index":39,"is_internal_anchor":true},{"citing_arxiv_id":"2604.12270","citing_title":"DreamStereo: Towards Real-Time Stereo Inpainting for HD Videos","ref_index":19,"is_internal_anchor":true},{"citing_arxiv_id":"2604.11627","citing_title":"POINTS-Long: Adaptive Dual-Mode Visual Reasoning in MLLMs","ref_index":61,"is_internal_anchor":true},{"citing_arxiv_id":"2604.16479","citing_title":"Latent-Compressed Variational Autoencoder for Video Diffusion Models","ref_index":32,"is_internal_anchor":true}]},"formal_canon":{"evidence_count":2,"sample":[],"anchors":[]},"links":{"html":"https://pith.science/pith/M44AZR7KUISSMYFGAT2PSYYPBM","json":"https://pith.science/pith/M44AZR7KUISSMYFGAT2PSYYPBM.json","graph_json":"https://pith.science/api/pith-number/M44AZR7KUISSMYFGAT2PSYYPBM/graph.json","events_json":"https://pith.science/api/pith-number/M44AZR7KUISSMYFGAT2PSYYPBM/events.json","paper":"https://pith.science/paper/M44AZR7K"},"agent_actions":{"view_html":"https://pith.science/pith/M44AZR7KUISSMYFGAT2PSYYPBM","download_json":"https://pith.science/pith/M44AZR7KUISSMYFGAT2PSYYPBM.json","view_paper":"https://pith.science/paper/M44AZR7K","resolve_alias":"https://pith.science/api/pith-number/resolve?arxiv=2407.02371&json=true","fetch_graph":"https://pith.science/api/pith-number/M44AZR7KUISSMYFGAT2PSYYPBM/graph.json","fetch_events":"https://pith.science/api/pith-number/M44AZR7KUISSMYFGAT2PSYYPBM/events.json","actions":{"anchor_timestamp":"https://pith.science/pith/M44AZR7KUISSMYFGAT2PSYYPBM/action/timestamp_anchor","attest_storage":"https://pith.science/pith/M44AZR7KUISSMYFGAT2PSYYPBM/action/storage_attestation","attest_author":"https://pith.science/pith/M44AZR7KUISSMYFGAT2PSYYPBM/action/author_attestation","sign_citation":"https://pith.science/pith/M44AZR7KUISSMYFGAT2PSYYPBM/action/citation_signature","submit_replication":"https://pith.science/pith/M44AZR7KUISSMYFGAT2PSYYPBM/action/replication_record"}},"created_at":"2026-05-17T23:39:21.817076+00:00","updated_at":"2026-05-17T23:39:21.817076+00:00"}