{"record_type":"pith_number_record","schema_url":"https://pith.science/schemas/pith-number/v1.json","pith_number":"pith:2022:U52T37GUSXOPBDROGA4KV2MDWC","short_pith_number":"pith:U52T37GU","schema_version":"1.0","canonical_sha256":"a7753dfcd495dcf08e2e3038aae983b0816e34f4ab1fadbd1e3ba9fe6640db33","source":{"kind":"arxiv","id":"2211.13221","version":2},"attestation_state":"computed","paper":{"title":"Latent Video Diffusion Models for High-Fidelity Long Video Generation","license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","headline":"Video diffusion models shift to a low-dimensional 3D latent space to generate realistic clips longer than 1000 frames with modest compute.","cross_cats":["cs.AI"],"primary_cat":"cs.CV","authors_text":"Qifeng Chen, Tianyu Yang, Yingqing He, Ying Shan, Yong Zhang","submitted_at":"2022-11-23T18:58:39Z","abstract_excerpt":"AI-generated content has attracted lots of attention recently, but photo-realistic video synthesis is still challenging. Although many attempts using GANs and autoregressive models have been made in this area, the visual quality and length of generated videos are far from satisfactory. Diffusion models have shown remarkable results recently but require significant computational resources. 
To address this, we introduce lightweight video diffusion models by leveraging a low-dimensional 3D latent space, significantly outperforming previous pixel-space video diffusion models under a limited comput"},"verification_status":{"content_addressed":true,"pith_receipt":true,"author_attested":false,"weak_author_claims":0,"strong_author_claims":0,"externally_anchored":false,"storage_verified":false,"citation_signatures":0,"replication_records":0,"graph_snapshot":true,"references_resolved":true,"formal_links_present":true},"canonical_record":{"source":{"id":"2211.13221","kind":"arxiv","version":2},"metadata":{"license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","primary_cat":"cs.CV","submitted_at":"2022-11-23T18:58:39Z","cross_cats_sorted":["cs.AI"],"title_canon_sha256":"c6faa78873c360f2d65fa170a921710b4b4f23535ada711afe37d86bc2dc53c3","abstract_canon_sha256":"6dbcccf4bb7c02fbfe9928bc7b713e502af96cf6459bfa583f91f5be25e07262"},"schema_version":"1.0"},"receipt":{"kind":"pith_receipt","key_id":"pith-v1-2026-05","algorithm":"ed25519","signed_at":"2026-05-17T23:38:53.535523Z","signature_b64":"yT6ImFDoouJrc2mgCZ2/nfIBMtYCfhCfOUECi6le5WHghj5sNarndgMaB7NaHGYGzcoT6xthMEcM0vT+hIpBAQ==","signed_message":"canonical_sha256_bytes","builder_version":"pith-number-builder-2026-05-17-v1","receipt_version":"0.3","canonical_sha256":"a7753dfcd495dcf08e2e3038aae983b0816e34f4ab1fadbd1e3ba9fe6640db33","last_reissued_at":"2026-05-17T23:38:53.534898Z","signature_status":"signed_v1","first_computed_at":"2026-05-17T23:38:53.534898Z","public_key_fingerprint":"8d4b5ee74e4693bcd1df2446408b0d54"},"graph_snapshot":{"paper":{"title":"Latent Video Diffusion Models for High-Fidelity Long Video Generation","license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","headline":"Video diffusion models shift to a low-dimensional 3D latent space to generate realistic clips longer than 1000 frames with modest compute.","cross_cats":["cs.AI"],"primary_cat":"cs.CV","authors_text":"Qifeng 
Chen, Tianyu Yang, Yingqing He, Ying Shan, Yong Zhang","submitted_at":"2022-11-23T18:58:39Z","abstract_excerpt":"AI-generated content has attracted lots of attention recently, but photo-realistic video synthesis is still challenging. Although many attempts using GANs and autoregressive models have been made in this area, the visual quality and length of generated videos are far from satisfactory. Diffusion models have shown remarkable results recently but require significant computational resources. To address this, we introduce lightweight video diffusion models by leveraging a low-dimensional 3D latent space, significantly outperforming previous pixel-space video diffusion models under a limited comput"},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"we introduce lightweight video diffusion models by leveraging a low-dimensional 3D latent space, significantly outperforming previous pixel-space video diffusion models under a limited computational budget... hierarchical diffusion in the latent space such that longer videos with more than one thousand frames can be produced... 
conditional latent perturbation and unconditional guidance that effectively mitigate the accumulated errors during the extension of video length.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"The low-dimensional 3D latent space preserves sufficient spatial-temporal detail for high-fidelity generation, and the added perturbation and guidance steps prevent error accumulation without introducing new artifacts or inconsistencies.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"Latent-space hierarchical diffusion models with targeted error-correction techniques generate realistic videos exceeding 1000 frames while using less compute than prior pixel-space approaches.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"Video diffusion models shift to a low-dimensional 3D latent space to generate realistic clips longer than 1000 frames with modest compute.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"414298ca8caf73e6cf6215dd13e84fccb4e6b26da7f47dfa0449112068a1dd6e"},"source":{"id":"2211.13221","kind":"arxiv","version":2},"verdict":{"id":"aa27081a-e0f9-485f-ada9-264af80c1216","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-15T04:23:42.819005Z","strongest_claim":"we introduce lightweight video diffusion models by leveraging a low-dimensional 3D latent space, significantly outperforming previous pixel-space video diffusion models under a limited computational budget... hierarchical diffusion in the latent space such that longer videos with more than one thousand frames can be produced... 
conditional latent perturbation and unconditional guidance that effectively mitigate the accumulated errors during the extension of video length.","one_line_summary":"Latent-space hierarchical diffusion models with targeted error-correction techniques generate realistic videos exceeding 1000 frames while using less compute than prior pixel-space approaches.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"The low-dimensional 3D latent space preserves sufficient spatial-temporal detail for high-fidelity generation, and the added perturbation and guidance steps prevent error accumulation without introducing new artifacts or inconsistencies.","pith_extraction_headline":"Video diffusion models shift to a low-dimensional 3D latent space to generate realistic clips longer than 1000 frames with modest compute."},"references":{"count":48,"sample":[{"doi":"","year":2019,"title":"Large scale GAN training for high fidelity natural image synthesis","work_id":"af262cdc-3bc5-47cd-8871-360f893535c0","ref_index":1,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2022,"title":"Generating long videos of dynamic scenes","work_id":"443f47a8-d87d-4758-80b8-84e5a8c0f8b4","ref_index":2,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2021,"title":"Hierarchical video generation for complex data","work_id":"70163995-e878-4b3a-a64e-14bf55c851be","ref_index":3,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2021,"title":"Diffusion models beat gans on image synthesis","work_id":"9f6d98a1-8a67-4c73-9285-5883f9f33a56","ref_index":4,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2021,"title":"Taming transformers for high-resolution image 
synthesis","work_id":"79ce61b1-69be-4667-80fd-eb5b40b6dcb4","ref_index":5,"cited_arxiv_id":"","is_internal_anchor":false}],"resolved_work":48,"snapshot_sha256":"0997c4fba9953ccdfbff9df324b6b9da495f3c5957905ffa6613ee8e2beba715","internal_anchors":15},"formal_canon":{"evidence_count":2,"snapshot_sha256":"ec97c6e8e91c6e42965c4e19c01b8eb6256e69265d6c3a187c03456ff8c7e8fe"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"},"aliases":[{"alias_kind":"arxiv","alias_value":"2211.13221","created_at":"2026-05-17T23:38:53.535025+00:00"},{"alias_kind":"arxiv_version","alias_value":"2211.13221v2","created_at":"2026-05-17T23:38:53.535025+00:00"},{"alias_kind":"doi","alias_value":"10.48550/arxiv.2211.13221","created_at":"2026-05-17T23:38:53.535025+00:00"},{"alias_kind":"pith_short_12","alias_value":"U52T37GUSXOP","created_at":"2026-05-18T12:33:33.725879+00:00"},{"alias_kind":"pith_short_16","alias_value":"U52T37GUSXOPBDRO","created_at":"2026-05-18T12:33:33.725879+00:00"},{"alias_kind":"pith_short_8","alias_value":"U52T37GU","created_at":"2026-05-18T12:33:33.725879+00:00"}],"events":[],"event_summary":{},"paper_claims":[],"inbound_citations":{"count":41,"internal_anchor_count":41,"sample":[{"citing_arxiv_id":"2503.06310","citing_title":"Scene-Action Prompt Fusion for Coherent Text-to-Video Storytelling","ref_index":13,"is_internal_anchor":true},{"citing_arxiv_id":"2504.17180","citing_title":"We'll Fix it in Post: Improving Text-to-Video Generation with Neuro-Symbolic Feedback","ref_index":29,"is_internal_anchor":true},{"citing_arxiv_id":"2504.18576","citing_title":"DriVerse: Navigation World Model for Driving Simulation via Multimodal Trajectory Prompting and Motion Alignment","ref_index":24,"is_internal_anchor":true},{"citing_arxiv_id":"2505.16819","citing_title":"Character-Centered Dialogue Generation from Scene-Level 
Prompts","ref_index":24,"is_internal_anchor":true},{"citing_arxiv_id":"2602.02214","citing_title":"Causal Forcing: Autoregressive Diffusion Distillation Done Right for High-Quality Real-Time Interactive Video Generation","ref_index":12,"is_internal_anchor":true},{"citing_arxiv_id":"2511.22940","citing_title":"One-to-All Animation: Alignment-Free Character Animation and Image Pose Transfer","ref_index":13,"is_internal_anchor":true},{"citing_arxiv_id":"2602.02214","citing_title":"Causal Forcing: Autoregressive Diffusion Distillation Done Right for High-Quality Real-Time Interactive Video Generation","ref_index":12,"is_internal_anchor":true},{"citing_arxiv_id":"2603.03066","citing_title":"EduVQA: Towards Concept-Aware Assessment of Educational AI-Generated Videos","ref_index":4,"is_internal_anchor":true},{"citing_arxiv_id":"2605.16530","citing_title":"SWoMo: Neuro-Symbolic World Model for Cataract Surgery Simulation","ref_index":12,"is_internal_anchor":true},{"citing_arxiv_id":"2605.16530","citing_title":"SWoMo: Neuro-Symbolic World Model for Cataract Surgery Simulation","ref_index":12,"is_internal_anchor":true},{"citing_arxiv_id":"2308.08089","citing_title":"DragNUWA: Fine-grained Control in Video Generation by Integrating Text, Image, and Trajectory","ref_index":294,"is_internal_anchor":true},{"citing_arxiv_id":"2506.09981","citing_title":"ReSim: Reliable World Simulation for Autonomous Driving","ref_index":106,"is_internal_anchor":true},{"citing_arxiv_id":"2503.19325","citing_title":"Long-Context Autoregressive Video Modeling with Next-Frame Prediction","ref_index":49,"is_internal_anchor":true},{"citing_arxiv_id":"2512.04678","citing_title":"Reward Forcing: Efficient Streaming Video Generation with Rewarded Distribution Matching Distillation","ref_index":23,"is_internal_anchor":true},{"citing_arxiv_id":"2502.06764","citing_title":"History-Guided Video Diffusion","ref_index":20,"is_internal_anchor":true},{"citing_arxiv_id":"2310.06114","citing_title":"Learning 
Interactive Real-World Simulators","ref_index":197,"is_internal_anchor":true},{"citing_arxiv_id":"2510.02283","citing_title":"Self-Forcing++: Towards Minute-Scale High-Quality Video Generation","ref_index":16,"is_internal_anchor":true},{"citing_arxiv_id":"2312.14125","citing_title":"VideoPoet: A Large Language Model for Zero-Shot Video Generation","ref_index":17,"is_internal_anchor":true},{"citing_arxiv_id":"2603.09721","citing_title":"FrameDiT: Diffusion Transformer with Matrix Attention for Efficient Video Generation","ref_index":14,"is_internal_anchor":true},{"citing_arxiv_id":"2603.17812","citing_title":"ChopGrad: Pixel-Wise Losses for Latent Video Diffusion via Truncated Backpropagation","ref_index":22,"is_internal_anchor":true},{"citing_arxiv_id":"2605.14136","citing_title":"TeDiO: Temporal Diagonal Optimization for Training-Free Coherent Video Diffusion","ref_index":17,"is_internal_anchor":true},{"citing_arxiv_id":"2509.22622","citing_title":"LongLive: Real-time Interactive Long Video Generation","ref_index":13,"is_internal_anchor":true},{"citing_arxiv_id":"2310.19512","citing_title":"VideoCrafter1: Open Diffusion Models for High-Quality Video Generation","ref_index":24,"is_internal_anchor":true},{"citing_arxiv_id":"2503.21755","citing_title":"VBench-2.0: Advancing Video Generation Benchmark Suite for Intrinsic Faithfulness","ref_index":53,"is_internal_anchor":true},{"citing_arxiv_id":"2605.13775","citing_title":"RoboEvolve: Co-Evolving Planner-Simulator for Robotic Manipulation with Limited 
Data","ref_index":51,"is_internal_anchor":true}]},"formal_canon":{"evidence_count":2,"sample":[],"anchors":[]},"links":{"html":"https://pith.science/pith/U52T37GUSXOPBDROGA4KV2MDWC","json":"https://pith.science/pith/U52T37GUSXOPBDROGA4KV2MDWC.json","graph_json":"https://pith.science/api/pith-number/U52T37GUSXOPBDROGA4KV2MDWC/graph.json","events_json":"https://pith.science/api/pith-number/U52T37GUSXOPBDROGA4KV2MDWC/events.json","paper":"https://pith.science/paper/U52T37GU"},"agent_actions":{"view_html":"https://pith.science/pith/U52T37GUSXOPBDROGA4KV2MDWC","download_json":"https://pith.science/pith/U52T37GUSXOPBDROGA4KV2MDWC.json","view_paper":"https://pith.science/paper/U52T37GU","resolve_alias":"https://pith.science/api/pith-number/resolve?arxiv=2211.13221&json=true","fetch_graph":"https://pith.science/api/pith-number/U52T37GUSXOPBDROGA4KV2MDWC/graph.json","fetch_events":"https://pith.science/api/pith-number/U52T37GUSXOPBDROGA4KV2MDWC/events.json","actions":{"anchor_timestamp":"https://pith.science/pith/U52T37GUSXOPBDROGA4KV2MDWC/action/timestamp_anchor","attest_storage":"https://pith.science/pith/U52T37GUSXOPBDROGA4KV2MDWC/action/storage_attestation","attest_author":"https://pith.science/pith/U52T37GUSXOPBDROGA4KV2MDWC/action/author_attestation","sign_citation":"https://pith.science/pith/U52T37GUSXOPBDROGA4KV2MDWC/action/citation_signature","submit_replication":"https://pith.science/pith/U52T37GUSXOPBDROGA4KV2MDWC/action/replication_record"}},"created_at":"2026-05-17T23:38:53.535025+00:00","updated_at":"2026-05-17T23:38:53.535025+00:00"}