{"record_type":"pith_number_record","schema_url":"https://pith.science/schemas/pith-number/v1.json","pith_number":"pith:2025:OVTZMQ4L5I6FT2X6KD52IQZJDF","short_pith_number":"pith:OVTZMQ4L","schema_version":"1.0","canonical_sha256":"756796438bea3c59eafe50fba44329197d7363b68df6f6c2c614f33ca7b2c00e","source":{"kind":"arxiv","id":"2502.06764","version":2},"attestation_state":"computed","paper":{"title":"History-Guided Video Diffusion","license":"http://creativecommons.org/licenses/by/4.0/","headline":"Diffusion Forcing Transformer lets video models condition on any number of past frames.","cross_cats":["cs.CV"],"primary_cat":"cs.LG","authors_text":"Boyuan Chen, Kiwhan Song, Max Simchowitz, Russ Tedrake, Vincent Sitzmann, Yilun Du","submitted_at":"2025-02-10T18:44:25Z","abstract_excerpt":"Classifier-free guidance (CFG) is a key technique for improving conditional generation in diffusion models, enabling more accurate control while enhancing sample quality. It is natural to extend this technique to video diffusion, which generates video conditioned on a variable number of context frames, collectively referred to as history. However, we find two key challenges to guiding with variable-length history: architectures that only support fixed-size conditioning, and the empirical observation that CFG-style history dropout performs poorly. 
To address this, we propose the Diffusion Forci"},"verification_status":{"content_addressed":true,"pith_receipt":true,"author_attested":false,"weak_author_claims":0,"strong_author_claims":0,"externally_anchored":false,"storage_verified":false,"citation_signatures":0,"replication_records":0,"graph_snapshot":true,"references_resolved":true,"formal_links_present":false},"canonical_record":{"source":{"id":"2502.06764","kind":"arxiv","version":2},"metadata":{"license":"http://creativecommons.org/licenses/by/4.0/","primary_cat":"cs.LG","submitted_at":"2025-02-10T18:44:25Z","cross_cats_sorted":["cs.CV"],"title_canon_sha256":"cd40cad7c6e5ff3cfb9fe443a7080dd443898d14479092a35d08ed647021426f","abstract_canon_sha256":"7f67076de87788c69b47d5551c71b2d7952f4d9b071ccc3f97727c58fedf0259"},"schema_version":"1.0"},"receipt":{"kind":"pith_receipt","key_id":"pith-v1-2026-05","algorithm":"ed25519","signed_at":"2026-05-17T23:38:47.953804Z","signature_b64":"L0DGg4b4PtvgZW+Zrwc0z6vjlIzQJQwjBCsIudkQza9fA3SfqzcDG+3n0QWTdmo9bCzeqOAv1kMC05OfEd9yDQ==","signed_message":"canonical_sha256_bytes","builder_version":"pith-number-builder-2026-05-17-v1","receipt_version":"0.3","canonical_sha256":"756796438bea3c59eafe50fba44329197d7363b68df6f6c2c614f33ca7b2c00e","last_reissued_at":"2026-05-17T23:38:47.953184Z","signature_status":"signed_v1","first_computed_at":"2026-05-17T23:38:47.953184Z","public_key_fingerprint":"8d4b5ee74e4693bcd1df2446408b0d54"},"graph_snapshot":{"paper":{"title":"History-Guided Video Diffusion","license":"http://creativecommons.org/licenses/by/4.0/","headline":"Diffusion Forcing Transformer lets video models condition on any number of past frames.","cross_cats":["cs.CV"],"primary_cat":"cs.LG","authors_text":"Boyuan Chen, Kiwhan Song, Max Simchowitz, Russ Tedrake, Vincent Sitzmann, Yilun Du","submitted_at":"2025-02-10T18:44:25Z","abstract_excerpt":"Classifier-free guidance (CFG) is a key technique for improving conditional generation in diffusion models, enabling more accurate 
control while enhancing sample quality. It is natural to extend this technique to video diffusion, which generates video conditioned on a variable number of context frames, collectively referred to as history. However, we find two key challenges to guiding with variable-length history: architectures that only support fixed-size conditioning, and the empirical observation that CFG-style history dropout performs poorly. To address this, we propose the Diffusion Forci"},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"We propose the Diffusion Forcing Transformer (DFoT), a video diffusion architecture and theoretically grounded training objective that jointly enable conditioning on a flexible number of history frames. We then introduce History Guidance, a family of guidance methods uniquely enabled by DFoT.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"That the DFoT training objective and architecture truly support arbitrary-length history without hidden performance costs or instability, and that the proposed history guidance methods generalize beyond the tested datasets and lengths.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"DFoT enables flexible history conditioning in video diffusion, with history guidance methods that boost temporal consistency and support long rollouts.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"Diffusion Forcing Transformer lets video models condition on any number of past 
frames.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"ab9d39b15b60defad8a26c7b4abf729c9f014c06e4cf49b993c1309483d4729b"},"source":{"id":"2502.06764","kind":"arxiv","version":2},"verdict":{"id":"9276c95b-0f63-4762-9abf-327be7b61973","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-16T11:56:30.341937Z","strongest_claim":"We propose the Diffusion Forcing Transformer (DFoT), a video diffusion architecture and theoretically grounded training objective that jointly enable conditioning on a flexible number of history frames. We then introduce History Guidance, a family of guidance methods uniquely enabled by DFoT.","one_line_summary":"DFoT enables flexible history conditioning in video diffusion, with history guidance methods that boost temporal consistency and support long rollouts.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"That the DFoT training objective and architecture truly support arbitrary-length history without hidden performance costs or instability, and that the proposed history guidance methods generalize beyond the tested datasets and lengths.","pith_extraction_headline":"Diffusion Forcing Transformer lets video models condition on any number of past frames."},"references":{"count":70,"sample":[{"doi":"","year":2023,"title":"All are worth words: A vit backbone for diffusion models","work_id":"4b93ec35-06cf-40f1-8cbe-c6b896c36f19","ref_index":1,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2017,"title":"Bellec, P. C. Optimal exponential bounds for aggregation of density estimators. 
Bernoulli, 23(1): 219--248, 2017","work_id":"f969585b-da22-44b3-b85a-b85fccaa9ca8","ref_index":2,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2023,"title":"Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets","work_id":"4f68eada-27e3-437a-a2fe-6e4ca524d0d3","ref_index":3,"cited_arxiv_id":"2311.15127","is_internal_anchor":true},{"doi":"","year":2023,"title":"W., Fidler, S., and Kreis, K","work_id":"957f9e1a-a2b2-431e-bff6-aecffd524f53","ref_index":4,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2024,"title":"Video generation models as world simulators","work_id":"e020e1af-8964-4aaa-a232-2d7f0d16f6a4","ref_index":5,"cited_arxiv_id":"","is_internal_anchor":false}],"resolved_work":70,"snapshot_sha256":"2043db7526c58d7266dfc948a0c5b0411fc12f24e5d1e59aeea3e77a61684eb4","internal_anchors":20},"formal_canon":{"evidence_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"},"aliases":[{"alias_kind":"arxiv","alias_value":"2502.06764","created_at":"2026-05-17T23:38:47.953274+00:00"},{"alias_kind":"arxiv_version","alias_value":"2502.06764v2","created_at":"2026-05-17T23:38:47.953274+00:00"},{"alias_kind":"doi","alias_value":"10.48550/arxiv.2502.06764","created_at":"2026-05-17T23:38:47.953274+00:00"},{"alias_kind":"pith_short_12","alias_value":"OVTZMQ4L5I6F","created_at":"2026-05-18T12:33:37.589309+00:00"},{"alias_kind":"pith_short_16","alias_value":"OVTZMQ4L5I6FT2X6","created_at":"2026-05-18T12:33:37.589309+00:00"},{"alias_kind":"pith_short_8","alias_value":"OVTZMQ4L","created_at":"2026-05-18T12:33:37.589309+00:00"}],"events":[],"event_summary":{},"paper_claims":[],"inbound_citations":{"count":28,"internal_anchor_count":28,"sample":[{"citing_arxiv_id":"2602.02214","citing_title":"Causal 
Forcing: Autoregressive Diffusion Distillation Done Right for High-Quality Real-Time Interactive Video Generation","ref_index":33,"is_internal_anchor":true},{"citing_arxiv_id":"2605.22717","citing_title":"Live Music Diffusion Models: Efficient Fine-Tuning and Post-Training of Interactive Diffusion Music Generators","ref_index":25,"is_internal_anchor":true},{"citing_arxiv_id":"2602.02214","citing_title":"Causal Forcing: Autoregressive Diffusion Distillation Done Right for High-Quality Real-Time Interactive Video Generation","ref_index":33,"is_internal_anchor":true},{"citing_arxiv_id":"2605.18365","citing_title":"GeoFlow: Enforcing Implicit Geometric Consistency in Video Generation","ref_index":62,"is_internal_anchor":true},{"citing_arxiv_id":"2507.07982","citing_title":"Geometry Forcing: Marrying Video Diffusion and 3D Representation for Consistent World Modeling","ref_index":64,"is_internal_anchor":true},{"citing_arxiv_id":"2510.26782","citing_title":"Cloning Deterministic Worlds: The Critical Role of Latent Geometry in Long-Horizon World Models","ref_index":19,"is_internal_anchor":true},{"citing_arxiv_id":"2512.09928","citing_title":"HiF-VLA: Hindsight, Insight and Foresight through Motion Representation for Vision-Language-Action Models","ref_index":32,"is_internal_anchor":true},{"citing_arxiv_id":"2512.04678","citing_title":"Reward Forcing: Efficient Streaming Video Generation with Rewarded Distribution Matching Distillation","ref_index":64,"is_internal_anchor":true},{"citing_arxiv_id":"2601.16933","citing_title":"Reward-Forcing: Autoregressive Video Generation with Reward Feedback","ref_index":20,"is_internal_anchor":true},{"citing_arxiv_id":"2505.23884","citing_title":"Test-Time Training Done Right","ref_index":63,"is_internal_anchor":true},{"citing_arxiv_id":"2602.02958","citing_title":"Quant VideoGen: Auto-Regressive Long Video Generation via 2-Bit KV-Cache 
Quantization","ref_index":13,"is_internal_anchor":true},{"citing_arxiv_id":"2602.07775","citing_title":"Rolling Sink: Bridging Limited-Horizon Training and Open-Ended Testing in Autoregressive Video Diffusion","ref_index":84,"is_internal_anchor":true},{"citing_arxiv_id":"2602.13669","citing_title":"EchoTorrent: Towards Swift, Sustained, and Streaming Multi-Modal Video Generation","ref_index":92,"is_internal_anchor":true},{"citing_arxiv_id":"2603.09721","citing_title":"FrameDiT: Diffusion Transformer with Matrix Attention for Efficient Video Generation","ref_index":44,"is_internal_anchor":true},{"citing_arxiv_id":"2603.10093","citing_title":"Equivariant Asynchronous Diffusion: An Adaptive Denoising Schedule for Accelerated Molecular Conformation Generation","ref_index":10,"is_internal_anchor":true},{"citing_arxiv_id":"2605.01725","citing_title":"Motion-Aware Caching for Efficient Autoregressive Video Generation","ref_index":53,"is_internal_anchor":true},{"citing_arxiv_id":"2509.22622","citing_title":"LongLive: Real-time Interactive Long Video Generation","ref_index":31,"is_internal_anchor":true},{"citing_arxiv_id":"2605.14487","citing_title":"Head Forcing: Long Autoregressive Video Generation via Head Heterogeneity","ref_index":45,"is_internal_anchor":true},{"citing_arxiv_id":"2605.11367","citing_title":"3D-Belief: Embodied Belief Inference via Generative 3D World Modeling","ref_index":16,"is_internal_anchor":true},{"citing_arxiv_id":"2605.08729","citing_title":"Unison: Harmonizing Motion, Speech, and Sound for Human-Centric Audio-Video Generation","ref_index":32,"is_internal_anchor":true},{"citing_arxiv_id":"2605.09442","citing_title":"SWIFT: Prompt-Adaptive Memory for Efficient Interactive Long Video Generation","ref_index":22,"is_internal_anchor":true},{"citing_arxiv_id":"2604.19741","citing_title":"CityRAG: Stepping Into a City via Spatially-Grounded Video 
Generation","ref_index":52,"is_internal_anchor":true},{"citing_arxiv_id":"2605.01725","citing_title":"Motion-Aware Caching for Efficient Autoregressive Video Generation","ref_index":32,"is_internal_anchor":true},{"citing_arxiv_id":"2605.01896","citing_title":"Divide and Conquer: Decoupled Representation Alignment for Multimodal World Models","ref_index":39,"is_internal_anchor":true},{"citing_arxiv_id":"2506.08009","citing_title":"Self Forcing: Bridging the Train-Test Gap in Autoregressive Video Diffusion","ref_index":73,"is_internal_anchor":true}]},"formal_canon":{"evidence_count":0,"sample":[],"anchors":[]},"links":{"html":"https://pith.science/pith/OVTZMQ4L5I6FT2X6KD52IQZJDF","json":"https://pith.science/pith/OVTZMQ4L5I6FT2X6KD52IQZJDF.json","graph_json":"https://pith.science/api/pith-number/OVTZMQ4L5I6FT2X6KD52IQZJDF/graph.json","events_json":"https://pith.science/api/pith-number/OVTZMQ4L5I6FT2X6KD52IQZJDF/events.json","paper":"https://pith.science/paper/OVTZMQ4L"},"agent_actions":{"view_html":"https://pith.science/pith/OVTZMQ4L5I6FT2X6KD52IQZJDF","download_json":"https://pith.science/pith/OVTZMQ4L5I6FT2X6KD52IQZJDF.json","view_paper":"https://pith.science/paper/OVTZMQ4L","resolve_alias":"https://pith.science/api/pith-number/resolve?arxiv=2502.06764&json=true","fetch_graph":"https://pith.science/api/pith-number/OVTZMQ4L5I6FT2X6KD52IQZJDF/graph.json","fetch_events":"https://pith.science/api/pith-number/OVTZMQ4L5I6FT2X6KD52IQZJDF/events.json","actions":{"anchor_timestamp":"https://pith.science/pith/OVTZMQ4L5I6FT2X6KD52IQZJDF/action/timestamp_anchor","attest_storage":"https://pith.science/pith/OVTZMQ4L5I6FT2X6KD52IQZJDF/action/storage_attestation","attest_author":"https://pith.science/pith/OVTZMQ4L5I6FT2X6KD52IQZJDF/action/author_attestation","sign_citation":"https://pith.science/pith/OVTZMQ4L5I6FT2X6KD52IQZJDF/action/citation_signature","submit_replication":"https://pith.science/pith/OVTZMQ4L5I6FT2X6KD52IQZJDF/action/replication_record"}},"created_at":"2026-05-17T2
3:38:47.953274+00:00","updated_at":"2026-05-17T23:38:47.953274+00:00"}