{"paper":{"title":"TokenFlow: Consistent Diffusion Features for Consistent Video Editing","license":"http://creativecommons.org/licenses/by/4.0/","headline":"Enforcing consistency among diffusion features across frames yields temporally coherent text-driven video edits.","cross_cats":[],"primary_cat":"cs.CV","authors_text":"Michal Geyer, Omer Bar-Tal, Shai Bagon, Tali Dekel","submitted_at":"2023-07-19T18:00:03Z","abstract_excerpt":"The generative AI revolution has recently expanded to videos. Nevertheless, current state-of-the-art video models are still lagging behind image models in terms of visual quality and user control over the generated content. In this work, we present a framework that harnesses the power of a text-to-image diffusion model for the task of text-driven video editing. Specifically, given a source video and a target text-prompt, our method generates a high-quality video that adheres to the target text, while preserving the spatial layout and motion of the input video. Our method is based on a key obse"},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"consistency in the edited video can be obtained by enforcing consistency in the diffusion feature space... by explicitly propagating diffusion features based on inter-frame correspondences","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"That propagating diffusion features according to inter-frame correspondences will produce spatially and temporally consistent edits without introducing new artifacts or breaking the text conditioning.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"TokenFlow produces consistent text-driven video edits by propagating diffusion features according to inter-frame correspondences extracted from the source video.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"Enforcing consistency among diffusion features across frames yields temporally coherent text-driven video edits.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"4ccb249b00a6a7e3e13d95681f3b8f30d3b6fafe43474d8898a78c5199fb930d"},"source":{"id":"2307.10373","kind":"arxiv","version":3},"verdict":{"id":"7f5d009a-ce61-46d1-9840-926ab35d5e64","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-17T20:12:21.562760Z","strongest_claim":"consistency in the edited video can be obtained by enforcing consistency in the diffusion feature space... by explicitly propagating diffusion features based on inter-frame correspondences","one_line_summary":"TokenFlow produces consistent text-driven video edits by propagating diffusion features according to inter-frame correspondences extracted from the source video.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"That propagating diffusion features according to inter-frame correspondences will produce spatially and temporally consistent edits without introducing new artifacts or breaking the text conditioning.","pith_extraction_headline":"Enforcing consistency among diffusion features across frames yields temporally coherent text-driven video edits."},"references":{"count":27,"sample":[{"doi":"","year":null,"title":"Multidiffusion: Fusing diffusion paths for controlled image generation","work_id":"1a06e9b7-e97f-4665-ba6e-55301822d3b6","ref_index":1,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":null,"title":"Pix2video: Video editing using image diffusion","work_id":"9a64fd9d-fb53-46a1-8edb-808f1a5cd594","ref_index":2,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":null,"title":"Attend-and-excite: Attention-based semantic guidance for text-to-image diffusion models","work_id":"92bbf1bb-4319-45c4-845c-523bfc2cb97b","ref_index":3,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":null,"title":"Diffusion models in vision: A survey.arXiv e-prints, abs/2209.04747","work_id":"46a3884e-d8d0-426d-873e-9096d96da38d","ref_index":4,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":null,"title":"Structure and content-guided video synthesis with diffusion models","work_id":"9ec84ad8-517d-4d27-ba1a-447562d63988","ref_index":5,"cited_arxiv_id":"","is_internal_anchor":false}],"resolved_work":27,"snapshot_sha256":"1458884d8ae3174bd27a627983cb5513060573d2271a44c8fca28fb9a36dc743","internal_anchors":7},"formal_canon":{"evidence_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"}