TokenFlow: Consistent Diffusion Features for Consistent Video Editing

arxiv: 2307.10373 · v3 · pith:4ZVQJD6Unew · submitted 2023-07-19 · 💻 cs.CV

TokenFlow: Consistent Diffusion Features for Consistent Video Editing

Michal Geyer , Omer Bar-Tal , Shai Bagon , Tali Dekel This is my paper

Pith reviewed 2026-05-17 20:12 UTC · model grok-4.3

classification 💻 cs.CV

keywords video editingdiffusion modelstext-to-videofeature consistencytemporal coherenceimage-to-video editingfeature propagation

0 comments p. Extension

pith:4ZVQJD6U Add to your LaTeX paper

What is a Pith Number?

\usepackage{pith}
\pithnumber{4ZVQJD6U}

Prints a linked pith:4ZVQJD6U badge after your title and writes the identifier into PDF metadata. Compiles on arXiv with no extra files. Learn more

The pith

Enforcing consistency among diffusion features across frames yields temporally coherent text-driven video edits.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that video edits can stay consistent in space and time by making the underlying diffusion features consistent as well. It does this by taking the features from a text-to-image diffusion model and propagating them from frame to frame using the motion correspondences that the model already computes. A reader would care because this turns any good image editor into a video editor without needing to train a new model or collect video data.

Core claim

The central discovery is that consistency in the edited video can be obtained by enforcing consistency in the diffusion feature space. This is achieved by explicitly propagating diffusion features based on inter-frame correspondences that are readily available in the model. The method requires no training or fine-tuning and works with any off-the-shelf text-to-image editing technique.

What carries the argument

TokenFlow, the mechanism that propagates diffusion features (tokens) across video frames according to inter-frame correspondences to enforce consistency in the feature space while applying text-driven edits.

If this is right

High-quality video edits that preserve the input video's spatial layout and motion.
Compatibility with any existing text-to-image editing method without modification.
State-of-the-art results on real-world videos for text-driven editing tasks.
No need for training or fine-tuning on video data.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The method could extend to other video generation tasks if similar feature correspondences are available.
Similar propagation ideas might improve consistency in other generative models beyond diffusion.
By avoiding video-specific training, it lowers the barrier for experimenting with video editing techniques.

Load-bearing premise

That aligning and moving the diffusion features from one frame to the next according to their natural correspondences will keep the edits faithful to the text prompt and free of new visual artifacts.

What would settle it

A video where propagating the features according to inter-frame motion still produces flickering, blurring, or loss of the original motion pattern in the output.

read the original abstract

The generative AI revolution has recently expanded to videos. Nevertheless, current state-of-the-art video models are still lagging behind image models in terms of visual quality and user control over the generated content. In this work, we present a framework that harnesses the power of a text-to-image diffusion model for the task of text-driven video editing. Specifically, given a source video and a target text-prompt, our method generates a high-quality video that adheres to the target text, while preserving the spatial layout and motion of the input video. Our method is based on a key observation that consistency in the edited video can be obtained by enforcing consistency in the diffusion feature space. We achieve this by explicitly propagating diffusion features based on inter-frame correspondences, readily available in the model. Thus, our framework does not require any training or fine-tuning, and can work in conjunction with any off-the-shelf text-to-image editing method. We demonstrate state-of-the-art editing results on a variety of real-world videos. Webpage: https://diffusion-tokenflow.github.io/

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript presents TokenFlow, a training-free framework for text-driven video editing that applies a pre-trained text-to-image diffusion model to a source video and target prompt. The central mechanism computes inter-frame token correspondences once on the source video's diffusion features and then propagates the edited per-frame features along those fixed mappings to enforce temporal consistency while preserving layout and motion. The method is designed to be compatible with any off-the-shelf image editing technique and is evaluated through qualitative demonstrations on real-world videos, with the claim of state-of-the-art results.

Significance. If the consistency mechanism holds under realistic editing conditions, the work is significant because it offers a practical, training-free route to extend high-quality image diffusion models to video without requiring video-specific fine-tuning or large-scale video datasets. The explicit use of diffusion-feature correspondences already present in the model is a clean design choice that avoids additional learned components. The paper supplies qualitative evidence across diverse videos, but the absence of quantitative metrics and targeted robustness tests limits the strength of the significance assessment.

major comments (2)

[§3] §3 (Method): The propagation of edited features along source-derived correspondences (described after Eq. (3) or equivalent) is load-bearing for the consistency claim, yet the manuscript provides no analysis or ablation showing that these correspondences remain semantically valid once the features have been altered by a text-driven edit. Large prompt changes that modify shape, identity, or layout can invalidate the original geometry, risking misalignment or broken text conditioning; this assumption is not tested.
[§4] §4 (Experiments): The evaluation consists solely of qualitative examples and visual comparisons. No quantitative metrics (e.g., temporal consistency scores, CLIP-based text alignment, or user studies), ablation studies on correspondence quality, or error analysis under varying prompt strengths are reported. This absence directly weakens the “state-of-the-art” claim and the assertion that the method works “under varied conditions.”

minor comments (2)

[Abstract] Abstract: The sentence “readily available in the model” is vague; a brief parenthetical clarifying that correspondences are extracted from the U-Net attention maps or token similarities during the source inversion pass would improve clarity.
[Figures] Figure 3 and 4 captions: Adding the exact source and target prompts used for each row would help readers reproduce and interpret the qualitative results.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on our manuscript. We address each major point below and describe the revisions we will incorporate to strengthen the paper.

read point-by-point responses

Referee: [§3] §3 (Method): The propagation of edited features along source-derived correspondences (described after Eq. (3) or equivalent) is load-bearing for the consistency claim, yet the manuscript provides no analysis or ablation showing that these correspondences remain semantically valid once the features have been altered by a text-driven edit. Large prompt changes that modify shape, identity, or layout can invalidate the original geometry, risking misalignment or broken text conditioning; this assumption is not tested.

Authors: We appreciate the referee highlighting the importance of validating the semantic stability of source-derived correspondences after editing. The correspondences are computed from the source video's diffusion features, which capture both low-level structure and higher-level semantics through the denoising process. Edits are applied in feature space while the mappings remain fixed to enforce consistency. Although the manuscript supports this through extensive qualitative results on diverse real-world videos with varying degrees of prompt change, we agree that dedicated analysis would strengthen the claim. In the revised manuscript we will add a new subsection discussing this assumption, including visualizations of correspondence maps before and after large edits and a targeted ablation that compares propagation versus independent per-frame editing on sequences involving shape or identity modifications. revision: yes
Referee: [§4] §4 (Experiments): The evaluation consists solely of qualitative examples and visual comparisons. No quantitative metrics (e.g., temporal consistency scores, CLIP-based text alignment, or user studies), ablation studies on correspondence quality, or error analysis under varying prompt strengths are reported. This absence directly weakens the “state-of-the-art” claim and the assertion that the method works “under varied conditions.”

Authors: We acknowledge that the current evaluation is primarily qualitative and that quantitative metrics would provide additional support for the state-of-the-art claim. Our focus on visual comparisons stems from the fact that temporal consistency and perceptual quality in video editing are most reliably judged by direct inspection, especially given the training-free nature of the method. To address the referee's concern, the revised version will include quantitative evaluations: temporal consistency scores computed via optical-flow warping error, CLIP-based text-alignment scores averaged over frames, and a user study with preference ratings for consistency and fidelity. We will also add ablations on correspondence quality and error analysis across different prompt strengths and editing magnitudes. revision: yes

Circularity Check

0 steps flagged

No circularity: consistency mechanism uses pre-trained model correspondences without reduction to fitted inputs or self-definition

full rationale

The paper's core claim is that video editing consistency follows from propagating diffusion features along inter-frame correspondences extracted from the source video. These correspondences are computed directly from the off-the-shelf text-to-image diffusion model applied to the input frames and are not fitted or optimized against the edited output. No equation defines the final edited video in terms of itself, renames a known result, or relies on a load-bearing self-citation whose validity is assumed rather than independently verified. The method is presented as a training-free post-processing step that can be combined with any external editing technique, keeping the derivation self-contained against external model behavior rather than tautological.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The framework rests on the domain assumption that diffusion features carry layout and motion information that can be transferred across frames, plus the standard assumption that off-the-shelf text-to-image diffusion models already encode usable inter-frame correspondences. No free parameters or new invented entities are introduced in the abstract description.

axioms (2)

domain assumption Diffusion features encode spatial layout and motion information that remains useful when propagated across frames
Invoked in the key observation that consistency in feature space yields consistent video edits.
domain assumption Inter-frame correspondences are readily available inside the diffusion model without extra computation
Stated directly in the abstract as the basis for feature propagation.

pith-pipeline@v0.9.0 · 5485 in / 1304 out tokens · 28596 ms · 2026-05-17T20:12:21.562760+00:00 · methodology

discussion (0)

Forward citations

Cited by 19 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Efficient Video Diffusion Models: Advancements and Challenges
cs.CV 2026-04 unverdicted novelty 7.0

A survey that groups efficient video diffusion methods into four paradigms—step distillation, efficient attention, model compression, and cache/trajectory optimization—and outlines open challenges for practical use.
Physics-Aware Video Instance Removal Benchmark
cs.CV 2026-04 unverdicted novelty 7.0

The PVIR benchmark tests video object removal on physical consistency using 95 annotated videos and shows that existing methods struggle with complex interactions like lingering shadows.
Token Warping Helps MLLMs Look from Nearby Viewpoints
cs.CV 2026-04 unverdicted novelty 7.0

Backward token warping in ViT-based MLLMs enables reliable reasoning from nearby viewpoints by preserving semantic coherence better than pixel-wise warping or fine-tuning baselines.
Preserving Source Video Realism: High-Fidelity Face Swapping for Cinematic Quality
cs.CV 2025-12 unverdicted novelty 7.0

LivingSwap is the first video reference-guided face swapping model that uses keyframe conditioning and temporal stitching to preserve source video realism with high fidelity across long sequences.
ASTRA: Let Arbitrary Subjects Transform in Video Editing
cs.CV 2025-10 unverdicted novelty 7.0

ASTRA is a plug-and-play training-free method for precise multi-subject video editing that uses prompt-guided multimodal alignment and prior-based mask retargeting to avoid attention dilution and boundary issues.
Beyond Text Prompts: Visual-to-Visual Generation as A Unified Paradigm
cs.CV 2026-05 unverdicted novelty 6.0

V2V-Zero adapts frozen VLMs for visual conditioning via hidden states from specification pages, scoring 0.85 on GenEval and 32.7 on a new seven-task benchmark while revealing capability hierarchies in attribute bindin...
Blind Bitstream-corrupted Video Recovery via Metadata-guided Diffusion Model
cs.CV 2026-04 unverdicted novelty 6.0

M-GDM uses motion vectors and frame types to guide a diffusion model in blind recovery of bitstream-corrupted videos without manual masks.
DiT as Real-Time Rerenderer: Streaming Video Stylization with Autoregressive Diffusion Transformer
cs.CV 2026-04 unverdicted novelty 6.0

RTR-DiT distills a bidirectional DiT teacher into an autoregressive few-step model using Self Forcing and Distribution Matching Distillation, plus a reference-preserving KV cache, to enable stable real-time text- and ...
InsEdit: Towards Instruction-based Visual Editing via Data-Efficient Video Diffusion Models Adaptation
cs.CV 2026-04 unverdicted novelty 6.0

InsEdit adapts a video diffusion backbone for text-instruction video editing via Mutual Context Attention, achieving SOTA open-source results with O(100K) data while also supporting image editing.
Less is More: Data-Efficient Adaptation for Controllable Text-to-Video Generation
cs.CV 2025-11 unverdicted novelty 6.0

Fine-tuning text-to-video models on sparse low-quality synthetic data for physical camera controls outperforms fine-tuning on photorealistic data.
Text Slider: Efficient and Plug-and-Play Continuous Concept Control for Image/Video Synthesis via LoRA Adapters
cs.GR 2025-09 unverdicted novelty 6.0

Text Slider uses LoRA adapters on pre-trained text encoders to identify low-rank directions for efficient, plug-and-play continuous concept control in diffusion-based image and video synthesis.
TaleDiffusion: Multi-Character Story Generation with Dialogue Rendering
cs.CV 2025-09 unverdicted novelty 6.0

TaleDiffusion introduces an iterative framework using LLM-generated per-frame descriptions, bounded attention-based per-box masks, identity-consistent self-attention, region-aware cross-attention, and CLIPSeg-based di...
VBench-2.0: Advancing Video Generation Benchmark Suite for Intrinsic Faithfulness
cs.CV 2025-03 accept novelty 6.0

VBench-2.0 is a benchmark suite that automatically evaluates video generative models on five dimensions of intrinsic faithfulness: Human Fidelity, Controllability, Creativity, Physics, and Commonsense using VLMs, LLMs...
VideoPoet: A Large Language Model for Zero-Shot Video Generation
cs.CV 2023-12 unverdicted novelty 6.0

VideoPoet is a large language model that performs zero-shot video generation with audio from diverse multimodal conditioning signals.
Context Unrolling in Omni Models
cs.CV 2026-04 unverdicted novelty 5.0

Omni is a multimodal model whose native training on diverse data types enables context unrolling, allowing explicit reasoning across modalities to better approximate shared knowledge and improve downstream performance.
Controllable Video Object Insertion via Multiview Priors
cs.CV 2026-04 unverdicted novelty 5.0

A multi-view prior-based framework for video object insertion that uses dual-path conditioning and an integration-aware consistency module to improve appearance stability and occlusion handling.
Movie Gen: A Cast of Media Foundation Models
cs.CV 2024-10 unverdicted novelty 5.0

A 30B-parameter transformer and related models generate high-quality videos and audio, claiming state-of-the-art results on text-to-video, video editing, personalization, and audio generation tasks.
TorchUMM: A Unified Multimodal Model Codebase for Evaluation, Analysis, and Post-training
cs.AI 2026-04 unverdicted novelty 4.0

TorchUMM is the first unified codebase and benchmark suite for standardized evaluation of diverse unified multimodal models on understanding, generation, and editing tasks.
LMMs Meet Object-Centric Vision: Understanding, Segmentation, Editing and Generation
cs.CV 2026-04 unverdicted novelty 3.0

This review organizes literature on large multimodal models and object-centric vision into four themes—understanding, referring segmentation, editing, and generation—while summarizing paradigms, strategies, and challe...

Reference graph

Works this paper leans on

27 extracted references · 27 canonical work pages · cited by 19 Pith papers · 7 internal anchors

[1]

Multidiffusion: Fusing diffusion paths for controlled image generation

Omer Bar-Tal, Lior Yariv, Yaron Lipman, and Tali Dekel. Multidiffusion: Fusing diffusion paths for controlled image generation. arXiv preprint arXiv:2302.08113,

work page arXiv
[2]

Pix2video: Video editing using image diffusion

Duygu Ceylan, Chun-Hao Paul Huang, and Niloy Jyoti Mitra. Pix2video: Video editing using image diffusion. ArXiv, abs/2303.12688,

work page arXiv
[3]

Attend-and-excite: Attention-based semantic guidance for text-to-image diffusion models

Hila Chefer, Yuval Alaluf, Yael Vinker, Lior Wolf, and Daniel Cohen-Or. Attend-and-excite: Attention-based semantic guidance for text-to-image diffusion models. arXiv preprint arXiv:2301.13826,

work page arXiv
[4]

Diffusion models in vision: A survey

Florinel-Alin Croitoru, Vlad Hondru, Radu Tudor Ionescu, and Mubarak Shah. Diffusion models in vision: A survey. arXiv preprint arXiv:2209.04747,

work page arXiv
[5]

Structure and content-guided video synthesis with diffusion models.arXiv preprint arXiv:2302.03011,

Patrick Esser, Johnathan Chiu, Parmida Atighehchian, Jonathan Granskog, and Anastasis Germanidis. Structure and content-guided video synthesis with diffusion models. arXiv preprint arXiv:2302.03011,

work page arXiv
[6]

Scenescape: Text-driven consistent scene gener- ation

Rafail Fridman, Amit Abecasis, Yoni Kasten, and Tali Dekel. Scenescape: Text-driven consistent scene gener- ation. arXiv preprint arXiv:2302.01133,

work page arXiv
[7]

Prompt-to-Prompt Image Editing with Cross Attention Control

Amir Hertz, Ron Mokady, Jay Tenenbaum, Kfir Aberman, Yael Pritch, and Daniel Cohen-Or. Prompt-to-prompt image editing with cross attention control. arXiv preprint arXiv:2208.01626,

work page internal anchor Pith review Pith/arXiv arXiv
[8]

Imagen Video: High Definition Video Generation with Diffusion Models

Jonathan Ho, William Chan, Chitwan Saharia, Jay Whang, Ruiqi Gao, Alexey Gritsenko, Diederik P Kingma, Ben Poole, Mohammad Norouzi, David J Fleet, et al. Imagen video: High definition video generation with diffusion models. arXiv preprint arXiv:2210.02303, 2022a. Jonathan Ho, Tim Salimans, Alexey Gritsenko, William Chan, Mohammad Norouzi, and David J Flee...

work page internal anchor Pith review Pith/arXiv arXiv
[9]

Text2video-zero: Text-to-image diffusion models are zero-shot video generators

Levon Khachatryan, Andranik Movsisyan, Vahram Tadevosyan, Roberto Henschel, Zhangyang Wang, Shant Navasardyan, and Humphrey Shi. Text2video-zero: Text-to-image diffusion models are zero-shot video generators. ArXiv, abs/2303.13439, 2023a. Levon Khachatryan, Andranik Movsisyan, Vahram Tadevosyan, Roberto Henschel, Zhangyang Wang, Shant Navasardyan, and Hum...

work page arXiv
[10]

Learning blind video temporal consistency

Wei-Sheng Lai, Jia-Bin Huang, Oliver Wang, Eli Shechtman, Ersin Yumer, and Ming-Hsuan Yang. Learning blind video temporal consistency. In European Conference on Computer Vision, 2018a. Wei-Sheng Lai, Jia-Bin Huang, Oliver Wang, Eli Shechtman, Ersin Yumer, and Ming-Hsuan Yang. Learning blind video temporal consistency. In Proceedings of the European confer...

work page arXiv
[11]

Video-p2p: Video editing with cross- attention control

Shaoteng Liu, Yuecheng Zhang, Wenbo Li, Zhe Lin, and Jiaya Jia. Video-p2p: Video editing with cross- attention control. ArXiv, abs/2303.04761,

work page arXiv
[12]

Directed diffusion: Direct control of object placement through attention guidance

Wan-Duo Kurt Ma, JP Lewis, W Bastiaan Kleijn, and Thomas Leung. Directed diffusion: Direct control of object placement through attention guidance. arXiv preprint arXiv:2302.13153,

work page arXiv
[13]

GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models

Alex Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob McGrew, Ilya Sutskever, and Mark Chen. Glide: Towards photorealistic image generation and editing with text-guided diffusion models. arXiv preprint arXiv:2112.10741,

work page internal anchor Pith review Pith/arXiv arXiv
[14]

Localizing object-level shape variations with text-to-image diffusion models

Or Patashnik, Daniel Garibi, Idan Azuri, Hadar Averbuch-Elor, and Daniel Cohen-Or. Localizing object-level shape variations with text-to-image diffusion models. arXiv preprint arXiv:2303.11306,

work page arXiv
[15]

The 2017 DAVIS Challenge on Video Object Segmentation

Jordi Pont-Tuset, Federico Perazzi, Sergi Caelles, Pablo Arbel´aez, Alex Sorkine-Hornung, and Luc Van Gool. The 2017 davis challenge on video object segmentation. arXiv preprint arXiv:1704.00675,

work page internal anchor Pith review Pith/arXiv arXiv 2017
[16]

Fatezero: Fusing attentions for zero-shot text-based video editing, 2023

Chenyang Qi, Xiaodong Cun, Yong Zhang, Chenyang Lei, Xintao Wang, Ying Shan, and Qifeng Chen. Fatezero: Fusing attentions for zero-shot text-based video editing. arXiv:2303.09535,

work page arXiv
[17]

Hierarchical Text-Conditional Image Generation with CLIP Latents

Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125,

work page internal anchor Pith review Pith/arXiv arXiv
[18]

Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding

Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily Denton, Seyed Kamyar Seyed Ghasemipour, Burcu Karagol Ayan, S Sara Mahdavi, Rapha Gontijo Lopes, et al. Photorealistic text-to- image diffusion models with deep language understanding. arXiv preprint arXiv:2205.11487,

work page internal anchor Pith review Pith/arXiv arXiv
[19]

LAION-5B: An open large-scale dataset for training next generation image-text models

Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, Patrick Schramowski, Srivatsa Kundurthy, Katherine Crowson, Ludwig Schmidt, Robert Kaczmarczyk, and Jenia Jitsev. Laion-5b: An open large- scale dataset for training next generation image-text models...

work page internal anchor Pith review Pith/arXiv arXiv
[20]

Space-time super-resolution from a single video

Oded Shahar, Alon Faktor, and Michal Irani. Space-time super-resolution from a single video. In CVPR 2011,

work page 2011
[21]

Knn-diffusion: Image generation via large-scale retrieval

Shelly Sheynin, Oron Ashual, Adam Polyak, Uriel Singer, Oran Gafni, Eliya Nachmani, and Yaniv Taigman. Knn-diffusion: Image generation via large-scale retrieval. arXiv preprint arXiv:2204.02849,

work page arXiv
[22]

Edit-a-video: Single video editing with object-aware consistency

Chaehun Shin, Heeseung Kim, Che Hyun Lee, Sang gil Lee, and Sung-Hoon Yoon. Edit-a-video: Single video editing with object-aware consistency. ArXiv, abs/2303.07945,

work page arXiv
[23]

Raft: Recurrent all-pairs field transforms for optical flow

Zachary Teed and Jia Deng. Raft: Recurrent all-pairs field transforms for optical flow. In Computer Vision– ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part II 16 . Springer,

work page 2020
[24]

Tune-a-video: One-shot tuning of image diffusion models for text-to-video generation

Jay Zhangjie Wu, Yixiao Ge, Xintao Wang, Stan Weixian Lei, Yuchao Gu, Wynne Hsu, Ying Shan, Xiaohu Qie, and Mike Zheng Shou. Tune-a-video: One-shot tuning of image diffusion models for text-to-video generation. arXiv preprint arXiv:2212.11565,

work page arXiv
[25]

Han Zhang, Tao Xu, Hongsheng Li, Shaoting Zhang, Xiaogang Wang, Xiaolei Huang, and Dimitris N. Metaxas. Stackgan: Text to photo-realistic image synthesis with stacked generative adversarial networks. 2017 IEEE International Conference on Computer Vision (ICCV), pp. 5908–5916,

work page 2017
[26]

A tale of two features: Stable diffusion complements dino for zero-shot semantic correspon- dence

Junyi Zhang, Charles Herrmann, Junhwa Hur, Luisa Polania Cabrera, Varun Jampani, Deqing Sun, and Ming- Hsuan Yang. A tale of two features: Stable diffusion complements dino for zero-shot semantic correspon- dence. arXiv preprint arxiv:2305.15347,

work page arXiv
[27]

TA V Text2video-zero Rerender-a-video fatezero PnP ours (preprocess) ours (sampling) ours (total) 2684 198 285 349 208 50 187 237 We provide additional implementation details below

12 Table 3: We report average runtime in seconds, of running ours and competing methods on a video of 40 frames. TA V Text2video-zero Rerender-a-video fatezero PnP ours (preprocess) ours (sampling) ours (total) 2684 198 285 349 208 50 187 237 We provide additional implementation details below. We refer the reader to the HTML file attached to our Supplemen...

work page 2023

[1] [1]

Multidiffusion: Fusing diffusion paths for controlled image generation

Omer Bar-Tal, Lior Yariv, Yaron Lipman, and Tali Dekel. Multidiffusion: Fusing diffusion paths for controlled image generation. arXiv preprint arXiv:2302.08113,

work page arXiv

[2] [2]

Pix2video: Video editing using image diffusion

Duygu Ceylan, Chun-Hao Paul Huang, and Niloy Jyoti Mitra. Pix2video: Video editing using image diffusion. ArXiv, abs/2303.12688,

work page arXiv

[3] [3]

Attend-and-excite: Attention-based semantic guidance for text-to-image diffusion models

Hila Chefer, Yuval Alaluf, Yael Vinker, Lior Wolf, and Daniel Cohen-Or. Attend-and-excite: Attention-based semantic guidance for text-to-image diffusion models. arXiv preprint arXiv:2301.13826,

work page arXiv

[4] [4]

Diffusion models in vision: A survey

Florinel-Alin Croitoru, Vlad Hondru, Radu Tudor Ionescu, and Mubarak Shah. Diffusion models in vision: A survey. arXiv preprint arXiv:2209.04747,

work page arXiv

[5] [5]

Structure and content-guided video synthesis with diffusion models.arXiv preprint arXiv:2302.03011,

Patrick Esser, Johnathan Chiu, Parmida Atighehchian, Jonathan Granskog, and Anastasis Germanidis. Structure and content-guided video synthesis with diffusion models. arXiv preprint arXiv:2302.03011,

work page arXiv

[6] [6]

Scenescape: Text-driven consistent scene gener- ation

Rafail Fridman, Amit Abecasis, Yoni Kasten, and Tali Dekel. Scenescape: Text-driven consistent scene gener- ation. arXiv preprint arXiv:2302.01133,

work page arXiv

[7] [7]

Prompt-to-Prompt Image Editing with Cross Attention Control

Amir Hertz, Ron Mokady, Jay Tenenbaum, Kfir Aberman, Yael Pritch, and Daniel Cohen-Or. Prompt-to-prompt image editing with cross attention control. arXiv preprint arXiv:2208.01626,

work page internal anchor Pith review Pith/arXiv arXiv

[8] [8]

Imagen Video: High Definition Video Generation with Diffusion Models

Jonathan Ho, William Chan, Chitwan Saharia, Jay Whang, Ruiqi Gao, Alexey Gritsenko, Diederik P Kingma, Ben Poole, Mohammad Norouzi, David J Fleet, et al. Imagen video: High definition video generation with diffusion models. arXiv preprint arXiv:2210.02303, 2022a. Jonathan Ho, Tim Salimans, Alexey Gritsenko, William Chan, Mohammad Norouzi, and David J Flee...

work page internal anchor Pith review Pith/arXiv arXiv

[9] [9]

Text2video-zero: Text-to-image diffusion models are zero-shot video generators

Levon Khachatryan, Andranik Movsisyan, Vahram Tadevosyan, Roberto Henschel, Zhangyang Wang, Shant Navasardyan, and Humphrey Shi. Text2video-zero: Text-to-image diffusion models are zero-shot video generators. ArXiv, abs/2303.13439, 2023a. Levon Khachatryan, Andranik Movsisyan, Vahram Tadevosyan, Roberto Henschel, Zhangyang Wang, Shant Navasardyan, and Hum...

work page arXiv

[10] [10]

Learning blind video temporal consistency

Wei-Sheng Lai, Jia-Bin Huang, Oliver Wang, Eli Shechtman, Ersin Yumer, and Ming-Hsuan Yang. Learning blind video temporal consistency. In European Conference on Computer Vision, 2018a. Wei-Sheng Lai, Jia-Bin Huang, Oliver Wang, Eli Shechtman, Ersin Yumer, and Ming-Hsuan Yang. Learning blind video temporal consistency. In Proceedings of the European confer...

work page arXiv

[11] [11]

Video-p2p: Video editing with cross- attention control

Shaoteng Liu, Yuecheng Zhang, Wenbo Li, Zhe Lin, and Jiaya Jia. Video-p2p: Video editing with cross- attention control. ArXiv, abs/2303.04761,

work page arXiv

[12] [12]

Directed diffusion: Direct control of object placement through attention guidance

Wan-Duo Kurt Ma, JP Lewis, W Bastiaan Kleijn, and Thomas Leung. Directed diffusion: Direct control of object placement through attention guidance. arXiv preprint arXiv:2302.13153,

work page arXiv

[13] [13]

GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models

Alex Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob McGrew, Ilya Sutskever, and Mark Chen. Glide: Towards photorealistic image generation and editing with text-guided diffusion models. arXiv preprint arXiv:2112.10741,

work page internal anchor Pith review Pith/arXiv arXiv

[14] [14]

Localizing object-level shape variations with text-to-image diffusion models

Or Patashnik, Daniel Garibi, Idan Azuri, Hadar Averbuch-Elor, and Daniel Cohen-Or. Localizing object-level shape variations with text-to-image diffusion models. arXiv preprint arXiv:2303.11306,

work page arXiv

[15] [15]

The 2017 DAVIS Challenge on Video Object Segmentation

Jordi Pont-Tuset, Federico Perazzi, Sergi Caelles, Pablo Arbel´aez, Alex Sorkine-Hornung, and Luc Van Gool. The 2017 davis challenge on video object segmentation. arXiv preprint arXiv:1704.00675,

work page internal anchor Pith review Pith/arXiv arXiv 2017

[16] [16]

Fatezero: Fusing attentions for zero-shot text-based video editing, 2023

Chenyang Qi, Xiaodong Cun, Yong Zhang, Chenyang Lei, Xintao Wang, Ying Shan, and Qifeng Chen. Fatezero: Fusing attentions for zero-shot text-based video editing. arXiv:2303.09535,

work page arXiv

[17] [17]

Hierarchical Text-Conditional Image Generation with CLIP Latents

Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125,

work page internal anchor Pith review Pith/arXiv arXiv

[18] [18]

Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding

Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily Denton, Seyed Kamyar Seyed Ghasemipour, Burcu Karagol Ayan, S Sara Mahdavi, Rapha Gontijo Lopes, et al. Photorealistic text-to- image diffusion models with deep language understanding. arXiv preprint arXiv:2205.11487,

work page internal anchor Pith review Pith/arXiv arXiv

[19] [19]

LAION-5B: An open large-scale dataset for training next generation image-text models

Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, Patrick Schramowski, Srivatsa Kundurthy, Katherine Crowson, Ludwig Schmidt, Robert Kaczmarczyk, and Jenia Jitsev. Laion-5b: An open large- scale dataset for training next generation image-text models...

work page internal anchor Pith review Pith/arXiv arXiv

[20] [20]

Space-time super-resolution from a single video

Oded Shahar, Alon Faktor, and Michal Irani. Space-time super-resolution from a single video. In CVPR 2011,

work page 2011

[21] [21]

Knn-diffusion: Image generation via large-scale retrieval

Shelly Sheynin, Oron Ashual, Adam Polyak, Uriel Singer, Oran Gafni, Eliya Nachmani, and Yaniv Taigman. Knn-diffusion: Image generation via large-scale retrieval. arXiv preprint arXiv:2204.02849,

work page arXiv

[22] [22]

Edit-a-video: Single video editing with object-aware consistency

Chaehun Shin, Heeseung Kim, Che Hyun Lee, Sang gil Lee, and Sung-Hoon Yoon. Edit-a-video: Single video editing with object-aware consistency. ArXiv, abs/2303.07945,

work page arXiv

[23] [23]

Raft: Recurrent all-pairs field transforms for optical flow

Zachary Teed and Jia Deng. Raft: Recurrent all-pairs field transforms for optical flow. In Computer Vision– ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part II 16 . Springer,

work page 2020

[24] [24]

Tune-a-video: One-shot tuning of image diffusion models for text-to-video generation

Jay Zhangjie Wu, Yixiao Ge, Xintao Wang, Stan Weixian Lei, Yuchao Gu, Wynne Hsu, Ying Shan, Xiaohu Qie, and Mike Zheng Shou. Tune-a-video: One-shot tuning of image diffusion models for text-to-video generation. arXiv preprint arXiv:2212.11565,

work page arXiv

[25] [25]

Han Zhang, Tao Xu, Hongsheng Li, Shaoting Zhang, Xiaogang Wang, Xiaolei Huang, and Dimitris N. Metaxas. Stackgan: Text to photo-realistic image synthesis with stacked generative adversarial networks. 2017 IEEE International Conference on Computer Vision (ICCV), pp. 5908–5916,

work page 2017

[26] [26]

A tale of two features: Stable diffusion complements dino for zero-shot semantic correspon- dence

Junyi Zhang, Charles Herrmann, Junhwa Hur, Luisa Polania Cabrera, Varun Jampani, Deqing Sun, and Ming- Hsuan Yang. A tale of two features: Stable diffusion complements dino for zero-shot semantic correspon- dence. arXiv preprint arxiv:2305.15347,

work page arXiv

[27] [27]

TA V Text2video-zero Rerender-a-video fatezero PnP ours (preprocess) ours (sampling) ours (total) 2684 198 285 349 208 50 187 237 We provide additional implementation details below

12 Table 3: We report average runtime in seconds, of running ours and competing methods on a video of 40 frames. TA V Text2video-zero Rerender-a-video fatezero PnP ours (preprocess) ours (sampling) ours (total) 2684 198 285 349 208 50 187 237 We provide additional implementation details below. We refer the reader to the HTML file attached to our Supplemen...

work page 2023