SimInsert: Seamless Video Object Insertion via Regional Sparse Attention Fusion

Gao Wang; Jiang Lin; Jizhi Zhang; Mingjie Wang; Qiang Tang; Qian Wang; Shenyi Wang; Song Wu; Xinyu Chen; Yuyi Qian

arxiv: 2605.23245 · v1 · pith:AZEDAHF3new · submitted 2026-05-22 · 💻 cs.CV · cs.AI

Xinyu Chen, Yuyi Qian, Jiang Lin, Shenyi Wang, Gao Wang, Zhiqiu Zhang, Jizhi Zhang, Mingjie Wang, Qiang Tang, Qian Wang, Song Wu, Zili Yi

Pith reviewed 2026-05-25 04:44 UTC · model grok-4.3

keywords video object insertion · diffusion models · training-free editing · spatio-temporal coherence · regional sparse attention · background preservation

The pith

SimInsert inserts objects into videos by editing one frame and letting image-to-video diffusion models extend the change over time.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces SimInsert, a training-free approach that splits video object insertion into single-frame editing plus a text description of motion. It then relies on the built-in generative knowledge of image-to-video diffusion models to fill in the remaining frames while keeping the background unchanged and allowing natural object-environment interactions. This matters if true because it removes the need for explicit motion engineering or model retraining that limits current methods. A reader would care because the result is higher fidelity without extra resources. The approach uses non-invasive guidance to maintain structure and prevent drift during denoising.

Core claim

SimInsert is a training-free paradigm that decouples video object insertion into intuitive single-frame editing and semantic motion description. It harnesses the generative priors of image-to-video diffusion models to propagate edits temporally, strictly preserving background invariance while enabling plausible, text-driven interactions. Non-invasive guidance mechanisms enforce structural consistency, facilitate seamless boundary fusion, and counteract fidelity drift during the denoising trajectory.

What carries the argument

Non-invasive guidance mechanisms inside image-to-video diffusion models, including regional sparse attention fusion, that enforce structural consistency and boundary fusion during denoising.
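One simple way to picture guidance that leaves the background untouched is masked replacement during denoising: after every step, latents outside the inserted-object region are overwritten with those of the unedited source path. The sketch below is only an illustration under stated assumptions and is not claimed to be the paper's mechanism; `masked_replace`, `toy_denoise_step`, `run_guided`, the latent shape, and the mask layout are all hypothetical.

```python
import numpy as np

def masked_replace(edit_latent, source_latent, object_mask):
    """Keep edit-path values inside the object mask and source-path values outside it."""
    return np.where(object_mask, edit_latent, source_latent)

def toy_denoise_step(latent):
    """Stand-in for one denoising step of a real model (assumption: simple shrinkage)."""
    return 0.9 * latent

def run_guided(source_latent, edit_latent, object_mask, num_steps=4):
    """Denoise both paths in lockstep, re-imposing the source background after each step."""
    for _ in range(num_steps):
        source_latent = toy_denoise_step(source_latent)
        edit_latent = toy_denoise_step(edit_latent)
        edit_latent = masked_replace(edit_latent, source_latent, object_mask)
    return source_latent, edit_latent

rng = np.random.default_rng(0)
frames, channels, height, width = 3, 4, 8, 8           # hypothetical latent shape
source = rng.normal(size=(frames, channels, height, width))
edit = rng.normal(size=(frames, channels, height, width))
mask = np.zeros((frames, 1, height, width), dtype=bool)
mask[:, :, 2:5, 2:5] = True                             # hypothetical object region

final_source, final_edit = run_guided(source, edit, mask)

# Outside the mask the edited path matches the source path exactly.
background = np.broadcast_to(~mask, final_edit.shape)
print(np.array_equal(final_edit[background], final_source[background]))
```

Hard replacement of this kind guarantees an unchanged background but says nothing about blending at the object boundary, which is the part the summary attributes to attention fusion.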

If this is right

The method produces an 18.8 percent gain in PSNR over prior approaches.
It yields a 20.1 percent improvement in SSIM.
It reduces LPIPS by 44.1 percent.
It supplies a streamlined pipeline for high-fidelity video editing that works on existing diffusion models without retraining.
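The percentages above are easiest to read as relative changes against a baseline score, although the abstract does not say so explicitly. The sketch below shows that arithmetic together with a standard PSNR computation; the scores are toy values chosen for illustration and are not taken from the paper, and the function names are hypothetical.

```python
import numpy as np

def psnr(reference, test, max_value=1.0):
    """Peak signal-to-noise ratio in dB for arrays with values in [0, max_value]."""
    mse = np.mean((reference.astype(np.float64) - test.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")
    return 10.0 * np.log10(max_value ** 2 / mse)

def relative_change(ours, baseline):
    """Signed change of `ours` relative to `baseline`, in percent."""
    return 100.0 * (ours - baseline) / baseline

# Toy scores, NOT from the paper: higher is better for PSNR, lower is better for LPIPS.
psnr_gain = relative_change(ours=25.0, baseline=20.0)
lpips_change = relative_change(ours=0.30, baseline=0.40)
print(f"PSNR change: {psnr_gain:+.1f}%  LPIPS change: {lpips_change:+.1f}%")

# PSNR of a constant 0.1 error on a [0, 1] image.
reference = np.zeros((8, 8))
noisy = np.full((8, 8), 0.1)
print(f"PSNR: {psnr(reference, noisy):.2f} dB")
```

Because PSNR is logarithmic, a relative gain in dB does not translate into the same relative reduction in pixel error, which is one reason the absolute scores and baselines matter when judging the claim.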

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same single-frame-plus-prior strategy could be tested on related tasks such as object removal or attribute change in video.
If the priors already encode plausible interactions, longer or more crowded scenes may require only stronger guidance rather than new training data.
The decoupling into one edited frame plus text motion could reduce annotation effort when adapting the technique to new domains.

Load-bearing premise

The generative priors already present in image-to-video diffusion models are enough to carry a single-frame edit forward in time while keeping the background fixed and producing realistic object interactions.

What would settle it

Apply SimInsert to a video containing an inserted object that must interact with moving background elements; if the background changes or the inserted object shows physically implausible motion across frames, the claim is false.
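The background half of this test can be made measurable with a per-frame difference computed outside the inserted object. The sketch below assumes a boolean object mask is available for every frame (for example from the single-frame edit, propagated by hand or by a tracker); the function name, array shapes, and synthetic data are hypothetical and only illustrate the check.

```python
import numpy as np

def background_drift(original, edited, object_mask):
    """Per-frame mean absolute difference outside the inserted-object mask.

    original, edited: float arrays of shape (T, H, W, C) with values in [0, 1]
    object_mask:      bool array of shape (T, H, W), True where the object is
    """
    background = ~object_mask
    per_pixel = np.abs(original - edited).mean(axis=-1)        # (T, H, W)
    totals = (per_pixel * background).sum(axis=(1, 2))
    counts = np.maximum(background.sum(axis=(1, 2)), 1)        # avoid division by zero
    return totals / counts

rng = np.random.default_rng(0)
original = rng.random((4, 16, 16, 3))
mask = np.zeros((4, 16, 16), dtype=bool)
mask[:, 4:10, 4:10] = True

# Case 1: only the object region changes, so the background drift is zero.
edited = original.copy()
edited[mask] = 1.0 - edited[mask]
clean_drift = background_drift(original, edited, mask)

# Case 2: one background pixel in frame 2 is altered, so frame 2 shows drift.
leaky = edited.copy()
leaky[2, 0, 0, :] += 0.5
leaky_drift = background_drift(original, leaky, mask)

print(clean_drift, leaky_drift)
```

A check like this covers only background invariance; whether the inserted object's motion is physically plausible still needs human judgment or a separate metric.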

Figures

Figures reproduced from arXiv: 2605.23245 by Gao Wang, Jiang Lin, Jizhi Zhang, Mingjie Wang, Qiang Tang, Qian Wang, Shenyi Wang, Song Wu, Xinyu Chen, Yuyi Qian, Zhiqiu Zhang, Zili Yi.

**Figure 1.** Qualitative results of SimInsert. The top row displays the edited videos with the inserted objects, while the bottom row shows the corresponding original …

**Figure 2.** Overview of the SimInsert framework. The pipeline integrates first-frame editing, prompt-guided motion propagation, and three core guidance mechanisms, namely Regional Attention Clone (ReAC), Sparse Attention Fusion, and Latent Refresh, into a pretrained Image-to-Video diffusion model. This architecture enables seamless video object insertion and background preservation without requiring training or manual trajecto…

**Figure 3.** Sparse Attention Fusion mechanism. Left: attention patterns before fusion, showing limited cross-path interactions. Center: randomly sampled sparse fusion pattern. Right: post-fusion attention map, with improved blending of original and edited regions, yielding smoother spatial and temporal coherence. B. Sparse Attention Fusion While Regional Attention Clone effectively preserves background content, simp…

**Figure 4.** Qualitative comparison. Top: Original/Input video. Middle: The strongest baseline, AnyV2V. Bottom: SimInsert (Ours).
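The Figure 3 caption describes attention with limited cross-path interaction before fusion and a randomly sampled sparse pattern that opens some of it up. One possible reading, sketched below, treats the original and edited paths as two token groups in a joint attention map: within-path entries are always allowed, and cross-path entries are enabled at random. This is an assumption made for illustration, since the summary gives no equations; `sparse_fusion_mask`, `masked_attention`, and `keep_prob` are hypothetical names.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax; entries set to -inf receive zero weight."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def sparse_fusion_mask(n_src, n_edit, keep_prob, rng):
    """Boolean mask over a joint (src + edit) x (src + edit) attention map."""
    n = n_src + n_edit
    within = np.zeros((n, n), dtype=bool)
    within[:n_src, :n_src] = True          # source tokens attend to source tokens
    within[n_src:, n_src:] = True          # edited tokens attend to edited tokens
    cross = rng.random((n, n)) < keep_prob # random sparse pattern
    cross[:n_src, :n_src] = False          # restrict the random part to cross-path blocks
    cross[n_src:, n_src:] = False
    return within | cross

def masked_attention(q, k, v, mask):
    """Scaled dot-product attention that ignores disallowed (False) entries."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = np.where(mask, scores, -np.inf)
    return softmax(scores) @ v

rng = np.random.default_rng(0)
n_src, n_edit, dim = 3, 2, 4
q = rng.normal(size=(n_src + n_edit, dim))
k = rng.normal(size=(n_src + n_edit, dim))
v = rng.normal(size=(n_src + n_edit, dim))

no_fusion = sparse_fusion_mask(n_src, n_edit, keep_prob=0.0, rng=rng)
some_fusion = sparse_fusion_mask(n_src, n_edit, keep_prob=0.3, rng=rng)
out_before = masked_attention(q, k, v, no_fusion)
out_after = masked_attention(q, k, v, some_fusion)
print(out_before.shape, out_after.shape)
```

With `keep_prob=0.0` the two paths are fully isolated, and raising it lets a random subset of edited tokens read from source tokens and vice versa. Whether the paper samples the pattern per layer, per step, or per head is not stated in this summary.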

Original abstract

Video object insertion requires ensuring spatio-temporal coherence and interactive realism, extending far beyond simple content placement. However, current approaches are often hindered by a reliance on explicit motion engineering or resource-intensive retraining, restricting their flexibility and generalization. To bridge this gap, we present SimInsert, a training-free paradigm that efficiently decouples the task into intuitive single-frame editing and semantic motion description. By harnessing the robust generative priors of image-to-video diffusion models, SimInsert propagates edits temporally, strictly preserving background invariance while enabling plausible, text-driven interactions between the inserted object and the dynamic environment. Our approach hinges on non-invasive guidance mechanisms that enforce structural consistency, facilitate seamless boundary fusion, and counteract the fidelity drift that typically accumulates during the denoising trajectory. Extensive quantitative experiments validate our efficacy: SimInsert surpasses state-of-the-art methods with an 18.8% gain in PSNR, 20.1% in SSIM, and a 44.1% decrease in LPIPS, offering a streamlined solution for high-fidelity video editing.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

SimInsert offers a training-free split of video object insertion into one-frame editing plus text motion guidance on image-to-video diffusion models, with reported metric gains that need full experimental backing to judge.

The main thing here is a training-free approach that edits a single frame and then uses a semantic text description to drive object motion through an existing image-to-video diffusion model, with regional sparse attention fusion to handle blending and consistency. This avoids retraining or building explicit motion models, which keeps it flexible for different videos. The non-invasive guidance steps to enforce structure and limit drift during denoising are a reasonable way to use the model's priors for temporal spread while aiming to leave the background untouched.

If the experiments confirm the numbers, the 18.8% PSNR, 20.1% SSIM, and 44.1% LPIPS improvements would be a practical step for editing workflows. The decoupling itself is a clear framing that matches the problem description without obvious internal contradictions.

The soft spots are the missing specifics on datasets, exact baselines, and how the comparisons were run, which makes it hard to gauge how much the gains depend on the new components versus the base model. The full paper would need clear ablations and failure-case examples to show the method holds when interactions get tricky or backgrounds have motion.

This is for people working on generative video tools who want plug-in options rather than custom training pipelines. A reader already using diffusion models for editing could test the idea quickly. I would send it for peer review because the claims are concrete enough to evaluate and the problem matters, even if revisions will be needed on the evidence side.

Referee Report

2 major / 1 minor

Summary. The paper introduces SimInsert, a training-free paradigm for video object insertion that decouples the task into single-frame editing plus text-based semantic motion description. It leverages generative priors from image-to-video diffusion models to propagate edits temporally while enforcing background invariance and plausible object-environment interactions via non-invasive guidance and regional sparse attention fusion. The central claim is that this yields state-of-the-art results, with reported gains of 18.8% in PSNR, 20.1% in SSIM, and 44.1% reduction in LPIPS over prior methods.

Significance. If the quantitative claims are substantiated with full experimental details, the work would offer a meaningful contribution by demonstrating that diffusion-model priors can handle temporal propagation and interaction realism without retraining or explicit motion modeling, potentially simplifying high-fidelity video editing pipelines.

major comments (2)

[Abstract] The reported metric gains (18.8% PSNR, 20.1% SSIM, 44.1% LPIPS) are presented without any reference to experimental setup, baselines, datasets, number of videos, or evaluation protocol. This absence directly undermines assessment of the central claim that SimInsert surpasses SOTA methods.

[Method] The mechanisms labeled 'regional sparse attention fusion' and 'non-invasive guidance' are described at a high level without equations, pseudocode, or precise definitions of how background invariance is strictly enforced or how fidelity drift is counteracted during denoising. These details are load-bearing for reproducibility and for validating the training-free assertion.

minor comments (1)

Ensure the full manuscript includes ablation studies isolating the contribution of sparse attention versus guidance, plus qualitative examples of failure cases (e.g., complex interactions or fast motion).

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback. We address the two major comments point by point below and commit to revisions that directly resolve the identified gaps in clarity and reproducibility.

Referee: [Abstract] The reported metric gains (18.8% PSNR, 20.1% SSIM, 44.1% LPIPS) are presented without any reference to experimental setup, baselines, datasets, number of videos, or evaluation protocol. This absence directly undermines assessment of the central claim that SimInsert surpasses SOTA methods.

Authors: We agree that the abstract should supply immediate context for the quantitative claims. In the revised manuscript we will expand the abstract to state the evaluation datasets, number of test videos, comparison baselines, and protocol (e.g., frame-wise and video-level metrics) while remaining within length limits. This change will allow readers to assess the reported gains without needing to consult the main text.

Revision: yes

Referee: [Method] The mechanisms labeled 'regional sparse attention fusion' and 'non-invasive guidance' are described at a high level without equations, pseudocode, or precise definitions of how background invariance is strictly enforced or how fidelity drift is counteracted during denoising. These details are load-bearing for reproducibility and for validating the training-free assertion.

Authors: We acknowledge that the current Method section presents these components conceptually. To improve reproducibility we will add (i) the mathematical formulation of regional sparse attention fusion, (ii) pseudocode for the full inference pipeline, and (iii) explicit definitions of the non-invasive guidance terms that enforce background invariance and counteract fidelity drift. These additions will be inserted into the revised Method section.

Revision: yes

Circularity Check

0 steps flagged

No significant circularity identified

The paper describes a training-free method that decouples single-frame editing from temporal propagation using image-to-video diffusion priors and regional sparse attention fusion. No equations, fitted parameters, or derivations are presented in the abstract or method outline that reduce by construction to the inputs. Claims rest on experimental metric improvements rather than self-referential predictions or self-citation chains. The central premise is internally consistent with the stated non-invasive guidance without load-bearing reductions to prior author work or ansatzes.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no explicit free parameters, axioms, or invented entities; the method relies on existing diffusion model priors as background.

Reference graph

Works this paper leans on

24 extracted references · 24 canonical work pages · 6 internal anchors

[1] S. Liu, Y. Zhang, W. Li et al., “Video-p2p: Video editing with cross-attention control,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2024, pp. 8599–8608.

[2] Y. Wei, S. Zhang, Z. Qing et al., “Dreamvideo: Composing your dream videos with customized subject and motion,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2024, pp. 6537–6549.

[3] Y. Gu, Y. Zhou, B. Wu et al., “Videoswap: Customized video subject swapping with interactive semantic point correspondence,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2024, pp. 7621–7630.

[4] T. Hu, L. Li, J. v. d. Weijer et al., “Token Merging for Training-Free Semantic Binding in Text-to-Image Synthesis,” arXiv preprint arXiv:2411.07132, 2024.

[5] T. Brooks, A. Holynski, and A. A. Efros, “InstructPix2Pix: Learning to Follow Image Editing Instructions,” arXiv preprint arXiv:2211.09800, 2023.

[6] Z. Tan, S. Liu, X. Yang et al., “OminiControl: Minimal and Universal Control for Diffusion Transformer,” arXiv preprint arXiv:2411.15098, 2024.

[7] A. Hertz, R. Mokady, J. Tenenbaum et al., “Prompt-to-Prompt Image Editing with Cross Attention Control,” arXiv preprint arXiv:2208.01626, 2022.

[8] C. Qi, X. Cun, Y. Zhang et al., “FateZero: Fusing Attentions for Zero-shot Text-based Video Editing,” arXiv preprint arXiv:2303.09535, 2023.

[9] Y. Wang, Y. Li, M. Liu et al., “Re-Attentional Controllable Video Diffusion Editing,” arXiv preprint arXiv:2412.11710, 2024.

[10] C. Mou, M. Cao, X. Wang et al., “Revideo: Remake a video with motion and content control,” Adv. Neural Inf. Process. Syst., vol. 37, pp. 18481–18505, 2024.

[11] Y. Tu, H. Luo, X. Chen et al., “VideoAnydoor: High-fidelity Video Object Insertion with Precise Motion Control,” arXiv preprint arXiv:2501.01427, 2025.

[12] W. Wang, Y. Chen, Y. Liu et al., “Mvoc: a training-free multiple video object composition method with diffusion models,” arXiv preprint arXiv:2406.15829, 2024.

[13] Z. Yang, J. Teng, W. Zheng et al., “CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer,” arXiv preprint arXiv:2408.06072, 2024.

[14] T. Wan, A. Wang, B. Ai, B. Wen, C. Mao, C.-W. Xie, D. Chen, F. Yu, H. Zhao, J. Yang et al., “Wan: Open and advanced large-scale video generative models,” arXiv preprint arXiv:2503.20314, 2025.

[15] X. Yu, T. Wang, S. Y. Kim et al., “Objectmover: Generative object movement with video prior,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2025, pp. 17682–17691.

[16] D. Ceylan, C.-H. P. Huang, and N. J. Mitra, “Pix2video: Video editing using image diffusion,” in Proc. IEEE/CVF Int. Conf. Comput. Vis., 2023, pp. 23206–23217.

[17] M. Ku, C. Wei, W. Ren et al., “AnyV2V: A Tuning-Free Framework For Any Video-to-Video Editing Tasks,” arXiv preprint arXiv:2403.14468, 2024.

[18] W. Ouyang, Y. Dong, L. Yang et al., “I2VEdit: First-Frame-Guided Video Editing via Image-to-Video Diffusion Models,” arXiv preprint arXiv:2405.16537, 2024.

[19] Y. Lipman, R. T. Q. Chen, H. Ben-Hamu et al., “Flow matching for generative modeling,” arXiv preprint arXiv:2210.02747, 2023.

[20] J. Pont-Tuset, F. Perazzi, S. Caelles et al., “The 2017 DAVIS challenge on video object segmentation,” arXiv preprint arXiv:1704.00675, 2018.

[21] B. Zi, W. Peng, X. Qi et al., “Minimax-remover: Taming bad noise helps video object removal,” arXiv preprint arXiv:2505.24873, 2025.

[22] B. F. Labs, “Flux,” https://github.com/black-forest-labs/flux, 2024.

[23] Y. HaCohen, N. Chiprut, B. Brazowski, D. Shalem, D. Moshe, E. Richardson, E. Levin, G. Shiran, N. Zabari, O. Gordon et al., “Ltx-video: Realtime video latent diffusion,” arXiv preprint arXiv:2501.00103, 2024.

[24] W. Ren, H. Yang, G. Zhang et al., “ConsistI2V: Enhancing Visual Consistency for Image-to-Video Generation,” arXiv preprint arXiv:2402.04324, 2024.
