pith. sign in

arxiv: 2605.24674 · v1 · pith:EIA6XK5Mnew · submitted 2026-05-23 · 💻 cs.CV

Reasoning to Align: Implicit Reasoning in Diffusion Transformers for Video Editing

Pith reviewed 2026-06-30 13:33 UTC · model grok-4.3

classification 💻 cs.CV
keywords video editingdiffusion transformerinstruction-based editingattention alignmenttoken routingimplicit reasoningmutual information
0
0 comments X

The pith

RVEDiT induces implicit reasoning in diffusion transformers by routing LLM-derived editing tokens to shallow blocks and aligning attention features with a reference branch during training.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims existing DiT video editors mix global intent and local evidence across all layers while supervising attention only through pixel loss, which leaves reasoning under-constrained. It proposes that routing specialized editing tokens early and maximizing mutual information between editing and reference attention maps during training will produce internal reasoning that transfers to inference. This would matter because it targets the exact failure modes on localized and compositional edits without adding cost at test time. The resulting model is shown to beat prior methods on standard benchmarks.

Core claim

RVEDiT addresses the two structural limits of undifferentiated conditioning and indirect attention supervision by introducing Granularity-Routed Token Conditioning, which sends learnable editing tokens distilled from a multimodal LLM into shallow transformer blocks while holding native visual and text tokens for deeper blocks, together with Reference-Anchored Attention Alignment, which shares parameters with a reference branch at training time and maximizes mutual information between the two branches' attention features so that the editing branch learns a constrained internal reasoning process usable without the reference at inference.

What carries the argument

Granularity-Routed Token Conditioning paired with Reference-Anchored Attention Alignment, which routes coarse editing signals early and regularizes attention via mutual-information maximization against a training-only reference branch.

If this is right

  • Localized spatial edits become more accurate because coarse editing intent is processed separately from fine visual detail.
  • Compositional instructions are followed more reliably because the coarse-to-fine routing constrains how multiple constraints interact inside the network.
  • Temporal coherence improves as the aligned attention patterns enforce consistency without extra post-processing.
  • Inference cost remains identical to a standard DiT because the reference branch and mutual-information term are used only at training time.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same separation of token granularity and attention alignment could be tested on image-only or 3D diffusion transformers to check whether the coarse-to-fine benefit is video-specific.
  • If the alignment effect holds, future work could replace explicit chain-of-thought modules in other generative models with this cheaper training-time regularizer.
  • The approach suggests that many current failures in instruction-conditioned generation stem from insufficient internal constraints rather than insufficient model capacity.

Load-bearing premise

Maximizing mutual information between the attention features of the editing and reference branches during training will cause the editing branch to develop internal reasoning that improves edits when the reference branch is removed at inference.

What would settle it

An ablation that trains identical models with and without the mutual-information alignment term and measures no difference in localized or compositional edit metrics on the same benchmarks.

Figures

Figures reproduced from arXiv: 2605.24674 by Lin Liu, Qi Tian, Xiaopeng Zhang, Yan Li.

Figure 1
Figure 1. Figure 1: Overview of the RVEDiT framework. (a) Training: Given a source video and instruction, we extract learnable editing tokens via an MLLM. The Granularity-Routed Token Conditioning module (§4.2) routes these abstract tokens to shallow DiT layers to establish global intent, while routing native visual/textual tokens to deeper layers for local refinement. Concurrently, Reference￾Anchored Attention Alignment (§4.… view at source ↗
Figure 2
Figure 2. Figure 2: Qualitative comparison on OpenVE-Bench across the four edit categories. RVEDiT localizes the edit precisely, preserves non-edited content, and remains temporally stable, whereas baselines display mislocalization, leakage, residual traces, or over-stylization. 5.2 Ablation and Parameter Study We now isolate the two mechanisms introduced in §4 (GRTC refers to §4.2, RAAA refers to §4.3), and examine whether t… view at source ↗
Figure 3
Figure 3. Figure 3: Qualitative examples of the component ablations [PITH_FULL_IMAGE:figures/full_fig_p008_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Mechanistic analysis of RVEDiT. (a) Attention mass on learnable tokens and native [PITH_FULL_IMAGE:figures/full_fig_p009_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Additional examples of RVEDiT on the Local Add task. 16 [PITH_FULL_IMAGE:figures/full_fig_p016_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Additional examples of RVEDiT on the Local Remove task. 17 [PITH_FULL_IMAGE:figures/full_fig_p017_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Additional examples of RVEDiT on the Local Change task. InfoNCE temperature τ=0.07 and weight λalign=0.1. The diffusion transformer and the MLLM￾to-DiT connectors are fully fine-tuned, while the MLLM language tower is adapted with frozen LoRA [22] adapters (rank 64) initialized from a first-stage checkpoint, and its vision encoder is kept frozen. We optimize the joint objective in Eq. (6) with AdamW at a l… view at source ↗
Figure 8
Figure 8. Figure 8: Additional examples of RVEDiT on the Global Style task. The table groups the ten evaluated configurations into (i) the two sweeps over the Granularity-Routed Token Conditioning (GRTC) routing depth Ls of Eq. (3) and the Reference-Anchored Attention Alignment (RAAA) weight λalign of Eq. (6), and (ii) the three component-removal variants (w/o GRTC, w/o RAAA, w/o both). This decomposition makes it explicit wh… view at source ↗
Figure 9
Figure 9. Figure 9: The visualization of different methods on "Local Add" task. [PITH_FULL_IMAGE:figures/full_fig_p020_9.png] view at source ↗
Figure 10
Figure 10. Figure 10: The visualization of different methods on "Local Remove" task. [PITH_FULL_IMAGE:figures/full_fig_p021_10.png] view at source ↗
Figure 11
Figure 11. Figure 11: The visualization of different methods on "Local Replace" task. [PITH_FULL_IMAGE:figures/full_fig_p022_11.png] view at source ↗
Figure 12
Figure 12. Figure 12: The visualization of different methods on "Style Transfer" task. [PITH_FULL_IMAGE:figures/full_fig_p023_12.png] view at source ↗
read the original abstract

Instruction-based video editing requires transforming a source video according to a natural-language instruction while preserving irrelevant content and remaining temporally coherent. We argue that existing Diffusion Transformer (DiT) editors struggle with this task for two structural reasons. First, conditioning signals are fed undifferentiated into all transformer blocks, forcing a single token stream to encode both global editing intent and fine-grained visual evidence. Second, the cross-attention patterns that govern the edit are supervised only indirectly through pixel-level reconstruction, leaving the model's internal reasoning process under-constrained. To address both limitations, we propose RVEDiT, an implicit Reasoning Video Editing DiT framework built around two complementary components. The first, Granularity-Routed Token Conditioning, introduces learnable editing tokens distilled from a multimodal LLM and routes them to shallow blocks, while reserving native visual and textual tokens for deeper blocks, thereby inducing a coarse-to-fine editing process inside the backbone. The second, Reference-Anchored Attention Alignment, employs a parameter-sharing reference branch during training and maximizes the mutual information between the attention features of the editing and reference branches, regularizing the model's internal reasoning without incurring any additional inference cost. Experiments on standard instruction-based video editing benchmarks show that RVEDiT consistently outperforms state-of-the-art baselines, with particularly strong gains on localized and compositional edits.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The manuscript proposes RVEDiT, a Diffusion Transformer framework for instruction-based video editing. It introduces two components: Granularity-Routed Token Conditioning, which distills editing tokens from a multimodal LLM and routes them to shallow blocks while reserving visual/textual tokens for deeper blocks, and Reference-Anchored Attention Alignment, which uses a parameter-sharing reference branch during training and maximizes mutual information between attention features of the editing and reference branches. The paper claims these induce implicit coarse-to-fine reasoning that transfers to inference without extra cost, yielding consistent outperformance over state-of-the-art baselines on standard benchmarks, especially for localized and compositional edits.

Significance. If the performance gains are substantiated and attributable to the proposed mechanisms rather than other factors, the work could meaningfully advance instruction-based video editing by offering a training-only regularization strategy that addresses under-constrained internal reasoning in DiTs without inference overhead. The focus on routing and alignment for compositional edits targets a recognized limitation in current conditioning approaches.

major comments (3)
  1. Abstract: the claim that 'RVEDiT consistently outperforms state-of-the-art baselines' supplies no quantitative metrics, baseline names, dataset details, ablation results, or statistical significance tests, making it impossible to evaluate whether the data support the performance claim or to attribute gains to the proposed components.
  2. Description of Reference-Anchored Attention Alignment: the central assumption that maximizing mutual information between editing and reference attention features during training produces better internal reasoning that transfers to inference (after dropping the reference branch) is load-bearing for the claimed gains, yet no direct support is provided such as attention map comparisons, probing tasks, or controlled ablations removing the MI term.
  3. Description of Granularity-Routed Token Conditioning: the assumption that routing LLM-derived editing tokens to shallow blocks induces an effective coarse-to-fine editing process is also unverified, with no ablations or visualizations demonstrating the routing's contribution to handling localized/compositional edits versus simply using the multimodal tokens.
minor comments (1)
  1. Abstract: 'standard instruction-based video editing benchmarks' are referenced without naming the specific datasets or protocols used, which should be clarified for reproducibility.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback. We agree that the abstract requires more concrete quantitative support and that additional direct evidence for the two proposed mechanisms would strengthen the claims. We will make revisions accordingly and respond to each major comment below.

read point-by-point responses
  1. Referee: Abstract: the claim that 'RVEDiT consistently outperforms state-of-the-art baselines' supplies no quantitative metrics, baseline names, dataset details, ablation results, or statistical significance tests, making it impossible to evaluate whether the data support the performance claim or to attribute gains to the proposed components.

    Authors: We agree that the abstract would be more informative with explicit metrics. The experiments section already contains these details (quantitative comparisons on standard benchmarks against named baselines, ablation tables, and dataset descriptions). In the revision we will move key numbers, baseline names, and references to the ablation studies into the abstract itself. revision: yes

  2. Referee: Description of Reference-Anchored Attention Alignment: the central assumption that maximizing mutual information between editing and reference attention features during training produces better internal reasoning that transfers to inference (after dropping the reference branch) is load-bearing for the claimed gains, yet no direct support is provided such as attention map comparisons, probing tasks, or controlled ablations removing the MI term.

    Authors: The existing ablation that removes the entire alignment module shows measurable drops on localized edits, providing indirect support. To address the request for direct evidence we will add (i) side-by-side attention-map visualizations of the editing versus reference branches and (ii) a controlled ablation that isolates the mutual-information term while keeping the reference branch. revision: yes

  3. Referee: Description of Granularity-Routed Token Conditioning: the assumption that routing LLM-derived editing tokens to shallow blocks induces an effective coarse-to-fine editing process is also unverified, with no ablations or visualizations demonstrating the routing's contribution to handling localized/compositional edits versus simply using the multimodal tokens.

    Authors: The manuscript already contains an ablation that compares routed versus non-routed placement of the editing tokens and reports gains on compositional edits. To make the coarse-to-fine claim more explicit we will add block-wise activation visualizations that contrast the effect of routing the LLM tokens to shallow layers versus feeding them uniformly. revision: yes

Circularity Check

0 steps flagged

No circularity: method introduces independent training components with empirical validation

full rationale

The paper's derivation chain consists of proposing two new architectural/training mechanisms (Granularity-Routed Token Conditioning via LLM-distilled tokens routed to shallow blocks, and Reference-Anchored Attention Alignment via parameter-shared reference branch with mutual information maximization on attention features) to address stated limitations in DiT conditioning and supervision. No equations, self-definitional reductions, fitted-input-as-prediction steps, or load-bearing self-citations appear in the provided text that would make any claimed result equivalent to its inputs by construction. The central claims rest on the introduction of these components plus benchmark experiments, rendering the chain self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no free parameters, axioms, or invented entities can be identified from the provided text.

pith-pipeline@v0.9.1-grok · 5764 in / 1142 out tokens · 34440 ms · 2026-06-30T13:33:56.062307+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

61 extracted references · 25 canonical work pages · 12 internal anchors

  1. [1]

    GPT-4 Technical Report

    J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. Aleman, D. Almeida, J. Altenschmidt, S. Altman, S. Anadkat, et al. Gpt-4 technical report.arXiv preprint arXiv:2303.08774, 2023

  2. [2]

    Agarwal, S

    A. Agarwal, S. Karanam, K. Joseph, A. Saxena, K. Goswami, and B. V . Srinivasan. A-star: Test- time attention segregation and retention for text-to-image synthesis. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 2283–2293, 2023

  3. [3]

    Q. Bai, Q. Wang, H. Ouyang, H. Wang, W. Wang, K. L. Cheng, S. Ma, Y . Zeng, Y . Yu, Z. Liu, et al. Ditto: Scaling instruction-based video editing with a high-quality synthetic dataset

  4. [4]

    Q. Bai, Q. Wang, H. Ouyang, Y . Yu, H. Wang, W. Wang, K. L. Cheng, S. Ma, Y . Zeng, Z. Liu, et al. Scaling instruction-based video editing with a high-quality synthetic dataset.arXiv preprint arXiv:2510.15742, 2025

  5. [5]

    S. Bai, Y . Cai, R. Chen, K. Chen, X. Chen, Z. Cheng, L. Deng, W. Ding, C. Gao, C. Ge, et al. Qwen3-vl technical report.arXiv preprint arXiv:2511.21631, 2025

  6. [6]

    Brooks, A

    T. Brooks, A. Holynski, and A. A. Efros. Instructpix2pix: Learning to follow image editing instructions. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 18392–18402, 2023

  7. [7]

    Brown, B

    T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, et al. Language models are few-shot learners.Advances in neural information processing systems, 33:1877–1901, 2020

  8. [8]

    M. Cao, X. Wang, Z. Qi, Y . Shan, X. Qie, and Y . Zheng. Masactrl: Tuning-free mutual self-attention control for consistent image synthesis and editing. InProceedings of the IEEE/CVF international conference on computer vision, pages 22560–22570, 2023

  9. [9]

    Chefer, Y

    H. Chefer, Y . Alaluf, Y . Vinker, L. Wolf, and D. Cohen-Or. Attend-and-excite: Attention-based semantic guidance for text-to-image diffusion models.ACM transactions on Graphics (TOG), 42(4):1–10, 2023

  10. [10]

    Chefer, S

    H. Chefer, S. Zada, R. Paiss, A. Ephrat, O. Tov, M. Rubinstein, L. Wolf, T. Dekel, T. Michaeli, and I. Mosseri. Still-moving: Customized video generation without customized video data.ACM Transactions on Graphics (TOG), 43(6):1–11, 2024

  11. [11]

    S. Chen, M. Xu, J. Ren, Y . Cong, S. He, Y . Xie, A. Sinha, P. Luo, T. Xiang, and J.-M. Perez-Rua. Gentron: Diffusion transformers for image and video generation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6441–6451, 2024

  12. [12]

    S. X. Chen, M. Sra, and P. Sen. Instruct-clip: Improving instruction-guided image editing with automated data refinement using contrastive learning. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 28513–28522, 2025

  13. [13]

    arXiv preprint arXiv:2311.00213 (2023)

    J. Cheng, T. Xiao, and T. He. Consistent video-to-video transfer using synthetic dataset.arXiv preprint arXiv:2311.00213, 2023

  14. [14]

    Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

    G. Comanici, E. Bieber, M. Schaekermann, I. Pasupat, N. Sachdeva, I. Dhillon, M. Blistein, O. Ram, D. Zhang, E. Rosen, et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities.arXiv preprint arXiv:2507.06261, 2025

  15. [15]

    X. Cong, H. Yang, A. Wang, Y . Wang, Y . Yang, C. Zhang, and C. Ma. Viva: Vlm-guided instruction-based video editing with reward optimization.arXiv preprint arXiv:2512.16906, 2025

  16. [16]

    Esser, S

    P. Esser, S. Kulal, A. Blattmann, R. Entezari, J. Müller, H. Saini, Y . Levi, D. Lorenz, A. Sauer, F. Boesel, et al. Scaling rectified flow transformers for high-resolution image synthesis. InForty-first international conference on machine learning, 2024

  17. [17]

    TokenFlow: Consistent Diffusion Features for Consistent Video Editing

    M. Geyer, O. Bar-Tal, S. Bagon, and T. Dekel. Tokenflow: Consistent diffusion features for consistent video editing.arXiv preprint arXiv:2307.10373, 2023. 10

  18. [18]

    H. He, J. Wang, J. Zhang, Z. Xue, X. Bu, Q. Yang, S. Wen, and L. Xie. Openve-3m: A large-scale high-quality dataset for instruction-guided video editing.arXiv preprint arXiv:2512.07826, 2025

  19. [19]

    Prompt-to-Prompt Image Editing with Cross Attention Control

    A. Hertz, R. Mokady, J. Tenenbaum, K. Aberman, Y . Pritch, and D. Cohen-Or. Prompt-to-prompt image editing with cross attention control.arXiv preprint arXiv:2208.01626, 2022

  20. [20]

    J. Ho, A. Jain, and P. Abbeel. Denoising diffusion probabilistic models.Advances in neural information processing systems, 33:6840–6851, 2020

  21. [21]

    N. Ho, S. Bae, T. Kim, H. Jo, Y . Kim, T. Schuster, A. Fisch, J. Thorne, and S.-Y . Yun. Block transformer: Global-to-local language modeling for fast inference.Advances in Neural Information Processing Systems, 37:48740–48783, 2024

  22. [22]

    E. J. Hu, Y . Shen, P. Wallis, Z. Allen-Zhu, Y . Li, S. Wang, L. Wang, W. Chen, et al. Lora: Low-rank adaptation of large language models.Iclr, 1(2):3, 2022

  23. [23]

    Jiang, T

    Y . Jiang, T. Wu, S. Yang, C. Si, D. Lin, Y . Qiao, C. C. Loy, and Z. Liu. Videobooth: Diffusion-based video generation with image prompts. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6689–6700, 2024

  24. [24]

    Jiang, Z

    Z. Jiang, Z. Han, C. Mao, J. Zhang, Y . Pan, and Y . Liu. Vace: All-in-one video creation and editing. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 17191–17202, 2025

  25. [25]

    M. Ku, C. Wei, W. Ren, H. Yang, and W. Chen. Anyv2v: A tuning-free framework for any video-to-video editing tasks.arXiv preprint arXiv:2403.14468, 2024

  26. [26]

    X. Liao, X. Zeng, Z. Song, Z. Fu, G. Yu, and G. Lin. In-context learning with unpaired clips for instruction-based video editing.arXiv preprint arXiv:2510.14648, 2025

  27. [27]

    Y . Lin, G. Liang, Z. Zeng, Z. Bai, Y . Chen, and M. Z. Shou. Kiwi-edit: Versatile video editing via instruction and reference guidance.arXiv preprint arXiv:2603.02175, 2026

  28. [28]

    Flow Matching for Generative Modeling

    Y . Lipman, R. T. Chen, H. Ben-Hamu, M. Nickel, and M. Le. Flow matching for generative modeling. arXiv preprint arXiv:2210.02747, 2022

  29. [29]

    L. Liu, T. Ma, B. Li, Z. Chen, J. Liu, G. Li, S. Zhou, Q. He, and X. Wu. Phantom: Subject-consistent video generation via cross-modal alignment. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 14951–14961, 2025

  30. [30]

    S. Liu, Y . Zhang, W. Li, Z. Lin, and J. Jia. Video-p2p: Video editing with cross-attention control. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8599–8608, 2024

  31. [31]

    Y . Mao, Z. Qin, X. Shen, J. Zhou, J. Zhang, Y . Zhong, and Y . Dai. Ldt: Efficient scalable video generation using linear diffusion transformer.IEEE Transactions on Multimedia, 2026

  32. [32]

    A. v. d. Oord, Y . Li, and O. Vinyals. Representation learning with contrastive predictive coding.arXiv preprint arXiv:1807.03748, 2018

  33. [33]

    Ouyang, Y

    W. Ouyang, Y . Dong, L. Yang, J. Si, and X. Pan. I2vedit: First-frame-guided video editing via image-to- video diffusion models. InSIGGRAPH Asia 2024 Conference Papers, pages 1–11, 2024

  34. [34]

    Patashnik, D

    O. Patashnik, D. Garibi, I. Azuri, H. Averbuch-Elor, and D. Cohen-Or. Localizing object-level shape variations with text-to-image diffusion models. InProceedings of the IEEE/CVF international conference on computer vision, pages 23051–23061, 2023

  35. [35]

    Peebles and S

    W. Peebles and S. Xie. Scalable diffusion models with transformers. InProceedings of the IEEE/CVF international conference on computer vision, pages 4195–4205, 2023

  36. [36]

    Movie Gen: A Cast of Media Foundation Models

    A. Polyak, A. Zohar, A. Brown, A. Tjandra, A. Sinha, A. Lee, A. Vyas, B. Shi, C.-Y . Ma, C.-Y . Chuang, et al. Movie gen: A cast of media foundation models.arXiv preprint arXiv:2410.13720, 2024

  37. [37]

    C. Qi, X. Cun, Y . Zhang, C. Lei, X. Wang, Y . Shan, and Q. Chen. Fatezero: Fusing attentions for zero-shot text-based video editing. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 15932–15942, 2023

  38. [38]

    B. Qin, J. Li, S. Tang, T.-S. Chua, and Y . Zhuang. Instructvid2vid: Controllable video editing with natural language instructions. In2024 IEEE International Conference on Multimedia and Expo (ICME), pages 1–6. IEEE, 2024. 11

  39. [39]

    Rassin, E

    R. Rassin, E. Hirsch, D. Glickman, S. Ravfogel, Y . Goldberg, and G. Chechik. Linguistic binding in diffusion models: Enhancing attribute correspondence through attention map alignment.Advances in Neural Information Processing Systems, 36:3536–3559, 2023

  40. [40]

    W. Ren, H. Yang, G. Zhang, C. Wei, X. Du, W. Huang, and W. Chen. Consisti2v: Enhancing visual consistency for image-to-video generation.arXiv preprint arXiv:2402.04324, 2024

  41. [41]

    Rombach, A

    R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer. High-resolution image synthesis with latent diffusion models. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10684–10695, 2022

  42. [42]

    Shagidanov, H

    A. Shagidanov, H. Poghosyan, X. Gong, Z. Wang, S. Navasardyan, and H. Shi. Grounded-instruct-pix2pix: Improving instruction based image editing with automatic target grounding. InICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 6585–6589. IEEE, 2024

  43. [43]

    Swetha, J

    S. Swetha, J. Yang, T. Neiman, M. N. Rizve, S. Tran, B. Yao, T. Chilimbi, and M. Shah. X-former: Unifying contrastive and reconstruction learning for mllms.arXiv preprint arXiv:2407.13851, 2024

  44. [44]

    Z. Tan, H. Yang, L. Qin, J. Gong, M. Yang, and H. Li. Omni-video: Democratizing unified video understanding and generation.arXiv preprint arXiv:2507.06119, 2025

  45. [45]

    R. Tang, L. Liu, A. Pandey, Z. Jiang, G. Yang, K. Kumar, P. Stenetorp, J. Lin, and F. Türe. What the daam: Interpreting stable diffusion using cross attention. InProceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 5644–5659, 2023

  46. [46]

    D. Team. Lucy edit: Open-weight text-guided video editing, 2025

  47. [47]

    S. Tu, Q. Dai, Z. Zhang, S. Xie, Z.-Q. Cheng, C. Luo, X. Han, Z. Wu, and Y .-G. Jiang. Motionfollower: Editing video motion via score-guided diffusion. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 12822–12831, 2025

  48. [48]

    Tumanyan, M

    N. Tumanyan, M. Geyer, S. Bagon, and T. Dekel. Plug-and-play diffusion features for text-driven image-to- image translation. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 1921–1930, 2023

  49. [49]

    T. Wan, A. Wang, B. Ai, B. Wen, C. Mao, C.-W. Xie, D. Chen, F. Yu, H. Zhao, J. Yang, et al. Wan: Open and advanced large-scale video generative models.arXiv preprint arXiv:2503.20314, 2025

  50. [50]

    J. Z. Wu, Y . Ge, X. Wang, S. W. Lei, Y . Gu, Y . Shi, W. Hsu, Y . Shan, X. Qie, and M. Z. Shou. Tune-a-video: One-shot tuning of image diffusion models for text-to-video generation. InProceedings of the IEEE/CVF international conference on computer vision, pages 7623–7633, 2023

  51. [51]

    G. Xiao, T. Yin, W. T. Freeman, F. Durand, and S. Han. Fastcomposer: Tuning-free multi-subject image generation with localized attention.International Journal of Computer Vision, 133(3):1175–1194, 2025

  52. [52]

    H. Yang, Z. Tan, J. Gong, L. Qin, H. Chen, X. Yang, Y . Sun, Y . Lin, M. Yang, and H. Li. Omni- video 2: Scaling mllm-conditioned diffusion for unified video generation and editing.arXiv preprint arXiv:2602.08820, 2026

  53. [53]

    H. Ye, J. Zhang, S. Liu, X. Han, and W. Yang. Ip-adapter: Text compatible image prompt adapter for text-to-image diffusion models.arXiv preprint arXiv:2308.06721, 2023

  54. [54]

    S. Yu, D. Liu, Z. Ma, Y . Hong, Y . Zhou, H. Tan, J. Chai, and M. Bansal. Veggie: Instructional editing and reasoning video concepts with grounded generation. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 15147–15158, 2025

  55. [55]

    Zhang, A

    L. Zhang, A. Rao, and M. Agrawala. Adding conditional control to text-to-image diffusion models. In Proceedings of the IEEE/CVF international conference on computer vision, pages 3836–3847, 2023

  56. [56]

    Meta-CoT: Enhancing Granularity and Generalization in Image Editing

    S. Zhang, Y . Cheng, T. Hang, Z. Yin, R. He, Y . Xu, W. Dai, Y . Lin, C. Wang, Q. Lu, et al. Meta-cot: Enhancing granularity and generalization in image editing.arXiv preprint arXiv:2604.24625, 2026

  57. [57]

    Zhang, Z

    Z. Zhang, Z. Dai, L. Qin, and W. Wang. Effived: Efficient video editing via text-instruction diffusion models.arXiv preprint arXiv:2403.11568, 2024

  58. [58]

    Region-Constraint In-Context Generation for Instructional Video Editing

    Z. Zhang, F. Long, W. Li, Z. Qiu, W. Liu, T. Yao, and T. Mei. Region-constraint in-context generation for instructional video editing.arXiv preprint arXiv:2512.17650, 2025. 12

  59. [59]

    R. Zhao, Y . Gu, J. Z. Wu, D. J. Zhang, J.-W. Liu, W. Wu, J. Keppo, and M. Z. Shou. Motiondirector: Motion customization of text-to-video diffusion models. InEuropean Conference on Computer Vision, pages 273–290. Springer, 2024

  60. [60]

    Y . Zhu, H. Wang, S. Ma, W. Zhao, Y . Tang, L. Chen, and J. Zhou. Fade: Frequency-aware diffusion model factorization for video editing. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 28426–28435, 2025

  61. [61]

    A golden retrieverwearing a red harness is walking slowly with its nose close to the dry, leaf-covered ground in a fenced yard next to a bush and a road in the background

    B. Zi, P. Ruan, M. Chen, X. Qi, S. Hao, S. Zhao, Y . Huang, B. Liang, R. Xiao, and K.-F. Wong. Se \˜ norita-2m: A high-quality instruction-based dataset for general video editing by video specialists.arXiv preprint arXiv:2502.06734, 2025. 13 A Related Work A.1 Reference-guided Video Generation and Editing A complementary line of work conditions generation...