
arxiv: 2606.01399 · v1 · pith:B2DNAQWLnew · submitted 2026-05-31 · 💻 cs.CV

PAI-Studio: Cinematic Video Background Replacement with Camera-Aware Motion

Pith reviewed 2026-06-28 17:12 UTC · model grok-4.3

classification 💻 cs.CV
keywords video background replacement · diffusion transformer · cinematic video synthesis · foreground relighting · reference-conditioned generation · camera-aware motion · bidirectional attention

The pith

PAI-Studio uses bidirectional attention in a Diffusion Transformer to replace video backgrounds while preserving foreground motion, identity, and lighting.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a reference-conditioned video synthesis task for cinematic background replacement that requires generating dynamic backgrounds aligned with foreground motion, matching reference appearance, and achieving consistent illumination with realistic relighting. It reformulates the task as in-context conditional generation on a Diffusion Transformer video backbone, where bidirectional attention jointly processes foreground dynamics and background reference information. The authors construct a 30K-scale dataset from high-quality films and videos to train the model. Extensive evaluations show the resulting system outperforms existing open-source and commercial solutions on motion-consistent background generation, high-fidelity foreground relighting, and foreground identity preservation.

Core claim

Reformulating cinematic background replacement as in-context conditional generation on a Diffusion Transformer video backbone with bidirectional attention, and training it on a 30K film-derived dataset, produces dynamic backgrounds that align with foreground motion while preserving foreground identity and achieving globally consistent illumination with realistic relighting.

What carries the argument

Bidirectional attention within a Diffusion Transformer video backbone that jointly captures foreground dynamics and background reference information in a unified architecture.
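
A minimal sketch of the mechanism named above, assuming NumPy and a single attention head: tokens from the foreground video, the background references, and the noisy target latents are concatenated into one sequence, and attention is computed with no causal mask so every token can attend to every other. Token counts, dimensions, and all names (fg_tokens, ref_tokens, noisy_tokens, bidirectional_attention) are hypothetical illustrations; the paper's actual tokenization, position encoding, and backbone are not described on this page.

```python
import numpy as np

def bidirectional_attention(tokens, w_q, w_k, w_v):
    """Full (unmasked) self-attention: every token may attend to every other."""
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights

rng = np.random.default_rng(0)
dim = 16
# Hypothetical token counts, chosen only for illustration.
fg_tokens = rng.normal(size=(6, dim))     # foreground video condition
ref_tokens = rng.normal(size=(4, dim))    # background reference images
noisy_tokens = rng.normal(size=(6, dim))  # target latents being denoised

# In-context conditioning: conditions and target share one token sequence.
sequence = np.concatenate([fg_tokens, ref_tokens, noisy_tokens], axis=0)
w_q, w_k, w_v = (rng.normal(size=(dim, dim)) / np.sqrt(dim) for _ in range(3))
out, weights = bidirectional_attention(sequence, w_q, w_k, w_v)

# Attention mass that the target tokens place on each condition group.
target_rows = weights[-6:]
mass_on_fg = target_rows[:, :6].sum(axis=1)
mass_on_ref = target_rows[:, 6:10].sum(axis=1)
```

Because no mask is applied, the target rows place nonzero weight on both condition groups, and the condition tokens attend to the target as well, which is what separates this pattern from causal or one-way cross-attention conditioning.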

If this is right

  • The model produces backgrounds whose motion matches the foreground camera path without separate pose estimation.
  • Foreground objects receive relighting that matches the new background's illumination while retaining original appearance details.
  • Identity of foreground subjects remains stable across frames and across different reference backgrounds.
  • The unified architecture avoids compositing artifacts such as inconsistent boundaries that appear in prior open-source and API systems.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The approach may extend to other reference-conditioned video tasks such as object insertion or style transfer if the same bidirectional attention pattern is retained.
  • Performance on arbitrary real-world footage could be further improved by adding lightweight camera or lighting adapters on top of the existing backbone.
  • The 30K dataset construction process itself offers a template for curating cinematic training data for related synthesis problems.

Load-bearing premise

A single bidirectional-attention Diffusion Transformer trained on the authors' 30K film-derived dataset will generalize to arbitrary real-world camera motion and lighting without additional explicit camera-pose or lighting estimation modules.

What would settle it

Quantitative failure on a held-out test set of real-world videos containing camera trajectories or lighting conditions absent from the 30K training set, measured by drops in motion consistency, relighting fidelity, or identity preservation metrics below commercial baselines.

Figures

Figures reproduced from arXiv: 2606.01399 by Bangxun Tang, Guian Fang, Heyuan Gao, Jie Yang, Mike Zheng Shou, Yiren Song, Zijian He.

Figure 1: Given a foreground video and background reference images, our method synthesizes a complete video with camera-aware…
Figure 2: Overview of the PAI-Studio architecture. Multi-condition inputs, including multiple background reference images, the…
Figure 3: Overview of the CineStudio data construction pipeline.
Figure 4: Generation results of PAI-Studio. Readers can click and play the video clips in this figure using…
Figure 5: Comparison results show that our method consistently outperforms baseline approaches in structural fidelity, illumination har…
Figure 6: Ablation study results. The full model achieves the best or near-best performance, validating the importance of position encoding for motion consistency, illumination adaptation for realistic integration, and structured annotation for detailed guidance. Readers can click and play the videos in this figure using Adobe Acrobat
Figure 7: User study results comparing our method with open…
Figure 8: User study interface and workflow. (a) Study instructions, (b) side-by-side presentation of a test case with input references and anonymized results, and (c) the questionnaire with five evaluation criteria (Overall Preference, Background Consistency, Background Motion Rationality, Illumination Harmony, and Edge Harmony).
… is most consistent with the reference image? 3. Background Motion Rationality: Which…
Figure 9: Statistical significance of user study results. (a) Heatmap of our method’s win rate (%) in pairwise comparisons against each baseline across five evaluation dimensions. All cells demonstrate statistical significance (p < 0.01, denoted by ** or ***). (b) Significance level (−log10 p) for the Overall Preference dimension; all comparisons safely exceed the p = 0.001 threshold.
… our method achieves a decisiv…
Figure 10: Gemini evaluation prompt templates (Cine-Restore Subset). Three metrics compare the generated composite against the ground truth video. Each prompt instructs Gemini 3 Pro [48] to return a 1–10 score with a brief rationale.
… out-of-bounds points from skewing the evaluation, we apply a strict joint visibility mask. Specifically, a point’s trajectory is only considered valid for comparison if its visibility c…
Figure 12: Gemini evaluation prompt templates (Cine-NBG Subset, part 2/2). Edge Cons evaluates matte/edge quality of the generated video. FG Preserve compares the foreground region in the generated composite with the original foreground reference extracted from the green-screen video.
… fidence remains greater than 0.5 in both the generated and GT videos simultaneously. Scale-Invariant Trajectory Comparison. A key ch…
Figure 13: Visual results of excluded open-source IV2V baselines. When tasked with green-screen background replacement given a reference background (Column 1) and an input foreground video (Column 2), existing models completely fail to synthesize a coherent video. They suffer from severe hallucination, identity loss, semantic misinterpretation of the green screen, or complete conditioning collapse into noise. Reader…
Figure 14: Correlation between human preference and Gemini evaluation. The scatter plots illustrate the relationship between the human vote proportion and the Gemini softmax probability for each method. The Spearman rank correlation (ρ) ranges from 0.503 to 0.682 across all individual dimensions, with an overall combined correlation of 0.595 (all with p < 0.001). This indicates a robust and statistically significan…
Figure 15: Superior edge harmonization. Compared to baselines that suffer from severe green spill and boundary artifacts (highlighted by red boxes), our model generates clean, natural edges without residual green contamination.
… video (T2V) or purely image-to-video (I2V) models that lack Image-Video-to-Video (IV2V) support. Specifically, these models cannot perform video background replacement conditioned on one or …
Figure 16: Robustness to imperfect foreground segmentation. We highlight four challenging cases where the input green-screen videos (top rows) contain mask defects, as indicated by the red bounding boxes. These include severely corrupted body parts, arbitrary holes, and artificial occlusions. Without any explicit inpainting prompts, our model (bottom rows) robustly reconstructs the missing foreground structures whil…
Figure 17: Implicit scene-adaptive relighting. Our model automatically harmonizes the foreground illumination with the newly synthesized backgrounds without requiring any explicit relighting prompts or maps
Figure 18: Effect of multi-frame background control. Comparing inference results conditioned on 1, 2, and 3 background reference images. The 3-reference control effectively anchors the background at the beginning, middle, and end temporal locations, yielding the most temporally coherent results that closely match the ground truth (GT). Fewer reference frames lead to information deficits at unconditioned time steps, …
Figure 19: Visual demonstration of the limitations of efficient conditioning strategies. Both spatial downsampling (a) and temporal downsampling (b) introduce noticeable visual degradation. Key artifacts include severe facial degradation, the emergence of black blotches in the background, and the deformation or disappearance of structural symbols.
F. Superior Edge Harmonization and Green Spill Suppression A fundamen…
Figure 21: Prompt template for reference-conditioned structured caption generation without GT. Compared with…
Figure 22: Failure case examples. We highlight three typical failure modes: (1) artifacts near transparent/refractive boundaries (e.g., glass edges), (2) occasional abnormal foreground relighting (stochastic; alleviated by re-sampling seeds), and (3) unrealistic background details under abrupt scene transitions with weak semantic continuity. Readers can click and play the video clips in this figure using Adobe Acrob…
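
The body-text fragments attached to the Figure 10 and Figure 12 captions above describe a joint visibility mask: a tracked point's trajectory counts only if its visibility confidence stays above 0.5 in both the generated and the ground-truth video. The sketch below illustrates that rule in NumPy under stated assumptions: confidences are stored as (num_frames, num_points) arrays, "remains greater than 0.5" is read as holding in every frame, and the function names are hypothetical. The plain mean distance is only a stand-in for demonstration; the paper's scale-invariant trajectory comparison is truncated on this page and is not reproduced here.

```python
import numpy as np

def joint_visibility_mask(conf_gen, conf_gt, threshold=0.5):
    """Boolean mask over points, shape (num_points,).

    conf_gen, conf_gt: (num_frames, num_points) visibility confidences for
    the generated and ground-truth videos. A point is kept only if both
    confidences exceed the threshold in every frame (assumed reading).
    """
    visible_both = (conf_gen > threshold) & (conf_gt > threshold)
    return visible_both.all(axis=0)

def masked_mean_distance(traj_gen, traj_gt, mask):
    """Mean Euclidean distance over valid points; NaN if none are valid."""
    if not mask.any():
        return float("nan")
    dist = np.linalg.norm(traj_gen[:, mask] - traj_gt[:, mask], axis=-1)
    return float(dist.mean())

# Toy data: 3 frames, 4 tracked points.
conf_gt = np.array([[0.9, 0.9, 0.9, 0.9],
                    [0.9, 0.4, 0.9, 0.9],   # point 1 drops out in GT
                    [0.9, 0.9, 0.9, 0.9]])
conf_gen = np.array([[0.8, 0.8, 0.8, 0.8],
                     [0.8, 0.8, 0.8, 0.8],
                     [0.8, 0.8, 0.3, 0.8]])  # point 2 drops out in generated
mask = joint_visibility_mask(conf_gen, conf_gt)

traj_gt = np.zeros((3, 4, 2))
traj_gen = np.zeros((3, 4, 2))
traj_gen[:, 0] = [3.0, 4.0]    # valid point, offset by distance 5
traj_gen[:, 1] = [100.0, 0.0]  # masked out, so its large error is ignored
error = masked_mean_distance(traj_gen, traj_gt, mask)
```

In this toy case only points 0 and 3 survive the mask, so the large offset on point 1 does not affect the score.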
Original abstract

We present PAI-Studio, a new reference-conditioned video synthesis task that addresses a long-standing challenge in cinematic background replacement: generating dynamic backgrounds aligned with foreground motion while preserving foreground identity, matching reference scene appearance, and achieving globally consistent illumination with realistic foreground relighting. Existing open-source systems and commercial APIs cannot simultaneously ensure motion-consistent background generation, high-fidelity foreground relighting and foreground identity preservation, often resulting in static backgrounds, inconsistent boundaries, and noticeable compositing artifacts. To bridge this gap, we build upon a Diffusion Transformer video backbone and reformulate the problem as an in-context conditional generation task. Through bidirectional attention, our model jointly captures foreground dynamics and background reference information within a unified architecture. We further construct a 30K-scale dataset sourced from high-quality films and online videos to support this task. Extensive evaluations demonstrate that our method significantly outperforms existing open-source and commercial API solutions.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The manuscript presents PAI-Studio, a reference-conditioned video synthesis method for cinematic background replacement. It builds on a Diffusion Transformer backbone, reformulating the task as in-context conditional generation with bidirectional attention to jointly model foreground dynamics and background references. A 30K-scale dataset sourced from films supports training. The central claim is that the approach significantly outperforms existing open-source and commercial solutions in motion-consistent background generation, high-fidelity foreground relighting, and foreground identity preservation.

Significance. If the empirical claims are substantiated, the work would advance video synthesis for film and media applications by showing that a single bidirectional-attention DiT can implicitly handle camera-aware motion and global illumination without dedicated pose or lighting modules.

major comments (2)
  1. [Abstract] The claim that the method 'significantly outperforms existing open-source and commercial API solutions' is asserted without any supporting quantitative metrics, baselines, ablation studies, or error analysis. This is load-bearing for the central contribution.
  2. [Method] The central claim rests on the assumption that bidirectional attention in a single DiT, conditioned only on foreground and reference, suffices to capture camera-aware dynamics and global illumination from the 30K film-derived dataset without explicit camera-pose or lighting estimation modules. No tests for out-of-distribution real-world motions or lighting are described, which is load-bearing for the generalization and outperformance assertions.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. Below we address each major point directly, indicating planned revisions where appropriate.

Point-by-point responses
  1. Referee: [Abstract] The claim that the method 'significantly outperforms existing open-source and commercial API solutions' is asserted without any supporting quantitative metrics, baselines, ablation studies, or error analysis. This is load-bearing for the central contribution.

    Authors: The abstract is a concise summary; the full manuscript supplies the requested quantitative metrics (including PSNR/SSIM/LPIPS and user-study scores), multiple baselines, ablations, and error analysis in the Experiments section. We will revise the abstract to include one or two representative quantitative results so the outperformance claim is immediately supported by numbers. revision: partial

  2. Referee: [Method] The central claim rests on the assumption that bidirectional attention in a single DiT, conditioned only on foreground and reference, suffices to capture camera-aware dynamics and global illumination from the 30K film-derived dataset without explicit camera-pose or lighting estimation modules. No tests for out-of-distribution real-world motions or lighting are described, which is load-bearing for the generalization and outperformance assertions.

    Authors: Our experiments show that bidirectional attention within the single DiT, trained on the diverse 30K film corpus, implicitly learns camera-aware motion and global illumination without auxiliary pose or lighting modules, as evidenced by the reported gains in motion consistency and relighting fidelity. The dataset already spans a wide range of professional camera movements and lighting conditions. We acknowledge that explicit OOD testing on non-cinematic real-world footage is not presented; we will add a limitations paragraph clarifying the intended cinematic scope and the absence of broader OOD claims. revision: partial

Circularity Check

0 steps flagged

No significant circularity: empirical claims rest on external benchmarks and new dataset

full rationale

The paper introduces a reference-conditioned video synthesis task solved via a Diffusion Transformer backbone reformulated as in-context conditional generation with bidirectional attention, trained on a newly constructed 30K film-derived dataset. Central claims of outperformance in motion-consistent backgrounds, relighting, and identity preservation are supported by evaluations against external open-source systems and commercial APIs. No equations, fitted parameters, self-citations, uniqueness theorems, or ansatzes are described that reduce any prediction or result to the paper's own inputs by construction. The derivation chain is self-contained as an empirical engineering contribution evaluated on independent test cases.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No mathematical derivations, free parameters, or invented physical entities are described; the work is an empirical system whose central claim rests on the unstated assumption that the collected 30K dataset distribution matches real cinematic use cases.



Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. GeoEdit: Geometry-Aware Object Editing via Dual-Branch Denoising

    cs.CV 2026-06 unverdicted novelty 6.0

    GeoEdit introduces a Lift-Manipulate-Render-Denoise pipeline with dual-branch denoising and variance-homogeneous injection for 3D-consistent object editing in single photos.

  2. LiveEdit: Towards Real-Time Diffusion-Based Streaming Video Editing

    cs.CV 2026-06 unverdicted novelty 6.0

    LiveEdit distills a bidirectional video foundation model into a unidirectional streaming editor via three-stage training plus mask caching to reach 12.66 FPS with stable edits.

Reference graph

Works this paper leans on

73 extracted references · 14 linked inside Pith · cited by 2 Pith papers

  1. [1]

    Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, et al. Qwen3-vl technical report. arXiv preprint arXiv:2511.21631, 2025.

  2. [2]

    Omer Bar-Tal, Dolev Ofri-Amar, Rafail Fridman, Yoni Kasten, and Tali Dekel. Text2live: Text-driven layered image and video editing. In European conference on computer vision, pages 707–723. Springer, 2022.

  3. [3]

    Yuxuan Bian, Zhaoyang Zhang, Xuan Ju, Mingdeng Cao, Liangbin Xie, Ying Shan, and Qiang Xu. Videopainter: Any-length video inpainting and editing with plug-and-play context control. In Proceedings of the Special Interest Group on Computer Graphics and Interactive Techniques Conference Conference Papers, pages 1–12, 2025.

  4. [4]

    Andreas Blattmann, Robin Rombach, Huan Ling, Tim Dockhorn, Seung Wook Kim, Sanja Fidler, and Karsten Kreis. Align your latents: High-resolution video synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 22563–22575, 2023.

  5. [5]

    Tim Brooks, Bill Peebles, Connor Holmes, Will DePue, Yufei Guo, Leo Jing, David Schnurr, Joe Taylor, Troy Luhman, Eric Luhman, et al. Video generation models as world simulators. OpenAI Blog, 1(8):1, 2024.

  6. [6]

    Minghong Cai, Xiaodong Cun, Xiaoyu Li, Wenze Liu, Zhaoyang Zhang, Yong Zhang, Ying Shan, and Xiangyu Yue. Ditctrl: Exploring attention control in multi-modal diffusion transformer for tuning-free multi-prompt longer video generation. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 7763–7772, 2025.

  7. [7]

    Nicolas Carion, Laura Gustafson, Yuan-Ting Hu, Shoubhik Debnath, Ronghang Hu, Didac Suris, Chaitanya Ryali, Kalyan Vasudev Alwala, Haitham Khedr, Andrew Huang, et al. Sam 3: Segment anything with concepts. arXiv preprint arXiv:2511.16719, 2025.

  8. [8]

    Haoxin Chen, Menghan Xia, Yingqing He, Yong Zhang, Xiaodong Cun, Shaoshu Yang, Jinbo Xing, Yaofang Liu, Qifeng Chen, Xintao Wang, et al. Videocrafter1: Open diffusion models for high-quality video generation. arXiv preprint arXiv:2310.19512, 2023.

  9. [9]

    Pengtao Chen, Xianfang Zeng, Maosen Zhao, Peng Ye, Mingzhu Shen, Wei Cheng, Gang Yu, and Tao Chen. Sparse-vdit: Unleashing the power of sparse attention to accelerate video diffusion transformers. arXiv preprint arXiv:2506.03065, 2025.

  10. [10]

    Xuewei Chen, Zhimin Chen, and Yiren Song. Transanimate: Taming layer diffusion to generate rgba video. arXiv preprint arXiv:2503.17934, 2025.

  11. [11]

    Yinan Chen, Jiangning Zhang, Teng Hu, Yuxiang Zeng, Zhucun Xue, Qingdong He, Chengjie Wang, Yong Liu, Xiaobin Hu, and Shuicheng Yan. Ivebench: Modern benchmark suite for instruction-guided video editing assessment. arXiv preprint arXiv:2510.11647, 2025.

  12. [12]

    Patrick Esser, Johnathan Chiu, Parmida Atighehchian, Jonathan Granskog, and Anastasis Germanidis. Structure and content-guided video synthesis with diffusion models. In Proceedings of the IEEE/CVF international conference on computer vision, pages 7346–7356, 2023.

  13. [13]

    Chenjian Gao, Lihe Ding, Xin Cai, Zhanpeng Huang, Zibin Wang, and Tianfan Xue. Lora-edit: Controllable first-frame-guided video editing via mask-aware lora fine-tuning. arXiv preprint arXiv:2506.10082, 2025.

  14. [14]

    Peng Gao, Le Zhuo, Dongyang Liu, Ruoyi Du, Xu Luo, Longtian Qiu, Yuhang Zhang, Chen Lin, Rongjie Huang, Shijie Geng, et al. Lumina-t2x: Transforming text into any modality, resolution, and duration via flow-based large diffusion transformers. arXiv preprint arXiv:2405.05945, 2024.

  15. [15]

    Michal Geyer, Omer Bar-Tal, Shai Bagon, and Tali Dekel. Tokenflow: Consistent diffusion features for consistent video editing. arXiv preprint arXiv:2307.10373, 2023.

  16. [16]

    Jonathan Ho, William Chan, Chitwan Saharia, Jay Whang, Ruiqi Gao, Alexey Gritsenko, Diederik P Kingma, Ben Poole, Mohammad Norouzi, David J Fleet, et al. Imagen video: High definition video generation with diffusion models. arXiv preprint arXiv:2210.02303, 2022.

  17. [17]

    Wenyi Hong, Ming Ding, Wendi Zheng, Xinghan Liu, and Jie Tang. Cogvideo: Large-scale pretraining for text-to-video generation via transformers. arXiv preprint arXiv:2205.15868, 2022.

  18. [18]

    Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Liang Wang, Weizhu Chen, et al. Lora: Low-rank adaptation of large language models. ICLR, 1(2):3, 2022.

  19. [19]

    Zhihao Hu and Dong Xu. Videocontrolnet: A motion-guided video-to-video translation framework by using diffusion model with controlnet. arXiv preprint arXiv:2307.14073,

  20. [20]

    Zeyinzi Jiang, Zhen Han, Chaojie Mao, Jingfeng Zhang, Yulin Pan, and Yu Liu. Vace: All-in-one video creation and editing. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 17191–17202, 2025.

  21. [21]

    Nikita Karaev, Yuri Makarov, Jianyuan Wang, Natalia Neverova, Andrea Vedaldi, and Christian Rupprecht. Cotracker3: Simpler and better point tracking by pseudo-labelling real videos. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 6013–6022,

  22. [22]

    Weijie Kong, Qi Tian, Zijian Zhang, Rox Min, Zuozhuo Dai, Jin Zhou, Jiangfeng Xiong, Xin Li, Bo Wu, Jianwei Zhang, et al. Hunyuanvideo: A systematic framework for large video generative models. arXiv preprint arXiv:2412.03603, 2024.

  23. [23]

    Max Ku, Cong Wei, Weiming Ren, Harry Yang, and Wenhu Chen. Anyv2v: A tuning-free framework for any video-to-video editing tasks. arXiv preprint arXiv:2403.14468, 2024.

  24. [24]

    Zhen Li, Cheng-Ze Lu, Jianhua Qin, Chun-Le Guo, and Ming-Ming Cheng. Towards an end-to-end framework for flow-guided video inpainting. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 17562–17571, 2022.

  25. [25]

    Shanchuan Lin, Andrey Ryabtsev, Soumyadip Sengupta, Brian L Curless, Steven M Seitz, and Ira Kemelmacher-Shlizerman. Real-time high-resolution background matting. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8762–8771, 2021.

  26. [26]

    Shanchuan Lin, Linjie Yang, Imran Saleemi, and Soumyadip Sengupta. Robust high-resolution video matting with temporal guidance. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 238–247,

  27. [27]

    Cheng Liu, Yiren Song, Haofan Wang, and Mike Zheng Shou. Omnipsd: Layered psd generation with diffusion transformer. arXiv preprint arXiv:2512.09247, 2025.

  28. [28]

    Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017.

  29. [29]

    Yue Ma, Yingqing He, Xiaodong Cun, Xintao Wang, Siran Chen, Xiu Li, and Qifeng Chen. Follow your pose: Pose-guided text-to-video generation using pose-free videos. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 4117–4125, 2024.

  30. [30]

    Yue Ma, Hongyu Liu, Hongfa Wang, Heng Pan, Yingqing He, Junkun Yuan, Ailing Zeng, Chengfei Cai, Heung-Yeung Shum, Wei Liu, et al. Follow-your-emoji: Fine-controllable and expressive freestyle portrait animation. In SIGGRAPH Asia 2024 Conference Papers, pages 1–12, 2024.

  31. [31]

    Yue Ma, Yingqing He, Hongfa Wang, Andong Wang, Leqi Shen, Chenyang Qi, Jixuan Ying, Chengfei Cai, Zhifeng Li, Heung-Yeung Shum, et al. Follow-your-click: Open-domain regional image animation via motion prompts. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 6018–6026, 2025.

  32. [32]

    Yue Ma, Yulong Liu, Qiyuan Zhu, Ayden Yang, Kunyu Feng, Xinhua Zhang, Zhifeng Li, Sirui Han, Chenyang Qi, and Qifeng Chen. Follow-your-motion: Video motion transfer via efficient spatial-temporal decoupled finetuning. arXiv preprint arXiv:2506.05207, 2025.

  33. [33]

    William Peebles and Saining Xie. Scalable diffusion models with transformers. In Proceedings of the IEEE/CVF international conference on computer vision, pages 4195–4205,

  34. [34]

    Jordi Pont-Tuset, Federico Perazzi, Sergi Caelles, Pablo Arbeláez, Alex Sorkine-Hornung, and Luc Van Gool. The 2017 davis challenge on video object segmentation. arXiv preprint arXiv:1704.00675, 2017.

  35. [35]

    Chenyang Qi, Xiaodong Cun, Yong Zhang, Chenyang Lei, Xintao Wang, Ying Shan, and Qifeng Chen. Fatezero: Fusing attentions for zero-shot text-based video editing. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 15932–15942, 2023.

  36. [36]

    R Rombach. High-resolution image synthesis with latent diffusion models [internet]. arXiv [cs.CV], 2021.

  37. [37]

    Uriel Singer, Adam Polyak, Thomas Hayes, Xi Yin, Jie An, Songyang Zhang, Qiyuan Hu, Harry Yang, Oron Ashual, Oran Gafni, et al. Make-a-video: Text-to-video generation without text-video data. In International Conference on Learning Representations, 2023.

  38. [38]

    Uriel Singer, Amit Zohar, Yuval Kirstain, Shelly Sheynin, Adam Polyak, Devi Parikh, and Yaniv Taigman. Video editing via factorized diffusion distillation. In European Conference on Computer Vision, pages 450–466. Springer, 2024.

  39. [39]

    Quanjian Song, Yiren Song, Kelly Peng, Yuan Gao, and Mike Zheng Shou. Worldwander: Bridging egocentric and exocentric worlds in video generation. arXiv preprint arXiv:2511.22098, 2025.

  40. [40]

    Yiren Song, Shijie Huang, Chen Yao, Xiaojun Ye, Hai Ci, Jiaming Liu, Yuxuan Zhang, and Mike Zheng Shou. Processpainter: Learn painting process from sequence data. arXiv preprint arXiv:2406.06062, 2024.

  41. [41]

    Layertracer: Cognitive-aligned layered svg synthesis via diffusion transformer

    Yiren Song, Danze Chen, and Mike Zheng Shou. Layertracer: Cognitive-aligned layered svg synthesis via diffusion transformer. arXiv preprint arXiv:2502.01105, 2025. 3

  42. [42]

    Mitty: Diffusion-based human-to-robot video generation

    Yiren Song, Cheng Liu, Weijia Mao, and Mike Zheng Shou. Mitty: Diffusion-based human-to-robot video generation. arXiv preprint arXiv:2512.17253, 2025. 2

  43. [43]

    Makeanything: Harnessing diffusion transformers for multi-domain procedural sequence generation

    Yiren Song, Cheng Liu, and Mike Zheng Shou. Makeanything: Harnessing diffusion transformers for multi-domain procedural sequence generation. arXiv preprint arXiv:2502.01572, 2025. 3

  44. [44]

    Omniconsistency: Learning style-agnostic consistency from paired stylization data

    Yiren Song, Cheng Liu, and Mike Zheng Shou. Omniconsistency: Learning style-agnostic consistency from paired stylization data. arXiv preprint arXiv:2505.18445, 2025. 2

  45. [45]

    Omnihumanoid: Streaming cross-embodiment video generation with paired-free adaptation

    Yiren Song, Xiyao Deng, Pei Yang, Yihan Wang, and Mike Zheng Shou. Omnihumanoid: Streaming cross-embodiment video generation with paired-free adaptation. arXiv preprint arXiv:2605.12038, 2026

  46. [46]

    Vista: Triplet-supervised video style transfer with diffusion transformers

    Yiren Song, Wangzi Yao, Haofan Wang, and Mike Zheng Shou. Vista: Triplet-supervised video style transfer with diffusion transformers. arXiv preprint arXiv:2605.17312, 2026. 2

  47. [47]

    Deep video matting via spatio-temporal alignment and aggregation

    Yanan Sun, Guanzhi Wang, Qiao Gu, Chi-Keung Tang, and Yu-Wing Tai. Deep video matting via spatio-temporal alignment and aggregation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6975–6984, 2021. 3

  48. [48]

    Gemini: a family of highly capable multimodal models

    Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, Katie Millican, et al. Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023. 4, 6, 3, 12

  49. [49]

    Wan: Open and advanced large-scale video generative models

    Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, et al. Wan: Open and advanced large-scale video generative models. arXiv preprint arXiv:2503.20314, 2025. 2, 3, 15

  50. [50]

    Video-to-video synthesis

    Ting-Chun Wang, Ming-Yu Liu, Jun-Yan Zhu, Guilin Liu, Andrew Tao, Jan Kautz, and Bryan Catanzaro. Video-to-video synthesis. In Advances in Neural Information Processing Systems, 2018. 3

  51. [51]

    Videocomposer: Compositional video synthesis with motion controllability

    Xiang Wang, Hangjie Yuan, Shiwei Zhang, Dayou Chen, Jiuniu Wang, Yingya Zhang, Yujun Shen, Deli Zhao, and Jingren Zhou. Videocomposer: Compositional video synthesis with motion controllability. Advances in Neural Information Processing Systems, 36:7594–7611, 2023. 8

  52. [52]

    Image quality assessment: from error visibility to structural similarity

    Zhou Wang, Alan C Bovik, Hamid R Sheikh, and Eero P Simoncelli. Image quality assessment: from error visibility to structural similarity. IEEE transactions on image processing, 13(4):600–612, 2004. 7

  53. [53]

    Tune-a-video: One-shot tuning of image diffusion models for text-to-video generation

    Jay Zhangjie Wu, Yixiao Ge, Xintao Wang, Stan Weixian Lei, Yuchao Gu, Yufei Shi, Wynne Hsu, Ying Shan, Xiaohu Qie, and Mike Zheng Shou. Tune-a-video: One-shot tuning of image diffusion models for text-to-video generation. In Proceedings of the IEEE/CVF international conference on computer vision, pages 7623–7633, 2023. 3

  54. [54]

    Dynamicrafter: Animating open-domain images with video diffusion priors

    Jinbo Xing, Menghan Xia, Yong Zhang, Haoxin Chen, Wangbo Yu, Hanyuan Liu, Gongye Liu, Xintao Wang, Ying Shan, and Tien-Tsin Wong. Dynamicrafter: Animating open-domain images with video diffusion priors. In European Conference on Computer Vision, pages 399–417. Springer, 2024. 3

  55. [55]

    Motioncanvas: Cinematic shot design with controllable image-to-video generation

    Jinbo Xing, Long Mai, Cusuh Ham, Jiahui Huang, Aniruddha Mahapatra, Chi-Wing Fu, Tien-Tsin Wong, and Feng Liu. Motioncanvas: Cinematic shot design with controllable image-to-video generation. In Proceedings of the Special Interest Group on Computer Graphics and Interactive Techniques Conference Conference Papers, pages 1–11, 2025. 3

  56. [56]

    Magicanimate: Temporally consistent human image animation using diffusion model

    Zhongcong Xu, Jianfeng Zhang, Jun Hao Liew, Hanshu Yan, Jia-Wei Liu, Chenxu Zhang, Jiashi Feng, and Mike Zheng Shou. Magicanimate: Temporally consistent human image animation using diffusion model. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1481–1490, 2024. 8

  57. [57]

    Bachvid: Training-free video generation with consistent background and character

    Han Yan, Xibin Song, Yifu Wang, Hongdong Li, Pan Ji, and Chao Ma. Bachvid: Training-free video generation with consistent background and character. arXiv preprint arXiv:2510.21696, 2025. 3

  58. [58]

    Motion-conditioned image animation for video editing

    Wilson Yan, Andrew Brown, Pieter Abbeel, Rohit Girdhar, and Samaneh Azadi. Motion-conditioned image animation for video editing. arXiv preprint arXiv:2311.18827, 2023. 3

  59. [59]

    Rerender a video: Zero-shot text-guided video-to-video translation

    Shuai Yang, Yifan Zhou, Ziwei Liu, and Chen Change Loy. Rerender a video: Zero-shot text-guided video-to-video translation. In SIGGRAPH Asia 2023 Conference Papers, pages 1–11, 2023. 3

  60. [60]

    Cogvideox: Text-to-video diffusion models with an expert transformer

    Zhuoyi Yang, Jiayan Teng, Wendi Zheng, Ming Ding, Shiyu Huang, Jiazheng Xu, Yuanming Yang, Wenyi Hong, Xiaohan Zhang, Guanyu Feng, et al. Cogvideox: Text-to-video diffusion models with an expert transformer. arXiv preprint arXiv:2408.06072, 2024. 2

  61. [61]

    Transparent image layer diffusion using latent transparency

    Lvmin Zhang and Maneesh Agrawala. Transparent image layer diffusion using latent transparency. arXiv preprint arXiv:2402.17113, 2024. 3

  62. [62]

    The unreasonable effectiveness of deep features as a perceptual metric

    Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 586–595, 2018. 7

  63. [63]

    Easycontrol: Adding efficient and flexible control for diffusion transformer

    Yuxuan Zhang, Yirui Yuan, Yiren Song, Haofan Wang, and Jiaming Liu. Easycontrol: Adding efficient and flexible control for diffusion transformer. arXiv preprint arXiv:2503.07027, 2025. 2

  64. [64]

    Propainter: Improving propagation and transformer for video inpainting

    Shangchen Zhou, Chongyi Li, Kelvin CK Chan, and Chen Change Loy. Propainter: Improving propagation and transformer for video inpainting. In Proceedings of the IEEE/CVF international conference on computer vision, pages 10477–10486, 2023. 3

    PAI-Studio: Cinematic Video Background Replacement with Camera-Aware Motion

    Supplementary Material

    The supplementary m...

    We provide comprehensive details regarding the user study, including the interface, evaluation criteria, and statistical significance analysis in Sec. A.

    We detail the prompts and metrics used for our Gemini-based automated semantic evaluation in Sec. B, and the methodology for evaluating motion fidelity using CoTracker3 in Sec. C.

    We present the statistical correlation between our automated evaluation and human preference in Sec. D.

    We analyze the failure modes of the excluded open-source baselines in Sec. E.

    We provide additional qualitative results, demonstrating superior edge harmonization (Sec. F), robustness to imperfect masks (Sec. G), implicit scene-adaptive relighting (Sec. H), and the effect of multi-frame control (Sec. J).

    We discuss our explorations into efficient conditioning strategies in Sec. K.

    We provide the prompt templates used for generating structured video annotations in Sec. I.

    We report the computational cost and resources for training and inference in Sec. L.

    We present an analysis of typical failure cases in Sec. M.

    A. User Study Details

    The interface and workflow of our user study are illustrated in Fig. 8. We designed a comprehensive evaluation pipeline to compare the perceptual quality of our method against five representative baselines.

    Experimental Protocol. As shown in the study instructions (Fig. 8a),...