VISTA: Triplet-Supervised Video Style Transfer with Diffusion Transformers

2026 · cs.CV · arXiv 2605.17312

2 Pith papers cite this work. Polarity classification is still indexing.

abstract

Video style transfer aims to render videos in a target artistic style while preserving content, structure, and motion. While image stylization has advanced rapidly, video stylization remains challenging due to temporal inconsistency. Most existing methods stylize frames or keyframes and enforce consistency via heuristic temporal propagation, which is brittle under occlusions, disocclusions, and long-term motion, leading to drift and flickering artifacts. We argue that a fundamental bottleneck lies in the lack of large-scale triplet data and a principled training paradigm that jointly models and disentangles style, content, and motion. To address this, we introduce VISTA-1000, a synthetic dataset with 1,000 styles and motion-aligned triplets of style reference, clean video, and stylized video, and propose a diffusion-transformer-based in-context video style transfer framework with a lightweight style adapter for robust style extraction. Extensive experiments demonstrate state-of-the-art performance in style fidelity, temporal consistency, and content preservation.

representative citing papers

GeoEdit: Geometry-Aware Object Editing via Dual-Branch Denoising

cs.CV · 2026-06-29 · unverdicted · novelty 6.0

GeoEdit introduces a Lift-Manipulate-Render-Denoise pipeline with dual-branch denoising and variance-homogeneous injection for 3D-consistent object editing in single photos.

PAI-Studio: Cinematic Video Background Replacement with Camera-Aware Motion

cs.CV · 2026-05-31 · unverdicted · novelty 6.0

PAI-Studio reformulates cinematic background replacement as in-context conditional generation inside a Diffusion Transformer with bidirectional attention, trained on a new 30K film-sourced dataset, and reports better motion consistency and relighting than prior open-source and commercial systems.

citing papers explorer

GeoEdit: Geometry-Aware Object Editing via Dual-Branch Denoising

cs.CV · 2026-06-29 · unverdicted · none · ref 61 · internal anchor

PAI-Studio: Cinematic Video Background Replacement with Camera-Aware Motion

cs.CV · 2026-05-31 · unverdicted · none · ref 46 · internal anchor
