MakeAnything: Harnessing diffusion transformers for multi-domain procedural sequence generation
6 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary: background 3
citation-polarity summary: background 3
fields: cs.CV 6
years: 2026 6
verdicts: UNVERDICTED 6
citing papers explorer
- VISTA: Triplet-Supervised Video Style Transfer with Diffusion Transformers
  VISTA introduces a new synthetic triplet dataset and diffusion-transformer framework with style adapter that jointly models style, content, and motion to achieve state-of-the-art video style transfer.
- OmniHumanoid: Streaming Cross-Embodiment Video Generation with Paired-Free Adaptation
  OmniHumanoid factorizes transferable motion learning from embodiment-specific adaptation to enable scalable cross-embodiment video generation without paired data for new humanoids.
- EditTransfer++: Toward Faithful and Efficient Visual-Prompt-Guided Image Editing
  EditTransfer++ delivers state-of-the-art faithfulness to visual editing examples and faster inference by removing text conditioning during fine-tuning and applying best-worst contrastive refinement plus condition compression.
- EasyVFX: Frequency-Driven Decoupling for Resource-Efficient VFX Generation
  EasyVFX decouples VFX generation via frequency-aware Mixture-of-Experts and test-time training to achieve realistic effects with limited resources.
- SWEET: Sparse World Modeling with Image Editing for Embodied Task Execution
  SWEET is a one-shot sparse visual planning framework that progressively generates manipulation keyframes via image editing conditioned on language and spatial guidance, then converts them to actions with a diffusion predictor, showing better fidelity and lower cost than video models on DROID and Rob
- Image-to-Video Diffusion: From Foundations to Open Frontiers
  A survey that organizes diffusion image-to-video methods into a taxonomy, distills core designs in condition encoding, temporal modeling, noise prior, and upsampling, and discusses applications plus challenges.