MultiAnimate adds Identifier Assigner and Identifier Adapter modules to diffusion video models so they can handle multiple characters without identity mix-ups, generalizing from two-character training data to more characters.
arXiv preprint arXiv:2412.09645 , year =
7 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
fields
cs.CV 7roles
background 1polarities
background 1representative citing papers
TunerDiT adds event-partitioned masking and cross-event prompt fusion to diffusion transformers for training-free multi-event video generation, with gains scaling by event count on a new Meve benchmark.
Fine-tuning text-to-video models on sparse low-quality synthetic data for physical camera controls outperforms fine-tuning on photorealistic data.
EditVerse unifies image and video editing and generation in one transformer model via unified token sequences and in-context learning, trained jointly on curated video editing data plus image/video corpora and evaluated on a new instruction-based benchmark.
VBench-2.0 is a benchmark suite that automatically evaluates video generative models on five dimensions of intrinsic faithfulness: Human Fidelity, Controllability, Creativity, Physics, and Commonsense using VLMs, LLMs, and anomaly detection methods.
Hybrid system that uses ray-traced 3D Gaussians to supply radiometric guidance and material regularization to a neural renderer for editable, realistic output from captured scenes.
A new keyframe selection framework combines structural, tracking, and semantic criteria to select reliable anchor frames for diffusion-based video editing under occlusion.
citing papers explorer
-
MultiAnimate: Pose-Guided Image Animation Made Extensible
MultiAnimate adds Identifier Assigner and Identifier Adapter modules to diffusion video models so they can handle multiple characters without identity mix-ups, generalizing from two-character training data to more characters.
-
TunerDiT: Training-free Progressive Steering of Diffusion Transformer for Multi-Event Video Generation
TunerDiT adds event-partitioned masking and cross-event prompt fusion to diffusion transformers for training-free multi-event video generation, with gains scaling by event count on a new Meve benchmark.
-
Less is More: Data-Efficient Adaptation for Controllable Text-to-Video Generation
Fine-tuning text-to-video models on sparse low-quality synthetic data for physical camera controls outperforms fine-tuning on photorealistic data.
-
EditVerse: Unifying Image and Video Editing and Generation with In-Context Learning
EditVerse unifies image and video editing and generation in one transformer model via unified token sequences and in-context learning, trained jointly on curated video editing data plus image/video corpora and evaluated on a new instruction-based benchmark.
-
TRON: Tracing Rays to Orchestrate a Neural Renderer for 3D Gaussian Reconstructions
Hybrid system that uses ray-traced 3D Gaussians to supply radiometric guidance and material regularization to a neural renderer for editable, realistic output from captured scenes.
-
Occlusion-Aware Physics-Semantic Keyframe Selection for Robust Video Editing
A new keyframe selection framework combines structural, tracking, and semantic criteria to select reliable anchor frames for diffusion-based video editing under occlusion.