CameraCtrl enables accurate camera pose control in video diffusion models through a trained plug-and-play module and dataset choices emphasizing diverse camera trajectories with matching appearance.
Proceedings of the IEEE/CVF International Conference on Computer Vision , pages=
4 Pith papers cite this work. Polarity classification is still indexing.
verdicts
UNVERDICTED 4representative citing papers
Video-LLaVA creates a unified visual representation for images and videos via pre-projection alignment, enabling mutual enhancement from joint training and strong results on image and video benchmarks.
LanguageBind aligns video, infrared, depth, and audio to a frozen language encoder via contrastive learning on the new VIDAL-10M dataset, extending video-language pretraining to N modalities.
The paper defines Agent AI as interactive multimodal systems that perceive grounded data and generate embodied actions, arguing this approach can mitigate hallucinations in foundation models.
citing papers explorer
-
CameraCtrl: Enabling Camera Control for Text-to-Video Generation
CameraCtrl enables accurate camera pose control in video diffusion models through a trained plug-and-play module and dataset choices emphasizing diverse camera trajectories with matching appearance.
-
Video-LLaVA: Learning United Visual Representation by Alignment Before Projection
Video-LLaVA creates a unified visual representation for images and videos via pre-projection alignment, enabling mutual enhancement from joint training and strong results on image and video benchmarks.
-
LanguageBind: Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment
LanguageBind aligns video, infrared, depth, and audio to a frozen language encoder via contrastive learning on the new VIDAL-10M dataset, extending video-language pretraining to N modalities.
-
Agent AI: Surveying the Horizons of Multimodal Interaction
The paper defines Agent AI as interactive multimodal systems that perceive grounded data and generate embodied actions, arguing this approach can mitigate hallucinations in foundation models.