Talker-T2AV achieves better lip-sync accuracy, video quality, and audio quality than dual-branch baselines by separating high-level shared autoregressive modeling from modality-specific low-level diffusion refinement in a joint audio-video generation framework.
hub Mixed citations
Liveportrait: Efficient portrait animation with stitching and retargeting control
Mixed citation behavior. Most common role is background (60%).
hub tools
citation-role summary
citation-polarity summary
fields
cs.CV 18representative citing papers
AvatarPointillist autoregressively generates adaptive 3D point clouds via Transformer for photorealistic 4D Gaussian avatars from one image, jointly predicting animation bindings and using a conditioned Gaussian decoder.
UIKA is a feed-forward animatable Gaussian head model using UV-guided correspondence estimation and learnable UV tokens with dual-level attention, trained on large-scale synthetic data to handle pose-free inputs.
ViBES introduces a speech-language-behavior model using modality-specific transformer experts that jointly generates dialogue and 3D body actions, showing gains over separate co-speech and text-to-motion baselines on multi-turn metrics.
Omni-Attribute is a new open-vocabulary image attribute encoder trained on semantically linked pairs with dual objectives to produce disentangled representations for personalization and compositional generation.
A pose-conditioned large-margin contrastive encoder isolates persistent biometric identity cues from transmitted latents in talking-head videoconferencing to flag impersonation attacks via cosine similarity without inspecting the output video.
Phoneme-guided autoregressive framework for talking-head animation that reduces inter-frame flicker via causal keyframe generation and timestamp-aware interpolation, outperforming diffusion baselines on FVD and a new BG-Flicker metric.
Durian introduces a dual-reference diffusion model trained via self-reconstruction on video frames to enable cross-identity attribute transfer in portrait animations, supporting multi-attribute composition and interpolation.
TOPOS creates high-fidelity 3D heads with fixed industry topology from single images via a specialized VAE with Perceiver Resampler and a rectified flow transformer.
Deepfake detectors act as alpha blending searchers; training solely on self-blended real images yields top cross-dataset generalization on 15 datasets without using synthetic deepfakes.
SocialDirector uses spatiotemporal actor masking and directional reweighting on cross-attention maps to reduce actor-action mismatches and improve target-directed interactions in generated multi-person videos.
MeshLAM reconstructs high-fidelity animatable textured mesh head avatars from a single image via a feed-forward dual shape-texture architecture with iterative GRU decoding and reprojection-based guidance.
ARGen generates high-fidelity dynamic facial expression videos using affective semantic injection and adaptive reinforcement diffusion to improve emotion recognition models facing data scarcity and long-tail distributions.
JAM-Flow introduces a unified flow-matching model with a Multi-Modal Diffusion Transformer that jointly synthesizes facial motion and speech from text, audio, or motion inputs.
EasyVFX decouples VFX generation via frequency-aware Mixture-of-Experts and test-time training to achieve realistic effects with limited resources.
PortraitDirector uses hierarchical disentanglement of spatial physical motions and semantic emotions to deliver controllable, high-fidelity real-time facial reenactment at 20 FPS.
JoyVASA decouples static 3D facial representations from identity-independent dynamic motion sequences generated by a diffusion transformer to produce audio-driven animations for humans and animals.
citing papers explorer
-
Talker-T2AV: Joint Talking Audio-Video Generation with Autoregressive Diffusion Modeling
Talker-T2AV achieves better lip-sync accuracy, video quality, and audio quality than dual-branch baselines by separating high-level shared autoregressive modeling from modality-specific low-level diffusion refinement in a joint audio-video generation framework.
-
AvatarPointillist: AutoRegressive 4D Gaussian Avatarization
AvatarPointillist autoregressively generates adaptive 3D point clouds via Transformer for photorealistic 4D Gaussian avatars from one image, jointly predicting animation bindings and using a conditioned Gaussian decoder.
-
UIKA: Fast Universal Head Avatar from Pose-Free Images
UIKA is a feed-forward animatable Gaussian head model using UV-guided correspondence estimation and learnable UV tokens with dual-level attention, trained on large-scale synthetic data to handle pose-free inputs.
-
ViBES: A Conversational Agent with Behaviorally-Intelligent 3D Virtual Body
ViBES introduces a speech-language-behavior model using modality-specific transformer experts that jointly generates dialogue and 3D body actions, showing gains over separate co-speech and text-to-motion baselines on multi-turn metrics.
-
Omni-Attribute: Open-vocabulary Attribute Encoder for Visual Concept Personalization
Omni-Attribute is a new open-vocabulary image attribute encoder trained on semantically linked pairs with dual objectives to produce disentangled representations for personalization and compositional generation.
-
Unmasking Puppeteers: Leveraging Biometric Leakage to Expose Impersonation in AI-Based Videoconferencing
A pose-conditioned large-margin contrastive encoder isolates persistent biometric identity cues from transmitted latents in talking-head videoconferencing to flag impersonation attacks via cosine similarity without inspecting the output video.
-
FluentAvatar: Flicker-Free Talking-Head Animation via Phoneme-Guided Autoregressive Modeling
Phoneme-guided autoregressive framework for talking-head animation that reduces inter-frame flicker via causal keyframe generation and timestamp-aware interpolation, outperforming diffusion baselines on FVD and a new BG-Flicker metric.
-
Durian: Dual Reference Image-Guided Portrait Animation with Attribute Transfer
Durian introduces a dual-reference diffusion model trained via self-reconstruction on video frames to enable cross-identity attribute transfer in portrait animations, supporting multi-attribute composition and interpolation.
-
TOPOS: High-Fidelity and Efficient Industry-Grade 3D Head Generation
TOPOS creates high-fidelity 3D heads with fixed industry topology from single images via a specialized VAE with Perceiver Resampler and a rectified flow transformer.
-
The Alpha Blending Hypothesis: Compositing Shortcut in Deepfake Detection
Deepfake detectors act as alpha blending searchers; training solely on self-blended real images yields top cross-dataset generalization on 15 datasets without using synthetic deepfakes.
-
SocialDirector: Training-Free Social Interaction Control for Multi-Person Video Generation
SocialDirector uses spatiotemporal actor masking and directional reweighting on cross-attention maps to reduce actor-action mismatches and improve target-directed interactions in generated multi-person videos.
-
MeshLAM: Feed-Forward One-Shot Animatable Textured Mesh Avatar Reconstruction
MeshLAM reconstructs high-fidelity animatable textured mesh head avatars from a single image via a feed-forward dual shape-texture architecture with iterative GRU decoding and reprojection-based guidance.
-
ARGen: Affect-Reinforced Generative Augmentation towards Vision-based Dynamic Emotion Perception
ARGen generates high-fidelity dynamic facial expression videos using affective semantic injection and adaptive reinforcement diffusion to improve emotion recognition models facing data scarcity and long-tail distributions.
-
JAM-Flow: Joint Audio-Motion Synthesis with Flow Matching
JAM-Flow introduces a unified flow-matching model with a Multi-Modal Diffusion Transformer that jointly synthesizes facial motion and speech from text, audio, or motion inputs.
-
EasyVFX: Frequency-Driven Decoupling for Resource-Efficient VFX Generation
EasyVFX decouples VFX generation via frequency-aware Mixture-of-Experts and test-time training to achieve realistic effects with limited resources.
-
PortraitDirector: A Hierarchical Disentanglement Framework for Controllable and Real-time Facial Reenactment
PortraitDirector uses hierarchical disentanglement of spatial physical motions and semantic emotions to deliver controllable, high-fidelity real-time facial reenactment at 20 FPS.
-
JoyVASA: Portrait and Animal Image Animation with Diffusion-Based Audio-Driven Facial Dynamics and Head Motion Generation
JoyVASA decouples static 3D facial representations from identity-independent dynamic motion sequences generated by a diffusion transformer to produce audio-driven animations for humans and animals.
- OpenWorldLib: A Unified Codebase and Definition of Advanced World Models