AniPortrait: Audio-driven synthesis of photorealistic portrait animation

· 2024 · arXiv 2403.17694

17 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

background: 3 · dataset: 1

citation-polarity summary

background: 3 · use dataset: 1

representative citing papers

Giving Faces Their Feelings Back: Explicit Emotion Control for Feedforward Single-Image 3D Head Avatars

cs.CV · 2026-04-16 · unverdicted · novelty 7.0

A dual-path modulation technique injects independent emotion control into existing feed-forward single-image 3D head avatar pipelines while preserving reconstruction quality.

AvatarPointillist: AutoRegressive 4D Gaussian Avatarization

cs.CV · 2026-04-06 · unverdicted · novelty 7.0

AvatarPointillist autoregressively generates adaptive 3D point clouds via Transformer for photorealistic 4D Gaussian avatars from one image, jointly predicting animation bindings and using a conditioned Gaussian decoder.

UIKA: Fast Universal Head Avatar from Pose-Free Images

cs.CV · 2026-01-12 · conditional · novelty 7.0

UIKA is a feed-forward animatable Gaussian head model using UV-guided correspondence estimation and learnable UV tokens with dual-level attention, trained on large-scale synthetic data to handle pose-free inputs.

ViBES: A Conversational Agent with Behaviorally-Intelligent 3D Virtual Body

cs.CV · 2025-12-16 · unverdicted · novelty 7.0

ViBES introduces a speech-language-behavior model using modality-specific transformer experts that jointly generates dialogue and 3D body actions, showing gains over separate co-speech and text-to-motion baselines on multi-turn metrics.

FluentAvatar: Flicker-Free Talking-Head Animation via Phoneme-Guided Autoregressive Modeling

cs.CV · 2025-09-15 · unverdicted · novelty 7.0

Phoneme-guided autoregressive framework for talking-head animation that reduces inter-frame flicker via causal keyframe generation and timestamp-aware interpolation, outperforming diffusion baselines on FVD and a new BG-Flicker metric.

AsymTalker: Identity-Consistent Long-Term Talking Head Generation via Asymmetric Distillation

cs.LG · 2026-05-01 · unverdicted · novelty 6.0 · 2 refs

AsymTalker uses temporal reference encoding and asymmetric knowledge distillation to produce identity-consistent talking head videos up to 600 seconds long at 66 FPS.

MMControl: Unified Multi-Modal Control for Joint Audio-Video Generation

cs.CV · 2026-04-21 · unverdicted · novelty 6.0

MMControl adds multi-modal controls for identity, timbre, pose, and layout to unified audio-video diffusion models via dual-stream injection and adjustable guidance scaling.

PianoFlow: Music-Aware Streaming Piano Motion Generation with Bimanual Coordination

cs.CV · 2026-04-14 · unverdicted · novelty 6.0

PianoFlow generates coordinated bimanual piano motions from audio via MIDI-distilled flow-matching, asymmetric role-gated interaction, and autoregressive streaming continuation, outperforming priors with 9x faster inference.

SyncBreaker: Stage-Aware Multimodal Adversarial Attacks on Audio-Driven Talking Head Generation

cs.CV · 2026-04-09 · unverdicted · novelty 6.0

SyncBreaker jointly attacks image and audio streams with Multi-Interval Sampling and Cross-Attention Fooling to degrade speech-driven talking head generation more than single-modality baselines.

JAM-Flow: Joint Audio-Motion Synthesis with Flow Matching

cs.CV · 2025-06-30 · unverdicted · novelty 6.0

JAM-Flow introduces a unified flow-matching model with a Multi-Modal Diffusion Transformer that jointly synthesizes facial motion and speech from text, audio, or motion inputs.

Multimodal Diffusion Transformer with Memory Bank for Scalable Long-Duration Talking Video Generation

cs.CV · 2024-11-24 · unverdicted · novelty 6.0

LetsTalk combines a multimodal diffusion transformer, noise-regularized memory bank, deep compression autoencoder, and symbiotic/direct fusion schemes to achieve state-of-the-art quality and efficiency in long-duration talking video generation.

Omni-Fake: Benchmarking Unified Multimodal Social Media Deepfake Detection

cs.CV · 2026-05-02 · unverdicted · novelty 5.0

Omni-Fake delivers a unified multimodal deepfake benchmark dataset and RL-driven detector that reports gains in accuracy, cross-modal generalization, and explainability over prior baselines.

PortraitDirector: A Hierarchical Disentanglement Framework for Controllable and Real-time Facial Reenactment

cs.CV · 2026-04-21 · unverdicted · novelty 5.0

PortraitDirector uses hierarchical disentanglement of spatial physical motions and semantic emotions to deliver controllable, high-fidelity real-time facial reenactment at 20 FPS.

TurboTalk: Progressive Distillation for One-Step Audio-Driven Talking Avatar Generation

cs.CV · 2026-04-16 · unverdicted · novelty 5.0

TurboTalk uses progressive distillation from 4 steps to 1 step with distribution matching and adversarial training to achieve 120x faster single-step audio-driven talking avatar video generation.

JoyVASA: Portrait and Animal Image Animation with Diffusion-Based Audio-Driven Facial Dynamics and Head Motion Generation

cs.CV · 2024-11-14 · unverdicted · novelty 5.0

JoyVASA decouples static 3D facial representations from identity-independent dynamic motion sequences generated by a diffusion transformer to produce audio-driven animations for humans and animals.

EchoTorrent: Towards Swift, Sustained, and Streaming Multi-Modal Video Generation

cs.CV · 2026-02-14 · unverdicted · novelty 4.0

EchoTorrent combines multi-teacher distillation, adaptive CFG calibration, hybrid long-tail forcing, and VAE decoder refinement to enable few-pass autoregressive streaming video generation with improved temporal consistency and audio-lip sync.

Image-to-Video Diffusion: From Foundations to Open Frontiers

cs.CV · 2026-05-17 · unverdicted · novelty 3.0

A survey that organizes diffusion image-to-video methods into a taxonomy, distills core designs in condition encoding, temporal modeling, noise prior, and upsampling, and discusses applications plus challenges.

citing papers explorer

Giving Faces Their Feelings Back: Explicit Emotion Control for Feedforward Single-Image 3D Head Avatars cs.CV · 2026-04-16 · unverdicted · none · ref 78
A dual-path modulation technique injects independent emotion control into existing feed-forward single-image 3D head avatar pipelines while preserving reconstruction quality.
AvatarPointillist: AutoRegressive 4D Gaussian Avatarization cs.CV · 2026-04-06 · unverdicted · none · ref 67
AvatarPointillist autoregressively generates adaptive 3D point clouds via Transformer for photorealistic 4D Gaussian avatars from one image, jointly predicting animation bindings and using a conditioned Gaussian decoder.
UIKA: Fast Universal Head Avatar from Pose-Free Images cs.CV · 2026-01-12 · conditional · none · ref 79
UIKA is a feed-forward animatable Gaussian head model using UV-guided correspondence estimation and learnable UV tokens with dual-level attention, trained on large-scale synthetic data to handle pose-free inputs.
ViBES: A Conversational Agent with Behaviorally-Intelligent 3D Virtual Body cs.CV · 2025-12-16 · unverdicted · none · ref 113
ViBES introduces a speech-language-behavior model using modality-specific transformer experts that jointly generates dialogue and 3D body actions, showing gains over separate co-speech and text-to-motion baselines on multi-turn metrics.
FluentAvatar: Flicker-Free Talking-Head Animation via Phoneme-Guided Autoregressive Modeling cs.CV · 2025-09-15 · unverdicted · none · ref 25
Phoneme-guided autoregressive framework for talking-head animation that reduces inter-frame flicker via causal keyframe generation and timestamp-aware interpolation, outperforming diffusion baselines on FVD and a new BG-Flicker metric.
AsymTalker: Identity-Consistent Long-Term Talking Head Generation via Asymmetric Distillation cs.LG · 2026-05-01 · unverdicted · none · ref 18 · 2 links
AsymTalker uses temporal reference encoding and asymmetric knowledge distillation to produce identity-consistent talking head videos up to 600 seconds long at 66 FPS.
MMControl: Unified Multi-Modal Control for Joint Audio-Video Generation cs.CV · 2026-04-21 · unverdicted · none · ref 19
MMControl adds multi-modal controls for identity, timbre, pose, and layout to unified audio-video diffusion models via dual-stream injection and adjustable guidance scaling.
PianoFlow: Music-Aware Streaming Piano Motion Generation with Bimanual Coordination cs.CV · 2026-04-14 · unverdicted · none · ref 64
PianoFlow generates coordinated bimanual piano motions from audio via MIDI-distilled flow-matching, asymmetric role-gated interaction, and autoregressive streaming continuation, outperforming priors with 9x faster inference.
SyncBreaker: Stage-Aware Multimodal Adversarial Attacks on Audio-Driven Talking Head Generation cs.CV · 2026-04-09 · unverdicted · none · ref 54
SyncBreaker jointly attacks image and audio streams with Multi-Interval Sampling and Cross-Attention Fooling to degrade speech-driven talking head generation more than single-modality baselines.
JAM-Flow: Joint Audio-Motion Synthesis with Flow Matching cs.CV · 2025-06-30 · unverdicted · none · ref 42
JAM-Flow introduces a unified flow-matching model with a Multi-Modal Diffusion Transformer that jointly synthesizes facial motion and speech from text, audio, or motion inputs.
Multimodal Diffusion Transformer with Memory Bank for Scalable Long-Duration Talking Video Generation cs.CV · 2024-11-24 · unverdicted · none · ref 2
LetsTalk combines a multimodal diffusion transformer, noise-regularized memory bank, deep compression autoencoder, and symbiotic/direct fusion schemes to achieve state-of-the-art quality and efficiency in long-duration talking video generation.
Omni-Fake: Benchmarking Unified Multimodal Social Media Deepfake Detection cs.CV · 2026-05-02 · unverdicted · none · ref 83
Omni-Fake delivers a unified multimodal deepfake benchmark dataset and RL-driven detector that reports gains in accuracy, cross-modal generalization, and explainability over prior baselines.
PortraitDirector: A Hierarchical Disentanglement Framework for Controllable and Real-time Facial Reenactment cs.CV · 2026-04-21 · unverdicted · none · ref 59
PortraitDirector uses hierarchical disentanglement of spatial physical motions and semantic emotions to deliver controllable, high-fidelity real-time facial reenactment at 20 FPS.
TurboTalk: Progressive Distillation for One-Step Audio-Driven Talking Avatar Generation cs.CV · 2026-04-16 · unverdicted · none · ref 27
TurboTalk uses progressive distillation from 4 steps to 1 step with distribution matching and adversarial training to achieve 120x faster single-step audio-driven talking avatar video generation.
JoyVASA: Portrait and Animal Image Animation with Diffusion-Based Audio-Driven Facial Dynamics and Head Motion Generation cs.CV · 2024-11-14 · unverdicted · none · ref 3
JoyVASA decouples static 3D facial representations from identity-independent dynamic motion sequences generated by a diffusion transformer to produce audio-driven animations for humans and animals.
EchoTorrent: Towards Swift, Sustained, and Streaming Multi-Modal Video Generation cs.CV · 2026-02-14 · unverdicted · none · ref 86
EchoTorrent combines multi-teacher distillation, adaptive CFG calibration, hybrid long-tail forcing, and VAE decoder refinement to enable few-pass autoregressive streaming video generation with improved temporal consistency and audio-lip sync.
Image-to-Video Diffusion: From Foundations to Open Frontiers cs.CV · 2026-05-17 · unverdicted · none · ref 77
A survey that organizes diffusion image-to-video methods into a taxonomy, distills core designs in condition encoding, temporal modeling, noise prior, and upsampling, and discusses applications plus challenges.
