hub

Hallo2: Long-duration and high- resolution audio-driven portrait image animation.arXiv preprint arXiv:2410.07718

· 2024 · arXiv 2410.07718

10 Pith papers cite this work. Polarity classification is still indexing.

10 Pith papers citing it

read on arXiv browse 10 citing papers

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 4

citation-polarity summary

background 4

representative citing papers

Hallo-Live: Real-Time Streaming Joint Audio-Video Avatar Generation with Asynchronous Dual-Stream and Human-Centric Preference Distillation

cs.CV · 2026-04-26 · unverdicted · novelty 7.0

Hallo-Live achieves 20.38 FPS real-time text-to-audio-video avatar generation with 0.94s latency using asynchronous dual-stream diffusion and HP-DMD preference distillation, matching teacher model quality at 16x higher throughput.

Talker-T2AV: Joint Talking Audio-Video Generation with Autoregressive Diffusion Modeling

cs.CV · 2026-04-26 · unverdicted · novelty 7.0

Talker-T2AV achieves better lip-sync accuracy, video quality, and audio quality than dual-branch baselines by separating high-level shared autoregressive modeling from modality-specific low-level diffusion refinement in a joint audio-video generation framework.

FluentAvatar: Flicker-Free Talking-Head Animation via Phoneme-Guided Autoregressive Modeling

cs.CV · 2025-09-15 · unverdicted · novelty 7.0

Phoneme-guided autoregressive framework for talking-head animation that reduces inter-frame flicker via causal keyframe generation and timestamp-aware interpolation, outperforming diffusion baselines on FVD and a new BG-Flicker metric.

AsymTalker: Identity-Consistent Long-Term Talking Head Generation via Asymmetric Distillation

cs.LG · 2026-05-01 · unverdicted · novelty 6.0 · 2 refs

AsymTalker uses temporal reference encoding and asymmetric knowledge distillation to produce identity-consistent talking head videos up to 600 seconds long at 66 FPS.

EAD-Net: Emotion-Aware Talking Head Generation with Spatial Refinement and Temporal Coherence

cs.CV · 2026-04-25 · unverdicted · novelty 6.0

EAD-Net uses a diffusion model with new spatio-temporal attention, graph-based temporal reasoning, and LLM-derived semantic descriptions to generate emotionally expressive talking head videos with improved lip-sync and coherence over prior methods.

MeshLAM: Feed-Forward One-Shot Animatable Textured Mesh Avatar Reconstruction

cs.CV · 2026-04-23 · unverdicted · novelty 6.0

MeshLAM reconstructs high-fidelity animatable textured mesh head avatars from a single image via a feed-forward dual shape-texture architecture with iterative GRU decoding and reprojection-based guidance.

SyncBreaker:Stage-Aware Multimodal Adversarial Attacks on Audio-Driven Talking Head Generation

cs.CV · 2026-04-09 · unverdicted · novelty 6.0

SyncBreaker jointly attacks image and audio streams with Multi-Interval Sampling and Cross-Attention Fooling to degrade speech-driven talking head generation more than single-modality baselines.

Multimodal Diffusion Transformer with Memory Bank for Scalable Long-Duration Talking Video Generation

cs.CV · 2024-11-24 · unverdicted · novelty 6.0

LetsTalk combines a multimodal diffusion transformer, noise-regularized memory bank, deep compression autoencoder, and symbiotic/direct fusion schemes to achieve state-of-the-art quality and efficiency in long-duration talking video generation.

JoyVASA: Portrait and Animal Image Animation with Diffusion-Based Audio-Driven Facial Dynamics and Head Motion Generation

cs.CV · 2024-11-14 · unverdicted · novelty 5.0

JoyVASA decouples static 3D facial representations from identity-independent dynamic motion sequences generated by a diffusion transformer to produce audio-driven animations for humans and animals.

Image-to-Video Diffusion: From Foundations to Open Frontiers

cs.CV · 2026-05-17 · unverdicted · novelty 3.0

A survey that organizes diffusion image-to-video methods into a taxonomy, distills core designs in condition encoding, temporal modeling, noise prior, and upsampling, and discusses applications plus challenges.

citing papers explorer

Showing 10 of 10 citing papers.

Hallo-Live: Real-Time Streaming Joint Audio-Video Avatar Generation with Asynchronous Dual-Stream and Human-Centric Preference Distillation cs.CV · 2026-04-26 · unverdicted · none · ref 9
Hallo-Live achieves 20.38 FPS real-time text-to-audio-video avatar generation with 0.94s latency using asynchronous dual-stream diffusion and HP-DMD preference distillation, matching teacher model quality at 16x higher throughput.
Talker-T2AV: Joint Talking Audio-Video Generation with Autoregressive Diffusion Modeling cs.CV · 2026-04-26 · unverdicted · none · ref 3
Talker-T2AV achieves better lip-sync accuracy, video quality, and audio quality than dual-branch baselines by separating high-level shared autoregressive modeling from modality-specific low-level diffusion refinement in a joint audio-video generation framework.
FluentAvatar: Flicker-Free Talking-Head Animation via Phoneme-Guided Autoregressive Modeling cs.CV · 2025-09-15 · unverdicted · none · ref 4
Phoneme-guided autoregressive framework for talking-head animation that reduces inter-frame flicker via causal keyframe generation and timestamp-aware interpolation, outperforming diffusion baselines on FVD and a new BG-Flicker metric.
AsymTalker: Identity-Consistent Long-Term Talking Head Generation via Asymmetric Distillation cs.LG · 2026-05-01 · unverdicted · none · ref 3 · 2 links
AsymTalker uses temporal reference encoding and asymmetric knowledge distillation to produce identity-consistent talking head videos up to 600 seconds long at 66 FPS.
EAD-Net: Emotion-Aware Talking Head Generation with Spatial Refinement and Temporal Coherence cs.CV · 2026-04-25 · unverdicted · none · ref 9
EAD-Net uses a diffusion model with new spatio-temporal attention, graph-based temporal reasoning, and LLM-derived semantic descriptions to generate emotionally expressive talking head videos with improved lip-sync and coherence over prior methods.
MeshLAM: Feed-Forward One-Shot Animatable Textured Mesh Avatar Reconstruction cs.CV · 2026-04-23 · unverdicted · none · ref 9
MeshLAM reconstructs high-fidelity animatable textured mesh head avatars from a single image via a feed-forward dual shape-texture architecture with iterative GRU decoding and reprojection-based guidance.
SyncBreaker:Stage-Aware Multimodal Adversarial Attacks on Audio-Driven Talking Head Generation cs.CV · 2026-04-09 · unverdicted · none · ref 6
SyncBreaker jointly attacks image and audio streams with Multi-Interval Sampling and Cross-Attention Fooling to degrade speech-driven talking head generation more than single-modality baselines.
Multimodal Diffusion Transformer with Memory Bank for Scalable Long-Duration Talking Video Generation cs.CV · 2024-11-24 · unverdicted · none · ref 24
LetsTalk combines a multimodal diffusion transformer, noise-regularized memory bank, deep compression autoencoder, and symbiotic/direct fusion schemes to achieve state-of-the-art quality and efficiency in long-duration talking video generation.
JoyVASA: Portrait and Animal Image Animation with Diffusion-Based Audio-Driven Facial Dynamics and Head Motion Generation cs.CV · 2024-11-14 · unverdicted · none · ref 2
JoyVASA decouples static 3D facial representations from identity-independent dynamic motion sequences generated by a diffusion transformer to produce audio-driven animations for humans and animals.
Image-to-Video Diffusion: From Foundations to Open Frontiers cs.CV · 2026-05-17 · unverdicted · none · ref 74
A survey that organizes diffusion image-to-video methods into a taxonomy, distills core designs in condition encoding, temporal modeling, noise prior, and upsampling, and discusses applications plus challenges.

Hallo2: Long-duration and high- resolution audio-driven portrait image animation.arXiv preprint arXiv:2410.07718

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer