hub Canonical reference

OmniAvatar: Efficient audio-driven avatar video generation with adaptive body animation

URL https://arxiv · 2025 · arXiv 2506.18866

Canonical reference. 71% of citing Pith papers cite this work as background.

16 Pith papers citing it

Background 71% of classified citations

read on arXiv browse 16 citing papers

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 5 baseline 1 method 1

citation-polarity summary

background 5 baseline 1 use method 1

representative citing papers

TMD-Bench: A Multi-Level Evaluation Paradigm for Music-Dance Co-Generation

cs.SD · 2026-05-03 · unverdicted · novelty 7.0

TMD-Bench is a multi-level benchmark that measures music-dance co-generation quality including beat-level rhythmic synchronization, supported by a new dataset and Music Captioner, and shows commercial models lag in rhythm while a new baseline performs competitively.

Hallo-Live: Real-Time Streaming Joint Audio-Video Avatar Generation with Asynchronous Dual-Stream and Human-Centric Preference Distillation

cs.CV · 2026-04-26 · unverdicted · novelty 7.0

Hallo-Live achieves 20.38 FPS real-time text-to-audio-video avatar generation with 0.94s latency using asynchronous dual-stream diffusion and HP-DMD preference distillation, matching teacher model quality at 16x higher throughput.

Beyond Monologue: Interactive Talking-Listening Avatar Generation with Conversational Audio Context-Aware Kernels

cs.AI · 2026-04-11 · unverdicted · novelty 7.0

Multi-head Gaussian kernels inject temporal scale discrepancy as inductive bias to enable full-duplex talking-listening avatar generation, supported by a new decoupled VoxHear dataset and claimed SOTA naturalness.

AsymTalker: Identity-Consistent Long-Term Talking Head Generation via Asymmetric Distillation

cs.LG · 2026-05-01 · unverdicted · novelty 6.0 · 2 refs

AsymTalker uses temporal reference encoding and asymmetric knowledge distillation to produce identity-consistent talking head videos up to 600 seconds long at 66 FPS.

Generate Your Talking Avatar from Video Reference

cs.CV · 2026-04-30 · unverdicted · novelty 6.0

TAVR generates high-fidelity talking avatars from cross-scene video references via token selection and three-stage training (same-scene pretraining, cross-scene fine-tuning, identity RL), outperforming baselines on a new 158-pair benchmark.

Mutual Forcing: Dual-Mode Self-Evolution for Fast Autoregressive Audio-Video Character Generation

cs.CV · 2026-04-28 · unverdicted · novelty 6.0

Mutual Forcing trains a single native autoregressive audio-video model with mutually reinforcing few-step and multi-step modes via self-distillation to match 50-step baselines at 4-8 steps.

CoInteract: Physically-Consistent Human-Object Interaction Video Synthesis via Spatially-Structured Co-Generation

cs.CV · 2026-04-21 · unverdicted · novelty 6.0

CoInteract adds a human-aware mixture-of-experts and spatially-structured co-generation to a diffusion transformer to synthesize videos with stable structures and physically plausible human-object contacts.

OmniHuman: A Large-scale Dataset and Benchmark for Human-Centric Video Generation

cs.CV · 2026-04-20 · unverdicted · novelty 6.0

OmniHuman is a new large-scale multi-scene dataset with video-, frame-, and individual-level annotations for human-centric video generation, accompanied by the OHBench benchmark that adds metrics aligned with human perception.

PianoFlow: Music-Aware Streaming Piano Motion Generation with Bimanual Coordination

cs.CV · 2026-04-14 · unverdicted · novelty 6.0

PianoFlow generates coordinated bimanual piano motions from audio via MIDI-distilled flow-matching, asymmetric role-gated interaction, and autoregressive streaming continuation, outperforming priors with 9x faster inference.

OmniShow: Unifying Multimodal Conditions for Human-Object Interaction Video Generation

cs.CV · 2026-04-13 · unverdicted · novelty 6.0

OmniShow unifies text, image, audio, and pose conditions into an end-to-end model for high-quality human-object interaction video generation and introduces the HOIVG-Bench benchmark, claiming state-of-the-art results.

Live Avatar: Streaming Real-time Audio-Driven Avatar Generation with Infinite Length

cs.CV · 2025-12-04 · conditional · novelty 6.0

Live Avatar enables 45 FPS real-time streaming infinite-length audio-driven avatar generation from a 14B diffusion model via distillation and timestep-forcing pipeline parallelism.

THEval. Evaluation Framework for Talking Head Video Generation

cs.CV · 2025-11-06 · conditional · novelty 6.0

THEval proposes eight metrics for evaluating talking head videos on quality, naturalness, and synchronization, tested on 85,000 videos from 17 models with a new curated dataset.

From Visual Synthesis to Interactive Worlds: Toward Production-Ready 3D Asset Generation

cs.GR · 2026-04-26 · unverdicted · novelty 5.0 · 2 refs

The paper surveys 3D asset generation methods and organizes them around the full production pipeline to assess which outputs meet engine-level requirements for interactive applications.

Tora3: Trajectory-Guided Audio-Video Generation with Physical Coherence

cs.CV · 2026-04-10 · unverdicted · novelty 5.0

Tora3 uses shared object trajectories as kinematic priors to jointly guide visual motion and acoustic events in audio-video generation, improving realism and synchronization.

EchoTorrent: Towards Swift, Sustained, and Streaming Multi-Modal Video Generation

cs.CV · 2026-02-14 · unverdicted · novelty 4.0

EchoTorrent combines multi-teacher distillation, adaptive CFG calibration, hybrid long-tail forcing, and VAE decoder refinement to enable few-pass autoregressive streaming video generation with improved temporal consistency and audio-lip sync.

Image-to-Video Diffusion: From Foundations to Open Frontiers

cs.CV · 2026-05-17 · unverdicted · novelty 3.0

A survey that organizes diffusion image-to-video methods into a taxonomy, distills core designs in condition encoding, temporal modeling, noise prior, and upsampling, and discusses applications plus challenges.

citing papers explorer

Showing 16 of 16 citing papers.

TMD-Bench: A Multi-Level Evaluation Paradigm for Music-Dance Co-Generation cs.SD · 2026-05-03 · unverdicted · none · ref 5
TMD-Bench is a multi-level benchmark that measures music-dance co-generation quality including beat-level rhythmic synchronization, supported by a new dataset and Music Captioner, and shows commercial models lag in rhythm while a new baseline performs competitively.
Hallo-Live: Real-Time Streaming Joint Audio-Video Avatar Generation with Asynchronous Dual-Stream and Human-Centric Preference Distillation cs.CV · 2026-04-26 · unverdicted · none · ref 11
Hallo-Live achieves 20.38 FPS real-time text-to-audio-video avatar generation with 0.94s latency using asynchronous dual-stream diffusion and HP-DMD preference distillation, matching teacher model quality at 16x higher throughput.
Beyond Monologue: Interactive Talking-Listening Avatar Generation with Conversational Audio Context-Aware Kernels cs.AI · 2026-04-11 · unverdicted · none · ref 7
Multi-head Gaussian kernels inject temporal scale discrepancy as inductive bias to enable full-duplex talking-listening avatar generation, supported by a new decoupled VoxHear dataset and claimed SOTA naturalness.
AsymTalker: Identity-Consistent Long-Term Talking Head Generation via Asymmetric Distillation cs.LG · 2026-05-01 · unverdicted · none · ref 7 · 2 links
AsymTalker uses temporal reference encoding and asymmetric knowledge distillation to produce identity-consistent talking head videos up to 600 seconds long at 66 FPS.
Generate Your Talking Avatar from Video Reference cs.CV · 2026-04-30 · unverdicted · none · ref 17
TAVR generates high-fidelity talking avatars from cross-scene video references via token selection and three-stage training (same-scene pretraining, cross-scene fine-tuning, identity RL), outperforming baselines on a new 158-pair benchmark.
Mutual Forcing: Dual-Mode Self-Evolution for Fast Autoregressive Audio-Video Character Generation cs.CV · 2026-04-28 · unverdicted · none · ref 12
Mutual Forcing trains a single native autoregressive audio-video model with mutually reinforcing few-step and multi-step modes via self-distillation to match 50-step baselines at 4-8 steps.
CoInteract: Physically-Consistent Human-Object Interaction Video Synthesis via Spatially-Structured Co-Generation cs.CV · 2026-04-21 · unverdicted · none · ref 9
CoInteract adds a human-aware mixture-of-experts and spatially-structured co-generation to a diffusion transformer to synthesize videos with stable structures and physically plausible human-object contacts.
OmniHuman: A Large-scale Dataset and Benchmark for Human-Centric Video Generation cs.CV · 2026-04-20 · unverdicted · none · ref 13
OmniHuman is a new large-scale multi-scene dataset with video-, frame-, and individual-level annotations for human-centric video generation, accompanied by the OHBench benchmark that adds metrics aligned with human perception.
PianoFlow: Music-Aware Streaming Piano Motion Generation with Bimanual Coordination cs.CV · 2026-04-14 · unverdicted · none · ref 17
PianoFlow generates coordinated bimanual piano motions from audio via MIDI-distilled flow-matching, asymmetric role-gated interaction, and autoregressive streaming continuation, outperforming priors with 9x faster inference.
OmniShow: Unifying Multimodal Conditions for Human-Object Interaction Video Generation cs.CV · 2026-04-13 · unverdicted · none · ref 18
OmniShow unifies text, image, audio, and pose conditions into an end-to-end model for high-quality human-object interaction video generation and introduces the HOIVG-Bench benchmark, claiming state-of-the-art results.
Live Avatar: Streaming Real-time Audio-Driven Avatar Generation with Infinite Length cs.CV · 2025-12-04 · conditional · none · ref 13
Live Avatar enables 45 FPS real-time streaming infinite-length audio-driven avatar generation from a 14B diffusion model via distillation and timestep-forcing pipeline parallelism.
THEval. Evaluation Framework for Talking Head Video Generation cs.CV · 2025-11-06 · conditional · none · ref 3
THEval proposes eight metrics for evaluating talking head videos on quality, naturalness, and synchronization, tested on 85,000 videos from 17 models with a new curated dataset.
From Visual Synthesis to Interactive Worlds: Toward Production-Ready 3D Asset Generation cs.GR · 2026-04-26 · unverdicted · none · ref 234 · 2 links
The paper surveys 3D asset generation methods and organizes them around the full production pipeline to assess which outputs meet engine-level requirements for interactive applications.
Tora3: Trajectory-Guided Audio-Video Generation with Physical Coherence cs.CV · 2026-04-10 · unverdicted · none · ref 13
Tora3 uses shared object trajectories as kinematic priors to jointly guide visual motion and acoustic events in audio-video generation, improving realism and synchronization.
EchoTorrent: Towards Swift, Sustained, and Streaming Multi-Modal Video Generation cs.CV · 2026-02-14 · unverdicted · none · ref 11
EchoTorrent combines multi-teacher distillation, adaptive CFG calibration, hybrid long-tail forcing, and VAE decoder refinement to enable few-pass autoregressive streaming video generation with improved temporal consistency and audio-lip sync.
Image-to-Video Diffusion: From Foundations to Open Frontiers cs.CV · 2026-05-17 · unverdicted · none · ref 79
A survey that organizes diffusion image-to-video methods into a taxonomy, distills core designs in condition encoding, temporal modeling, noise prior, and upsampling, and discusses applications plus challenges.

OmniAvatar: Efficient audio-driven avatar video generation with adaptive body animation

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer