NAVA proposes native audio-visual alignment via Align-then-Fuse MMDiT and Timbre-in-Context Conditioning for joint audio-video generation with improved synchronization and timbre control.
Dreamid-omni: Unified framework for controllable human-centric audio-video generation
6 Pith papers cite this work. Polarity classification is still indexing.
years: 2026 (6)
verdicts: UNVERDICTED (6)
roles: background (1)
polarities: background (1)
representative citing papers
AVBench is a benchmark for human-centric AV generation evaluation featuring ten fine-grained dimensions and preference-learned evaluators that output continuous probabilistic scores from binary decisions.
MSAVBench is the first comprehensive benchmark for multi-shot audio-video generation featuring four dimensions, challenging scenarios, and an adaptive hybrid evaluation framework that achieves 91.5% Spearman correlation with human judgments.
MTAVG-Bench 2.0 is a new benchmark that evaluates omni LLMs on diagnosing high-level cinematic failures in multi-talker audio-video generation using a taxonomy of acting, narrative, atmosphere, and audio-visual language.
CoInteract adds a human-aware mixture-of-experts and spatially-structured co-generation to a diffusion transformer to synthesize videos with stable structures and physically plausible human-object contacts.
Omni-Customizer proposes an end-to-end framework using Omni-Context Fusion, Masked TTS Cross-Attention, Semantic-Anchored Multimodal RoPE, and specialized training curricula to achieve precise multimodal identity binding in joint audio-video generation.