pith. sign in

arxiv: 2605.08729 · v2 · pith:S27EYZAZnew · submitted 2026-05-09 · 💻 cs.CV · cs.GR· cs.MM· cs.SD

Unison: Harmonizing Motion, Speech, and Sound for Human-Centric Audio-Video Generation

Pith reviewed 2026-06-30 23:25 UTC · model grok-4.3

classification 💻 cs.CV cs.GRcs.MMcs.SD
keywords audio-video generationmultimodal harmonizationcross-modal synchronizationspeech and sound effectshuman-centric videodenoising schedules
0
0 comments X

The pith

Unison framework uses semantic-guided audio harmonization and bidirectional cross-modal forcing to align motion, speech, and environmental sound in generated videos.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Unison as a unified generation framework that addresses mismatches among motion, speech, and sound in human-centric videos. Existing models struggle because these modalities have different temporal structures, so speech often dominates or drifts out of sync with actions and effects. Unison decouples speech from sound effects inside the audio stream and applies semantic conditioning to recompose them, while a cross-modal forcing step lets cleaner signals steer noisier ones through separate denoising paths. If effective, this produces videos where lip movements match words, actions match impacts, and background sounds remain clear rather than muddy. A reader cares because realistic video synthesis requires these elements to cohere naturally instead of being generated in isolation.

Core claim

Unison achieves state-of-the-art performance in both audio perceptual quality and cross-modal synchronization by employing a semantic-guided harmonization strategy within the audio stream that decouples the generation of speech and sound-effect components, leveraging bidirectional audio cross-attention and semantic-conditioned gating for semantic-driven adaptive recomposition, and by proposing a bidirectional cross-modal forcing strategy for audio-motion synchronization where the cleaner modality guides the noisier one through decoupled denoising schedules reinforced by a progressive stabilization strategy.

What carries the argument

Semantic-guided harmonization strategy that decouples speech and sound via bidirectional audio cross-attention and semantic-conditioned gating, together with bidirectional cross-modal forcing that uses decoupled denoising schedules to let cleaner modalities guide noisier ones.

If this is right

  • Audio perceptual quality improves because speech no longer dominates mixed soundtracks.
  • Cross-modal synchronization improves because cleaner signals guide noisier ones during denoising.
  • Explicit multimodal harmonization becomes necessary for consistent human-centric video output.
  • Decoupled denoising schedules reduce drift between motion and audio tracks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same decoupling idea could be tested on longer video clips to check whether stabilization holds over time.
  • Applying the forcing mechanism to text-conditioned generation might reduce mismatches between captions and visuals.
  • The approach leaves open whether similar guidance rules would help in non-human scenes such as nature documentaries.

Load-bearing premise

The semantic-guided harmonization and bidirectional forcing strategies will produce coherent outputs across modalities without introducing new mismatches or requiring dataset-specific tuning beyond what is described.

What would settle it

Side-by-side comparison of Unison-generated videos against baselines on a held-out set, measuring whether lip-speech alignment errors or action-sound timing offsets increase rather than decrease.

Figures

Figures reproduced from arXiv: 2605.08729 by Chi Zhang, Jiaxu Zhang, Quanyue Song, Shansong Liu, Shihao Cheng, Xiaolei Zhang, Xuelong Li, Zhigang Tu, Zhizhi Guo.

Figure 1
Figure 1. Figure 1: Overview of the key challenges and our approach. Left: [PITH_FULL_IMAGE:figures/full_fig_p001_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Overview of Unison. Unison couples a video branch and an audio branch via bidirectional cross-attention. The audio branch employs a Semantic-Guided Har￾monization Strategy for independent speech and sound-effect generation, utilizing a Bidirectional Audio Cross-Attention (Bi-ACA) module to mutually refine speech and sound-effect features, effectively enhancing their respective clarity. At each interaction … view at source ↗
Figure 3
Figure 3. Figure 3: Bidirectional Cross-Modal Forcing strategy for audio-visual align [PITH_FULL_IMAGE:figures/full_fig_p008_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Qualitative comparison between Unison and the state-of-the-art methods, in￾cluding Universe-1 [37], UniAVGen [44] and MOVA [30]. to determine Perceptual Quality (PQ) and Content Usefulness (CU). To evalu￾ate speech-text alignment, we isolate vocal components via Mel-RoFormer [38] and compute the Word Error Rate (WER) using Whisper-large-v3 [26]. (3) For cross-modal consistency, we utilize CLAP [6] for audi… view at source ↗
Figure 5
Figure 5. Figure 5: Bidirectional Synthesis of Audio-to-Video and Video-to-Audio. acoustic components, including lip movements and impact transients. Our model maintains superior acoustic layering, ensuring intelligible speech without sup￾pressing salient environmental audio. Audio-to-Video and Video-to-Audio Generation. Unison leverages decoupled de￾noising schedules and bidirectional guidance to achieve precise modal transl… view at source ↗
Figure 6
Figure 6. Figure 6: Ablation experiments on the Semantic-Guided Audio Harmonization Strategy. w/o Bidirectional Cross-modal Forcing Strategy Speech Sound Effects Transcription Caption The city is so big, but where is our home? A young woman with long, vibrant blue hair is depicted in a medium profile shot, playing a digital piano on an outdoor balcony during the 'blue hour'. The background features a soft-focus urban cityscap… view at source ↗
Figure 7
Figure 7. Figure 7: Ablation experiments on the Bidirectional Cross-modal Forcing Strategy. rigorous T2AV and TI2AV assessment. As shown in [PITH_FULL_IMAGE:figures/full_fig_p012_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: Analysis of SCG gate behavior. (a) Layer-wise: gate polarization increases with model depth. (b) Timestep-wise: gate divergence intensifies as denoising pro￾gresses. (c) Instance-wise: mean gate values across semantic categories, demonstrat￾ing content-adaptive modulation. SCG mitigates the dominance of speech over subtle environmental textures via dynamic rebalancing. In sports broadcasting, the mechanism… view at source ↗
Figure 9
Figure 9. Figure 9: Results of the user study User Study. We conducted a user study with 10 video samples and 25 participants from diverse backgrounds, evaluating lip￾speech synchrony, speech-sound harmony, and motion-audio alignment (considering both speech and environmental sounds). Participants were required to rank shuf￾fled videos across different methods, in￾cluding UniAVGen [44], MOVA [30], and LTX-2 [10]. As shown in … view at source ↗
read the original abstract

Motion, speech, and sound effects are fundamental elements of human-centric videos, yet their heterogeneous temporal characteristics make joint generation highly challenging. Existing audio-video generation models often fail to maintain consistent alignment across these modalities, leading to noticeable mismatches between motion, speech, and environmental sounds. We present Unison, a unified framework that explicitly promotes coherence across the motion, speech, and sound modalities. Within the audio stream, Unison employs a semantic-guided harmonization strategy that decouples the generation of speech and sound-effect components. Leveraging bidirectional audio cross-attention and semantic-conditioned gating for semantic-driven adaptive recomposition, this approach effectively mitigates speech dominance and enhances acoustic clarity. For audio-motion synchronization, we propose a bidirectional cross-modal forcing strategy where the cleaner modality guides the noisier one through decoupled denoising schedules, reinforced by a progressive stabilization strategy. Extensive experiments demonstrate that Unison achieves state-of-the-art performance in both audio perceptual quality and cross-modal synchronization, highlighting the importance of explicit multimodal harmonization in human-centric video generation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 3 minor

Summary. The manuscript presents Unison, a unified framework for human-centric audio-video generation that jointly produces motion, speech, and sound effects. It introduces a semantic-guided harmonization strategy within the audio stream that decouples speech and sound-effect generation via bidirectional audio cross-attention and semantic-conditioned gating, and a bidirectional cross-modal forcing strategy that uses decoupled denoising schedules plus progressive stabilization to align audio with motion. The central claim is that these explicit harmonization mechanisms yield state-of-the-art performance in audio perceptual quality and cross-modal synchronization.

Significance. If the empirical results hold, the explicit decoupling and bidirectional guidance mechanisms address a recognized limitation in existing multimodal generators where heterogeneous temporal characteristics lead to mismatches. The work provides concrete, implementable strategies that could be adopted or extended in subsequent audio-video synthesis research.

minor comments (3)
  1. Abstract: the claim of 'state-of-the-art performance' would be strengthened by naming the primary quantitative metrics (e.g., FAD, SyncNet score) and the number of baselines compared.
  2. The manuscript would benefit from a short pseudocode block or diagram illustrating the interaction between the semantic-conditioned gating and the bidirectional cross-attention layers.
  3. Section 5 (experiments): ensure that all reported numbers include standard deviations across multiple random seeds or runs.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive summary, recognition of the significance of the explicit decoupling and bidirectional guidance mechanisms, and the recommendation for minor revision. No major comments were provided in the report.

Circularity Check

0 steps flagged

No significant circularity; empirical framework with independent mechanisms

full rationale

The paper introduces Unison as a unified framework employing semantic-guided harmonization (via bidirectional audio cross-attention and semantic-conditioned gating) and bidirectional cross-modal forcing (with decoupled denoising schedules and progressive stabilization). These are described as explicit strategies to address modality mismatches, with SOTA claims resting on extensive experiments rather than any derivation chain. No equations, fitted parameters renamed as predictions, self-citations as load-bearing premises, or ansatzes smuggled via prior work appear in the text. The central claims are presented as empirical outcomes of the proposed architecture, making the derivation self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only input yields no identifiable free parameters, axioms, or invented entities; all technical details are absent.

pith-pipeline@v0.9.1-grok · 5748 in / 925 out tokens · 20771 ms · 2026-06-30T23:25:09.402238+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. DART: Difficulty-Adaptive Routing for Zero-Shot Video Temporal Grounding

    cs.CV 2026-07 unverdicted novelty 7.0

    DART routes zero-shot video temporal grounding queries by difficulty using DPP entropy, achieving up to 3.5 mIoU gains with 7x fewer frames on Charades-STA and ActivityNet Captions.

  2. InteractiveAvatar: Real-Time Streaming Video Generation for Consistent and Intent-Aware Avatars

    cs.CV 2026-06 unverdicted novelty 6.0

    InteractiveAvatar uses autoregressive distillation, Long-Short Visual Memory, and a Reasoning-Reaction Module to enable real-time, consistent, intent-aware avatar video streaming.

  3. InteractiveAvatar: Real-Time Streaming Video Generation for Consistent and Intent-Aware Avatars

    cs.CV 2026-06 unverdicted novelty 6.0

    InteractiveAvatar is a real-time infinite-streaming avatar video generation system using autoregressive distillation, Long-Short Visual Memory for consistency, and a Reasoning-Reaction Module for intent-aware interactions.

Reference graph

Works this paper leans on

50 extracted references · 30 canonical work pages · cited by 2 Pith papers · 9 internal anchors

  1. [1]

    YouTube-8M: A Large-Scale Video Classification Benchmark

    Abu-El-Haija, S., Kothari, N., Lee, J., Natsev, A.P., Toderici, G., Varadarajan, B., Vijayanarasimhan, S.: Youtube-8m: A large-scale video classification benchmark. In: arXiv:1609.08675 (2016),https://arxiv.org/pdf/1609.08675v1.pdf

  2. [2]

    An, H., Hu, W., Huang, S., Huang, S., Li, R., Liang, Y., Shao, J., Song, Y., Wang, Z., Yuan, C., Zhang, C., Zhang, H., Zhuang, W., Li, X.: Ai flow: Perspectives, scenarios, and approaches (2025),https://arxiv.org/abs/2506.12479

  3. [3]

    Chen, B., Monso, D.M., Du, Y., Simchowitz, M., Tedrake, R., Sitzmann, V.: Dif- fusion forcing: Next-token prediction meets full-sequence diffusion (2024),https: //arxiv.org/abs/2407.01392

  4. [4]

    In: International Conference on Acoustics, Speech, and Signal Processing (ICASSP) (2020)

    Chen, H., Xie, W., Vedaldi, A., Zisserman, A.: Vggsound: A large-scale audio-visual dataset. In: International Conference on Acoustics, Speech, and Signal Processing (ICASSP) (2020)

  5. [5]

    In: CVPR (2025)

    Cheng, H.K., Ishii, M., Hayakawa, A., Shibuya, T., Schwing, A., Mitsufuji, Y.: MMAudio: Taming multimodal joint training for high-quality video-to-audio syn- thesis. In: CVPR (2025)

  6. [6]

    In: ICASSP 2023-2023 IEEE Interna- tional Conference on Acoustics, Speech and Signal Processing (ICASSP)

    Elizalde, B., Deshmukh, S., Al Ismail, M., Wang, H.: Clap learning audio con- cepts from natural language supervision. In: ICASSP 2023-2023 IEEE Interna- tional Conference on Acoustics, Speech and Signal Processing (ICASSP). pp. 1–5. IEEE (2023)

  7. [7]

    In: Proc

    Gemmeke, J.F., Ellis, D.P.W., Freedman, D., Jansen, A., Lawrence, W., Moore, R.C., Plakal, M., Ritter, M.: Audio set: An ontology and human-labeled dataset for audio events. In: Proc. IEEE ICASSP 2017. New Orleans, LA (2017)

  8. [8]

    In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

    Girdhar, R., El-Nouby, A., Liu, Z., Singh, M., Alwala, K.V., Joulin, A., Misra, I.: Imagebind: One embedding space to bind them all. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 15180– 15190 (2023)

  9. [9]

    Google DeepMind: Veo: A text-to-video generation system (2025),https:// storage.googleapis.com/deepmind-media/veo/Veo-3-Tech-Report.pdf

  10. [10]

    HaCohen, Y., Brazowski, B., Chiprut, N., Bitterman, Y., Kvochko, A., Berkowitz, A., Shalem, D., Lifschitz, D., Moshe, D., Porat, E., Richardson, E., Shiran, G., Chachy, I., Chetboun, J., Finkelson, M., Kupchick, M., Zabari, N., Guetta, N., Kotler, N., Bibi, O., Gordon, O., Panet, P., Benita, R., Armon, S., Kulikov, V., Inger,Y.,Shiftan,Y.,Melumian,Z.,Farb...

  11. [11]

    arXiv preprint arXiv:2511.21579 (2025)

    Hu, T., Yu, Z., Zhang, G., Su, Z., Zhou, Z., Zhang, Y., Zhou, Y., Lu, Q., Yi, R.: Harmony: Harmonizing audio and video generation through cross-task synergy. arXiv preprint arXiv:2511.21579 (2025)

  12. [12]

    Self Forcing: Bridging the Train-Test Gap in Autoregressive Video Diffusion

    Huang, X., Li, Z., He, G., Zhou, M., Shechtman, E.: Self forcing: Bridging the train-test gap in autoregressive video diffusion (2025),https://arxiv.org/abs/ 2506.08009 16 S. Cheng et al

  13. [13]

    In: ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

    Iashin, V., Xie, W., Rahtu, E., Zisserman, A.: Synchformer: Efficient synchroniza- tion from sparse cues. In: ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). pp. 5325–5329. IEEE (2024)

  14. [14]

    Vicinagearth1(1), 8 (2024)

    Jiang, W., Zhang, Y., Zheng, S., Liu, S., Yan, S.: Data augmentation in human- centric vision. Vicinagearth1(1), 8 (2024)

  15. [15]

    Li, H., Xu, M., Zhan, Y., Mu, S., Li, J., Cheng, K., Chen, Y., Chen, T., Ye, M., Wang, J., Zhu, S.: Openhumanvid: A large-scale high-quality dataset for enhancing human-centric video generation (2025),https://arxiv.org/abs/2412.00115

  16. [16]

    AsurveyonLLM-based multi-agent systems: workflow, infrastructure, and challenges.Vicinagearth1, 1 (Oct

    Li, X., Wang, S., Zeng, S., et al.: A survey on llm-based multi-agent systems: workflow, infrastructure, and challenges. Vicinagearth1(9) (2024).https://doi. org/10.1007/s44336-024-00009-2

  17. [17]

    IEEE Transactions on Neural Networks and Learn- ing Systems35(6), 8708–8714 (2024).https://doi.org/10.1109/TNNLS.2022

    Li, X.: Positive-incentive noise. IEEE Transactions on Neural Networks and Learn- ing Systems35(6), 8708–8714 (2024).https://doi.org/10.1109/TNNLS.2022. 3224577

  18. [18]

    Lipman, Y., Chen, R.T.Q., Ben-Hamu, H., Nickel, M., Le, M.: Flow matching for generative modeling (2023),https://arxiv.org/abs/2210.02747

  19. [19]

    Liu, H., Lan, G.L., Mei, X., Ni, Z., Kumar, A., Nagaraja, V., Wang, W., Plumbley, M.D., Shi, Y., Chandra, V.: Syncflow: Toward temporally aligned joint audio-video generation from text (2024),https://arxiv.org/abs/2412.15220

  20. [20]

    Javisdit: Joint audio-video diffusion transformer with hierarchical spatio-temporal prior synchronization.arXiv preprint arXiv:2503.23377, 2025

    Liu, K., Li, W., Chen, L., Wu, S., Zheng, Y., Ji, J., Zhou, F., Jiang, R., Luo, J., Fei, H., et al.: Javisdit: Joint audio-video diffusion transformer with hierarchical spatio-temporal prior synchronization. arXiv preprint arXiv:2503.23377 (2025)

  21. [21]

    Liu, K., Hu, W., Xu, J., Shan, Y., Lu, S.: Rolling forcing: Autoregressive long video diffusion in real time (2025),https://arxiv.org/abs/2509.25161

  22. [22]

    Ovi: Twin Backbone Cross-Modal Fusion for Audio-Video Generation

    Low, C., Wang, W., Katyal, C.: Ovi: Twin backbone cross-modal fusion for audio- video generation. arXiv preprint arXiv:2510.01284 (2025)

  23. [23]

    IEEE/ACM Transactions on Au- dio, Speech, and Language Processing pp

    Mei, X., Meng, C., Liu, H., Kong, Q., Ko, T., Zhao, C., Plumbley, M.D., Zou, Y., Wang, W.: WavCaps: A ChatGPT-assisted weakly-labelled audio captioning dataset for audio-language multimodal research. IEEE/ACM Transactions on Au- dio, Speech, and Language Processing pp. 1–15 (2024)

  24. [24]

    OpenAI: Sora 2 system card (2025),https://cdn.openai.com/pdf/50d5973c- c4ff-4c2d-986f-c72b5d0ff069/sora_2_system_card.pdf

  25. [25]

    In: Proceedings of the 28th ACM International Conference on Multimedia

    Prajwal, K.R., Mukhopadhyay, R., Namboodiri, V.P., Jawahar, C.: A lip sync expert is all you need for speech to lip generation in the wild. In: Proceedings of the 28th ACM International Conference on Multimedia. p. 484–492. MM ’20, ACM (Oct 2020).https://doi.org/10.1145/3394171.3413532,http://dx.doi.org/ 10.1145/3394171.3413532

  26. [26]

    In: International conference on machine learning

    Radford,A.,Kim,J.W.,Xu,T.,Brockman,G.,McLeavey,C.,Sutskever,I.:Robust speech recognition via large-scale weak supervision. In: International conference on machine learning. pp. 28492–28518. PMLR (2023)

  27. [27]

    In: CVPR (2023)

    Ruan, L., Ma, Y., Yang, H., He, H., Liu, B., Fu, J., Yuan, N.J., Jin, Q., Guo, B.: Mm-diffusion: Learning multi-modal diffusion models for joint audio and video generation. In: CVPR (2023)

  28. [28]

    Advances in neural information processing systems35, 25278–25294 (2022)

    Schuhmann, C., Beaumont, R., Vencu, R., Gordon, C., Wightman, R., Cherti, M., Coombes, T., Katta, A., Mullis, C., Wortsman, M., et al.: Laion-5b: An open large- scale dataset for training next generation image-text models. Advances in neural information processing systems35, 25278–25294 (2022)

  29. [29]

    Vicinagearth2(9) (2025).https://doi.org/10.1007/ s44336-025-00018-9 Abbreviated paper title 17

    Shen, Y., Zhang, D.: A survey of language-guided video object segmentation: from referring to reasoning. Vicinagearth2(9) (2025).https://doi.org/10.1007/ s44336-025-00018-9 Abbreviated paper title 17

  30. [30]

    Corresponding authors: Xie Chen and Xipeng Qiu

    SII-OpenMOSS Team, Yu, D., Chen, M., Chen, Q., Luo, Q., Wu, Q., Cheng, Q., Li, R., Liang, T., Zhang, W., Tu, W., Peng, X., Gao, Y., Huo, Y., Zhu, Y., Luo, Y., Zhang, Y., Song, Y., Xu, Z., Zhang, Z., Yang, C., Chang, C., Zhou, C., Chen, H., Ma, H., Li, J., Tong, J., Liu, J., Chen, K., Li, S., Wang, S., Jiang, W., Fei, Z., Ning, Z., Li, C., Li, C., He, Z., ...

  31. [31]

    DINOv3

    Siméoni, O., Vo, H.V., Seitzer, M., Baldassarre, F., Oquab, M., Jose, C., Khali- dov, V., Szafraniec, M., Yi, S., Ramamonjisoa, M., et al.: Dinov3. arXiv preprint arXiv:2508.10104 (2025)

  32. [32]

    Song, K., Chen, B., Simchowitz, M., Du, Y., Tedrake, R., Sitzmann, V.: History- guided video diffusion (2025),https://arxiv.org/abs/2502.06764

  33. [33]

    Team, O., Yu, D., Chen, M., Chen, Q., Luo, Q., Wu, Q., Cheng, Q., Li, R., Liang, T., Zhang, W., Tu, W., Peng, X., Gao, Y., Huo, Y., Zhu, Y., Luo, Y., Zhang, Y., Song, Y., Xu, Z., Zhang, Z., Yang, C., Chang, C., Zhou, C., Chen, H., Ma, H., Li, J., Tong, J., Liu, J., Chen, K., Li, S., Jiang, S., Wang, S., Jiang, W., Fei, Z., Ning, Z., Li, C., Li, C., He, Z....

  34. [34]

    In: Proceedings of the Computer Vision and Pattern Recognition Con- ference

    Tian, Z., Liu, Z., Yuan, R., Pan, J., Liu, Q., Tan, X., Chen, Q., Xue, W., Guo, Y.: Vidmuse: A simple video-to-music generation framework with long-short-term modeling. In: Proceedings of the Computer Vision and Pattern Recognition Con- ference. pp. 18782–18793 (2025)

  35. [35]

    Audiobox: Unified audio generation with natural language prompts,

    Vyas, A., Shi, B., Le, M., Tjandra, A., Wu, Y.C., Guo, B., Zhang, J., Zhang, X., Adkins, R., Ngan, W., et al.: Audiobox: Unified audio generation with natural language prompts. arXiv preprint arXiv:2312.15821 (2023)

  36. [36]

    Wan: Open and Advanced Large-Scale Video Generative Models

    Wan, T., Wang, A., Ai, B., Wen, B., Mao, C., Xie, C.W., Chen, D., Yu, F., Zhao, H., Yang, J., et al.: Wan: Open and advanced large-scale video generative models. arXiv preprint arXiv:2503.20314 (2025)

  37. [37]

    Universe-1: Unified audio-video generation via stitching of experts.arXiv preprint arXiv:2509.06155, 2025

    Wang, D., Zuo, W., Li, A., Chen, L.H., Liao, X., Zhou, D., Yin, Z., Dai, X., Jiang, D., Yu, G.: Universe-1: Unified audio-video generation via stitching of experts. arXiv preprint arXiv:2509.06155 (2025)

  38. [38]

    Wang, J.C., Lu, W.T., Chen, J.: Mel-roformer for vocal separation and vocal melody transcription (2024),https://arxiv.org/abs/2409.04702

  39. [39]

    Wang, L.X.X., Zhang, H., Dong, C., Shan, Y.: Vfhq: A high-quality dataset and benchmark for video face super-resolution (2022),https://arxiv.org/abs/2205. 03409

  40. [40]

    Advances in Neural Information Processing Systems37, 65618–65642 (2024)

    Wang, W., Yang, Y.: Vidprom: A million-scale real prompt-gallery dataset for text- to-video diffusion models. Advances in Neural Information Processing Systems37, 65618–65642 (2024)

  41. [41]

    In: CVPR (2023)

    Yu, J., Zhu, H., Jiang, L., Loy, C.C., Cai, W., Wu, W.: CelebV-Text: A large-scale facial text-video dataset. In: CVPR (2023)

  42. [42]

    Cheng et al

    Yuan, R., Lin, H., Guo, S., Zhang, G., Pan, J., Zang, Y., Liu, H., Du, X., Du, X., Ye, Z., Zheng, T., Jiang, Z., Ma, Y., Liu, M., Yu, L., Tian, Z., Zhou, Z., Xue, L., Qu, X., Li, Y., Shen, T., Ma, Z., Wu, S., Zhan, J., Wang, C., Wang, Y., Zhou, X., Chi, X., Zhang, X., Yang, Z., Liang, Y., Wang, X., Liu, S., Mei, L., Li, P., Chen, Y., Lin, C., Chen, X., Xi...

  43. [43]

    Yuan, R., Lin, H., Guo, S., Zhang, G., Pan, J., Zang, Y., Liu, H., Liang, Y., Ma, W., Du, X., Du, X., Ye, Z., Zheng, T., Jiang, Z., Ma, Y., Liu, M., Tian, Z., Zhou, Z., Xue, L., Qu, X., Li, Y., Wu, S., Shen, T., Ma, Z., Zhan, J., Wang, C., Wang, Y., Chi, X., Zhang, X., Yang, Z., Wang, X., Liu, S., Mei, L., Li, P., Wang, J., Yu, J., Pang, G., Li, X., Wang,...

  44. [44]

    arXiv preprint arXiv:2511.03334 (2025)

    Zhang, G., Zhou, Z., Hu, T., Peng, Z., Zhang, Y., Chen, Y., Zhou, Y., Lu, Q., Wang, L.: Uniavgen: Unified audio and video generation with asymmetric cross- modal interactions. arXiv preprint arXiv:2511.03334 (2025)

  45. [45]

    IEEE Transactions on Pattern Analysis and Machine Intelli- gence47(9), 8313–8320 (2025).https://doi.org/10.1109/TPAMI.2025.3575295

    Zhang, H., Huang, S., Guo, Y., Li, X.: Variational positive-incentive noise: How noise benefits models. IEEE Transactions on Pattern Analysis and Machine Intelli- gence47(9), 8313–8320 (2025).https://doi.org/10.1109/TPAMI.2025.3575295

  46. [46]

    arXiv preprint arXiv:2412.16563 (2024)

    Zhang, X., Li, J., Zhang, J., Dang, Z., Ren, J., Bo, L., Tu, Z.: Semtalk: Holistic co-speech motion generation with frame-level semantic emphasis. arXiv preprint arXiv:2412.16563 (2024)

  47. [47]

    Zhang, X., Li, J., Zhang, J., Ren, J., Bo, L., Tu, Z.: Echomask: Speech-queried attention-based mask modeling for holistic co-speech motion generation (2025), https://arxiv.org/abs/2504.09209

  48. [48]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

    Zhang, Z., Li, L., Ding, Y., Fan, C.: Flow-guided one-shot talking face generation with a high-resolution audio-visual dataset. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 3661–3670 (2021)

  49. [49]

    Zhao, L., Feng, L., Ge, D., Chen, R., Yi, F., Zhang, C., Zhang, X.L., Li, X.: Uni- form: A unified multi-task diffusion transformer for audio-video generation (2025), https://arxiv.org/abs/2502.03897

  50. [50]

    ZipVoice: Fast and High-Quality Zero-Shot Text-to-Speech with Flow Matching.arXiv preprint arXiv:2506.13053, 2025

    Zhu, H., Kang, W., Yao, Z., Guo, L., Kuang, F., Li, Z., Zhuang, W., Lin, L., Povey, D.: Zipvoice: Fast and high-quality zero-shot text-to-speech with flow matching. arXiv preprint arXiv:2506.13053 (2025)