pith. machine review for the scientific record.

arxiv: 2605.08729 · v1 · submitted 2026-05-09 · 💻 cs.CV · cs.GR · cs.MM · cs.SD


Unison: Harmonizing Motion, Speech, and Sound for Human-Centric Audio-Video Generation

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 02:23 UTC · model grok-4.3

classification 💻 cs.CV · cs.GR · cs.MM · cs.SD
keywords audio-video generation · multimodal harmonization · cross-modal synchronization · speech and sound effects · denoising schedules · human-centric video · semantic-guided strategy

The pith

Unison is a unified framework that harmonizes motion, speech, and sound in human-centric video generation through explicit multimodal strategies.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Unison as a framework for creating videos that combine human motion, spoken dialogue, and environmental sounds without the mismatches common in prior models. It addresses the challenge of differing timing patterns across these elements by decoupling speech from sound effects in the audio generation process and using cross-attention to recompose them based on semantics. A separate forcing mechanism aligns the audio output with visual motion by having the less noisy signal guide the noisier one during generation steps. Experiments show gains in audio quality and timing accuracy over existing approaches. A reader would care because coherent multimodal output could make AI videos more usable for storytelling, training, or virtual environments.
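
To make the forcing idea concrete, here is a minimal sketch in the reviewer's own notation rather than the paper's: two diffusion-style branches denoise on decoupled schedules, and at each step whichever latent is currently less noisy is encoded as conditioning context for the other. The function names (`denoise_vid`, `encode_ctx`) and the linear form of the stabilization weight are illustrative assumptions, not details taken from the paper.

```python
def bidirectional_forcing(x_vid, x_aud, sched_vid, sched_aud,
                          denoise_vid, denoise_aud, encode_ctx, steps):
    """Illustrative joint sampling loop with decoupled denoising schedules.

    sched_vid[k] and sched_aud[k] give each branch's noise level at step k
    (1.0 = pure noise, 0.0 = clean); they may differ, which is the point of
    decoupling. At every step the currently-cleaner latent is encoded into
    a conditioning context for the noisier one, so guidance can flow in
    either direction over the course of sampling.
    """
    for k in range(steps):
        t_v, t_a = sched_vid[k], sched_aud[k]
        # Progressive stabilization (assumed form): guidance strength grows
        # as overall noise falls, so early steps are only weakly coupled.
        w = 1.0 - min(t_v, t_a)
        if t_a <= t_v:   # audio is cleaner -> audio guides video
            ctx = encode_ctx(x_aud)
            x_vid = denoise_vid(x_vid, t_v, context=ctx, guidance=w)
            x_aud = denoise_aud(x_aud, t_a)
        else:            # video is cleaner -> video guides audio
            ctx = encode_ctx(x_vid)
            x_aud = denoise_aud(x_aud, t_a, context=ctx, guidance=w)
            x_vid = denoise_vid(x_vid, t_v)
    return x_vid, x_aud
```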

Core claim

Unison is a unified framework that explicitly promotes coherence across the motion, speech, and sound modalities. Within the audio stream, it employs a semantic-guided harmonization strategy that decouples speech and sound-effect components using bidirectional audio cross-attention and semantic-conditioned gating to mitigate speech dominance and improve clarity. For audio-motion synchronization, it proposes a bidirectional cross-modal forcing strategy in which the cleaner modality guides the noisier one through decoupled denoising schedules reinforced by progressive stabilization. Extensive experiments demonstrate state-of-the-art performance in audio perceptual quality and cross-modal sync.
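
A hedged sketch of what the harmonization machinery could look like: two token streams (speech and sound effects) exchange information through paired cross-attention blocks, and a gate computed from a semantic embedding scales how much each stream absorbs from the other. The class name, the sigmoid gate, and the residual wiring are assumptions for illustration; the paper's Bi-ACA and semantic-conditioned gating modules may be structured differently.

```python
import torch
import torch.nn as nn

class BiAudioCrossAttention(nn.Module):
    """Sketch of bidirectional audio cross-attention with semantic-conditioned
    gating between a speech stream and a sound-effect stream (illustrative,
    not the paper's implementation)."""

    def __init__(self, dim: int, n_heads: int = 8):
        super().__init__()
        self.speech_from_sfx = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.sfx_from_speech = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        # Semantic-conditioned gate: a caption/semantic embedding decides how
        # strongly each stream absorbs the other, one way to keep speech from
        # drowning out environmental sound.
        self.gate = nn.Sequential(nn.Linear(dim, 2), nn.Sigmoid())

    def forward(self, speech: torch.Tensor, sfx: torch.Tensor, semantic: torch.Tensor):
        # speech, sfx: (B, T, dim) token streams; semantic: (B, dim) condition.
        sp_upd, _ = self.speech_from_sfx(speech, sfx, sfx)      # speech attends to SFX
        fx_upd, _ = self.sfx_from_speech(sfx, speech, speech)   # SFX attends to speech
        g = self.gate(semantic)                                 # (B, 2) per-stream gates
        g_sp, g_fx = g[:, 0:1, None], g[:, 1:2, None]           # broadcast over time/dim
        speech = speech + g_sp * sp_upd
        sfx = sfx + g_fx * fx_upd
        return speech, sfx
```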

What carries the argument

Semantic-guided harmonization strategy with bidirectional audio cross-attention and semantic-conditioned gating, together with bidirectional cross-modal forcing through decoupled denoising schedules.

If this is right

  • Reduces mismatches between human motion, spoken words, and background sounds in the final video.
  • Decouples and recomposes audio components to lessen speech dominance while preserving environmental effects.
  • Enables the cleaner signal to steer the noisier one during generation for tighter cross-modal timing.
  • Delivers measurable gains in perceptual audio quality and synchronization accuracy over prior unified models.
  • Shows that progressive stabilization during denoising helps maintain overall coherence.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The decoupling tactic could transfer to other generation tasks where one signal type tends to overpower others, such as text-to-image with added audio.
  • If the forcing strategy scales, it might reduce the need for separate post-processing stages in long-form video pipelines.
  • Applying the same bidirectional guidance principle to additional inputs like camera motion or lighting could yield even tighter scene consistency.
  • The emphasis on semantic gating suggests a general route for making diffusion-based multimodal models more controllable without extra labels.

Load-bearing premise

The assumption that the semantic-guided harmonization and bidirectional cross-modal forcing will reliably improve coherence without introducing new artifacts or requiring extensive post-hoc tuning on specific datasets.

What would settle it

A side-by-side human evaluation on held-out human-centric video clips where viewers consistently rate Unison outputs as having worse speech-sound balance or motion-audio timing than outputs from a strong baseline model.
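
If such a study were run as pairwise preferences, scoring it is straightforward; the sketch below assumes one binary judgment per viewer per clip per criterion and uses a two-sided binomial test as the significance check. The function and the example counts are hypothetical, not figures from the paper.

```python
from scipy.stats import binomtest

def score_pairwise_study(prefers_unison: list[bool], alpha: float = 0.05) -> dict:
    """Score a side-by-side preference study on one criterion (e.g.
    speech-sound balance or motion-audio timing). Each entry is True when a
    viewer preferred the Unison output over the baseline for one clip.
    Illustrative analysis only, not the paper's protocol."""
    n = len(prefers_unison)
    wins = sum(prefers_unison)
    test = binomtest(wins, n, p=0.5, alternative="two-sided")
    return {
        "win_rate": wins / n,
        "p_value": test.pvalue,
        "significant": test.pvalue < alpha,
    }

# Hypothetical usage: 25 viewers x 10 clips -> 250 judgments on one criterion.
print(score_pairwise_study([True] * 160 + [False] * 90))
```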

Figures

Figures reproduced from arXiv: 2605.08729 by Chi Zhang, Jiaxu Zhang, Quanyue Song, Shansong Liu, Shihao Cheng, Xiaolei Zhang, Xuelong Li, Zhigang Tu, Zhizhi Guo.

Figure 1: Overview of the key challenges and our approach.

Figure 2: Overview of Unison. Unison couples a video branch and an audio branch via bidirectional cross-attention. The audio branch employs a Semantic-Guided Harmonization Strategy for independent speech and sound-effect generation, using a Bidirectional Audio Cross-Attention (Bi-ACA) module to mutually refine speech and sound-effect features and enhance their respective clarity.

Figure 3: Bidirectional Cross-Modal Forcing strategy for audio-visual alignment.

Figure 4: Qualitative comparison between Unison and state-of-the-art methods, including Universe-1 [37], UniAVGen [44], and MOVA [30].

Figure 5: Bidirectional synthesis of audio-to-video and video-to-audio generation.

Figure 6: Ablation experiments on the Semantic-Guided Audio Harmonization Strategy.

Figure 7: Ablation experiments on the Bidirectional Cross-modal Forcing Strategy.

Figure 8: Analysis of SCG gate behavior. (a) Layer-wise: gate polarization increases with model depth. (b) Timestep-wise: gate divergence intensifies as denoising progresses. (c) Instance-wise: mean gate values across semantic categories, demonstrating content-adaptive modulation. SCG mitigates the dominance of speech over subtle environmental textures via dynamic rebalancing.

Figure 9: Results of the user study: 25 participants ranked 10 shuffled video samples on lip-speech synchrony, speech-sound harmony, and motion-audio alignment, comparing against UniAVGen [44], MOVA [30], and LTX-2 [10].
read the original abstract

Motion, speech, and sound effects are fundamental elements of human-centric videos, yet their heterogeneous temporal characteristics make joint generation highly challenging. Existing audio-video generation models often fail to maintain consistent alignment across these modalities, leading to noticeable mismatches between motion, speech, and environmental sounds. We present Unison, a unified framework that explicitly promotes coherence across the motion, speech, and sound modalities. Within the audio stream, Unison employs a semantic-guided harmonization strategy that decouples the generation of speech and sound-effect components. Leveraging bidirectional audio cross-attention and semantic-conditioned gating for semantic-driven adaptive recomposition, this approach effectively mitigates speech dominance and enhances acoustic clarity. For audio-motion synchronization, we propose a bidirectional cross-modal forcing strategy where the cleaner modality guides the noisier one through decoupled denoising schedules, reinforced by a progressive stabilization strategy. Extensive experiments demonstrate that Unison achieves state-of-the-art performance in both audio perceptual quality and cross-modal synchronization, highlighting the importance of explicit multimodal harmonization in human-centric video generation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The paper presents Unison, a unified framework for generating motion, speech, and sound effects in human-centric videos. It employs a semantic-guided harmonization strategy in the audio stream using bidirectional audio cross-attention and semantic-conditioned gating to decouple speech and sound-effect generation, mitigating speech dominance. For audio-motion synchronization, it proposes bidirectional cross-modal forcing with decoupled denoising schedules and a progressive stabilization strategy. The authors claim that extensive experiments show state-of-the-art performance in audio perceptual quality and cross-modal synchronization.

Significance. If the empirical claims hold, this contribution would be significant in the field of audio-video generation by providing explicit mechanisms for multimodal coherence, which could improve the quality of generated human-centric content and influence subsequent research on handling heterogeneous modalities.

major comments (1)
  1. Abstract: The abstract asserts state-of-the-art results in audio perceptual quality and cross-modal synchronization but provides no quantitative metrics, baselines, ablation studies, or error analysis to support this claim. The soundness of the central empirical claim cannot be verified without the full results and methods sections.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their review and for highlighting the need to substantiate the empirical claims. We address the major comment point by point below.

read point-by-point responses
  1. Referee: Abstract: The abstract asserts state-of-the-art results in audio perceptual quality and cross-modal synchronization but provides no quantitative metrics, baselines, ablation studies, or error analysis to support this claim. The soundness of the central empirical claim cannot be verified without the full results and methods sections.

    Authors: We agree that the abstract itself contains no quantitative metrics, baselines, or detailed analysis, which is standard due to length constraints. The full manuscript provides these in the Experiments section (including perceptual quality metrics for audio, synchronization metrics across modalities, comparisons against multiple baselines, ablation studies on the semantic-guided harmonization and bidirectional cross-modal forcing components, and supporting analysis). These sections directly support the state-of-the-art claims summarized in the abstract. The referee's concern regarding verifiability is therefore addressed by the complete paper. revision: no

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The paper describes a high-level unified framework consisting of independent design choices: semantic-guided harmonization (with bidirectional audio cross-attention and semantic-conditioned gating) inside the audio stream, plus bidirectional cross-modal forcing with decoupled denoising schedules for audio-motion sync. No equations, first-principles derivations, or quantitative predictions appear in the provided text. Claims of SOTA performance are tied directly to 'extensive experiments' rather than any internal reduction to fitted parameters or self-referential definitions. The strategies target stated modality heterogeneity without reducing to their own inputs by construction. This is the common case of a non-circular engineering paper.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available; no specific free parameters, axioms, or invented entities are detailed. The work appears to rely on standard deep learning assumptions (e.g., diffusion-style denoising) but these are not enumerated.

pith-pipeline@v0.9.0 · 5517 in / 1077 out tokens · 54427 ms · 2026-05-12T02:23:25.744550+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

50 extracted references · 50 canonical work pages · 4 internal anchors

  1. [1] Abu-El-Haija, S., Kothari, N., Lee, J., Natsev, A.P., Toderici, G., Varadarajan, B., Vijayanarasimhan, S.: YouTube-8M: A large-scale video classification benchmark. arXiv:1609.08675 (2016). https://arxiv.org/pdf/1609.08675v1.pdf

  2. [2] An, H., Hu, W., Huang, S., Huang, S., Li, R., Liang, Y., Shao, J., Song, Y., Wang, Z., Yuan, C., Zhang, C., Zhang, H., Zhuang, W., Li, X.: AI Flow: Perspectives, scenarios, and approaches (2025). https://arxiv.org/abs/2506.12479

  3. [3] Chen, B., Monso, D.M., Du, Y., Simchowitz, M., Tedrake, R., Sitzmann, V.: Diffusion Forcing: Next-token prediction meets full-sequence diffusion (2024). https://arxiv.org/abs/2407.01392

  4. [4] Chen, H., Xie, W., Vedaldi, A., Zisserman, A.: VGGSound: A large-scale audio-visual dataset. In: International Conference on Acoustics, Speech, and Signal Processing (ICASSP) (2020)

  5. [5] Cheng, H.K., Ishii, M., Hayakawa, A., Shibuya, T., Schwing, A., Mitsufuji, Y.: MMAudio: Taming multimodal joint training for high-quality video-to-audio synthesis. In: CVPR (2025)

  6. [6] Elizalde, B., Deshmukh, S., Al Ismail, M., Wang, H.: CLAP: Learning audio concepts from natural language supervision. In: ICASSP 2023. pp. 1–5. IEEE (2023)

  7. [7] Gemmeke, J.F., Ellis, D.P.W., Freedman, D., Jansen, A., Lawrence, W., Moore, R.C., Plakal, M., Ritter, M.: Audio Set: An ontology and human-labeled dataset for audio events. In: Proc. IEEE ICASSP 2017. New Orleans, LA (2017)

  8. [8] Girdhar, R., El-Nouby, A., Liu, Z., Singh, M., Alwala, K.V., Joulin, A., Misra, I.: ImageBind: One embedding space to bind them all. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 15180–15190 (2023)

  9. [9] Google DeepMind: Veo: A text-to-video generation system (2025). https://storage.googleapis.com/deepmind-media/veo/Veo-3-Tech-Report.pdf

  10. [10] HaCohen, Y., Brazowski, B., Chiprut, N., Bitterman, Y., Kvochko, A., Berkowitz, A., Shalem, D., Lifschitz, D., Moshe, D., Porat, E., Richardson, E., Shiran, G., Chachy, I., Chetboun, J., Finkelson, M., Kupchick, M., Zabari, N., Guetta, N., Kotler, N., Bibi, O., Gordon, O., Panet, P., Benita, R., Armon, S., Kulikov, V., Inger, Y., Shiftan, Y., Melumian, Z., Farb...

  11. [11] Hu, T., Yu, Z., Zhang, G., Su, Z., Zhou, Z., Zhang, Y., Zhou, Y., Lu, Q., Yi, R.: Harmony: Harmonizing audio and video generation through cross-task synergy. arXiv preprint arXiv:2511.21579 (2025)

  12. [12] Huang, X., Li, Z., He, G., Zhou, M., Shechtman, E.: Self Forcing: Bridging the train-test gap in autoregressive video diffusion (2025). https://arxiv.org/abs/2506.08009

  13. [13] Iashin, V., Xie, W., Rahtu, E., Zisserman, A.: Synchformer: Efficient synchronization from sparse cues. In: ICASSP 2024. pp. 5325–5329. IEEE (2024)

  14. [14] Jiang, W., Zhang, Y., Zheng, S., Liu, S., Yan, S.: Data augmentation in human-centric vision. Vicinagearth 1(1), 8 (2024)

  15. [15] Li, H., Xu, M., Zhan, Y., Mu, S., Li, J., Cheng, K., Chen, Y., Chen, T., Ye, M., Wang, J., Zhu, S.: OpenHumanVid: A large-scale high-quality dataset for enhancing human-centric video generation (2025). https://arxiv.org/abs/2412.00115

  16. [16] Li, X., Wang, S., Zeng, S., et al.: A survey on LLM-based multi-agent systems: workflow, infrastructure, and challenges. Vicinagearth 1(9) (2024). https://doi.org/10.1007/s44336-024-00009-2

  17. [17] Li, X.: Positive-incentive noise. IEEE Transactions on Neural Networks and Learning Systems 35(6), 8708–8714 (2024). https://doi.org/10.1109/TNNLS.2022.3224577

  18. [18] Lipman, Y., Chen, R.T.Q., Ben-Hamu, H., Nickel, M., Le, M.: Flow matching for generative modeling (2023). https://arxiv.org/abs/2210.02747

  19. [19] Liu, H., Lan, G.L., Mei, X., Ni, Z., Kumar, A., Nagaraja, V., Wang, W., Plumbley, M.D., Shi, Y., Chandra, V.: SyncFlow: Toward temporally aligned joint audio-video generation from text (2024). https://arxiv.org/abs/2412.15220

  20. [20] Liu, K., Li, W., Chen, L., Wu, S., Zheng, Y., Ji, J., Zhou, F., Jiang, R., Luo, J., Fei, H., et al.: JavisDiT: Joint audio-video diffusion transformer with hierarchical spatio-temporal prior synchronization. arXiv preprint arXiv:2503.23377 (2025)

  21. [21] Liu, K., Hu, W., Xu, J., Shan, Y., Lu, S.: Rolling Forcing: Autoregressive long video diffusion in real time (2025). https://arxiv.org/abs/2509.25161

  22. [22] Low, C., Wang, W., Katyal, C.: Ovi: Twin backbone cross-modal fusion for audio-video generation. arXiv preprint arXiv:2510.01284 (2025)

  23. [23] Mei, X., Meng, C., Liu, H., Kong, Q., Ko, T., Zhao, C., Plumbley, M.D., Zou, Y., Wang, W.: WavCaps: A ChatGPT-assisted weakly-labelled audio captioning dataset for audio-language multimodal research. IEEE/ACM Transactions on Audio, Speech, and Language Processing, pp. 1–15 (2024)

  24. [24] OpenAI: Sora 2 system card (2025). https://cdn.openai.com/pdf/50d5973c-c4ff-4c2d-986f-c72b5d0ff069/sora_2_system_card.pdf

  25. [25] Prajwal, K.R., Mukhopadhyay, R., Namboodiri, V.P., Jawahar, C.: A lip sync expert is all you need for speech to lip generation in the wild. In: Proceedings of the 28th ACM International Conference on Multimedia. pp. 484–492. MM '20, ACM (Oct 2020). https://doi.org/10.1145/3394171.3413532

  26. [26] Radford, A., Kim, J.W., Xu, T., Brockman, G., McLeavey, C., Sutskever, I.: Robust speech recognition via large-scale weak supervision. In: International Conference on Machine Learning. pp. 28492–28518. PMLR (2023)

  27. [27] Ruan, L., Ma, Y., Yang, H., He, H., Liu, B., Fu, J., Yuan, N.J., Jin, Q., Guo, B.: MM-Diffusion: Learning multi-modal diffusion models for joint audio and video generation. In: CVPR (2023)

  28. [28] Schuhmann, C., Beaumont, R., Vencu, R., Gordon, C., Wightman, R., Cherti, M., Coombes, T., Katta, A., Mullis, C., Wortsman, M., et al.: LAION-5B: An open large-scale dataset for training next generation image-text models. Advances in Neural Information Processing Systems 35, 25278–25294 (2022)

  29. [29] Shen, Y., Zhang, D.: A survey of language-guided video object segmentation: from referring to reasoning. Vicinagearth 2(9) (2025). https://doi.org/10.1007/s44336-025-00018-9

  30. [30] SII-OpenMOSS Team, Yu, D., Chen, M., Chen, Q., Luo, Q., Wu, Q., Cheng, Q., Li, R., Liang, T., Zhang, W., Tu, W., Peng, X., Gao, Y., Huo, Y., Zhu, Y., Luo, Y., Zhang, Y., Song, Y., Xu, Z., Zhang, Z., Yang, C., Chang, C., Zhou, C., Chen, H., Ma, H., Li, J., Tong, J., Liu, J., Chen, K., Li, S., Wang, S., Jiang, W., Fei, Z., Ning, Z., Li, C., Li, C., He, Z., ...

  31. [31] Siméoni, O., Vo, H.V., Seitzer, M., Baldassarre, F., Oquab, M., Jose, C., Khalidov, V., Szafraniec, M., Yi, S., Ramamonjisoa, M., et al.: DINOv3. arXiv preprint arXiv:2508.10104 (2025)

  32. [32] Song, K., Chen, B., Simchowitz, M., Du, Y., Tedrake, R., Sitzmann, V.: History-guided video diffusion (2025). https://arxiv.org/abs/2502.06764

  33. [33] Team, O., Yu, D., Chen, M., Chen, Q., Luo, Q., Wu, Q., Cheng, Q., Li, R., Liang, T., Zhang, W., Tu, W., Peng, X., Gao, Y., Huo, Y., Zhu, Y., Luo, Y., Zhang, Y., Song, Y., Xu, Z., Zhang, Z., Yang, C., Chang, C., Zhou, C., Chen, H., Ma, H., Li, J., Tong, J., Liu, J., Chen, K., Li, S., Jiang, S., Wang, S., Jiang, W., Fei, Z., Ning, Z., Li, C., Li, C., He, Z....

  34. [34] Tian, Z., Liu, Z., Yuan, R., Pan, J., Liu, Q., Tan, X., Chen, Q., Xue, W., Guo, Y.: VidMuse: A simple video-to-music generation framework with long-short-term modeling. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 18782–18793 (2025)

  35. [35] Vyas, A., Shi, B., Le, M., Tjandra, A., Wu, Y.C., Guo, B., Zhang, J., Zhang, X., Adkins, R., Ngan, W., et al.: Audiobox: Unified audio generation with natural language prompts. arXiv preprint arXiv:2312.15821 (2023)

  36. [36] Wan, T., Wang, A., Ai, B., Wen, B., Mao, C., Xie, C.W., Chen, D., Yu, F., Zhao, H., Yang, J., et al.: Wan: Open and advanced large-scale video generative models. arXiv preprint arXiv:2503.20314 (2025)

  37. [37] Wang, D., Zuo, W., Li, A., Chen, L.H., Liao, X., Zhou, D., Yin, Z., Dai, X., Jiang, D., Yu, G.: Universe-1: Unified audio-video generation via stitching of experts. arXiv preprint arXiv:2509.06155 (2025)

  38. [38] Wang, J.C., Lu, W.T., Chen, J.: Mel-RoFormer for vocal separation and vocal melody transcription (2024). https://arxiv.org/abs/2409.04702

  39. [39] Wang, L.X.X., Zhang, H., Dong, C., Shan, Y.: VFHQ: A high-quality dataset and benchmark for video face super-resolution (2022). https://arxiv.org/abs/2205.03409

  40. [40] Wang, W., Yang, Y.: VidProM: A million-scale real prompt-gallery dataset for text-to-video diffusion models. Advances in Neural Information Processing Systems 37, 65618–65642 (2024)

  41. [41] Yu, J., Zhu, H., Jiang, L., Loy, C.C., Cai, W., Wu, W.: CelebV-Text: A large-scale facial text-video dataset. In: CVPR (2023)

  42. [42] Yuan, R., Lin, H., Guo, S., Zhang, G., Pan, J., Zang, Y., Liu, H., Du, X., Du, X., Ye, Z., Zheng, T., Jiang, Z., Ma, Y., Liu, M., Yu, L., Tian, Z., Zhou, Z., Xue, L., Qu, X., Li, Y., Shen, T., Ma, Z., Wu, S., Zhan, J., Wang, C., Wang, Y., Zhou, X., Chi, X., Zhang, X., Yang, Z., Liang, Y., Wang, X., Liu, S., Mei, L., Li, P., Chen, Y., Lin, C., Chen, X., Xi...

  43. [43] Yuan, R., Lin, H., Guo, S., Zhang, G., Pan, J., Zang, Y., Liu, H., Liang, Y., Ma, W., Du, X., Du, X., Ye, Z., Zheng, T., Jiang, Z., Ma, Y., Liu, M., Tian, Z., Zhou, Z., Xue, L., Qu, X., Li, Y., Wu, S., Shen, T., Ma, Z., Zhan, J., Wang, C., Wang, Y., Chi, X., Zhang, X., Yang, Z., Wang, X., Liu, S., Mei, L., Li, P., Wang, J., Yu, J., Pang, G., Li, X., Wang,...

  44. [44] Zhang, G., Zhou, Z., Hu, T., Peng, Z., Zhang, Y., Chen, Y., Zhou, Y., Lu, Q., Wang, L.: UniAVGen: Unified audio and video generation with asymmetric cross-modal interactions. arXiv preprint arXiv:2511.03334 (2025)

  45. [45] Zhang, H., Huang, S., Guo, Y., Li, X.: Variational positive-incentive noise: How noise benefits models. IEEE Transactions on Pattern Analysis and Machine Intelligence 47(9), 8313–8320 (2025). https://doi.org/10.1109/TPAMI.2025.3575295

  46. [46] Zhang, X., Li, J., Zhang, J., Dang, Z., Ren, J., Bo, L., Tu, Z.: SemTalk: Holistic co-speech motion generation with frame-level semantic emphasis. arXiv preprint arXiv:2412.16563 (2024)

  47. [47] Zhang, X., Li, J., Zhang, J., Ren, J., Bo, L., Tu, Z.: EchoMask: Speech-queried attention-based mask modeling for holistic co-speech motion generation (2025). https://arxiv.org/abs/2504.09209

  48. [48] Zhang, Z., Li, L., Ding, Y., Fan, C.: Flow-guided one-shot talking face generation with a high-resolution audio-visual dataset. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 3661–3670 (2021)

  49. [49] Zhao, L., Feng, L., Ge, D., Chen, R., Yi, F., Zhang, C., Zhang, X.L., Li, X.: UniForm: A unified multi-task diffusion transformer for audio-video generation (2025). https://arxiv.org/abs/2502.03897

  50. [50] Zhu, H., Kang, W., Yao, Z., Guo, L., Kuang, F., Li, Z., Zhuang, W., Lin, L., Povey, D.: ZipVoice: Fast and high-quality zero-shot text-to-speech with flow matching. arXiv preprint arXiv:2506.13053 (2025)