Unison: Harmonizing Motion, Speech, and Sound for Human-Centric Audio-Video Generation
Pith reviewed 2026-05-12 02:23 UTC · model grok-4.3
The pith
Unison is a unified framework that harmonizes motion, speech, and sound in human-centric video generation through explicit multimodal strategies.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Unison is a unified framework that explicitly promotes coherence across the motion, speech, and sound modalities. Within the audio stream, it employs a semantic-guided harmonization strategy that decouples speech and sound-effect components using bidirectional audio cross-attention and semantic-conditioned gating to mitigate speech dominance and improve clarity. For audio-motion synchronization, it proposes a bidirectional cross-modal forcing strategy in which the cleaner modality guides the noisier one through decoupled denoising schedules reinforced by progressive stabilization. Extensive experiments demonstrate state-of-the-art performance in audio perceptual quality and cross-modal synchronization.
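To make the harmonization strategy concrete, here is a minimal PyTorch sketch of one way the described pieces could fit together: speech and sound-effect token streams exchange context through bidirectional cross-attention, and a gate computed from a semantic embedding adaptively recombines them so neither stream dominates. The module names, dimensions, and the complementary-gate form are illustrative assumptions, not the paper's actual architecture.

import torch
import torch.nn as nn

class BidirectionalAudioHarmonizer(nn.Module):
    def __init__(self, dim: int = 512, heads: int = 8):
        super().__init__()
        # Speech attends to sound effects and vice versa (bidirectional).
        self.speech_to_sfx = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.sfx_to_speech = nn.MultiheadAttention(dim, heads, batch_first=True)
        # Gate conditioned on a semantic embedding (e.g., a caption encoder output).
        self.gate = nn.Sequential(nn.Linear(dim, dim), nn.Sigmoid())

    def forward(self, speech, sfx, semantic):
        # speech, sfx: (B, T, dim) latent audio tokens; semantic: (B, dim).
        speech_ctx, _ = self.speech_to_sfx(speech, sfx, sfx)  # speech queries sfx
        sfx_ctx, _ = self.sfx_to_speech(sfx, speech, speech)  # sfx queries speech
        g = self.gate(semantic).unsqueeze(1)  # (B, 1, dim), values in [0, 1]
        # Semantic-driven adaptive recomposition: the gate decides, per channel,
        # how much cross-stream context each component absorbs, so a loud speech
        # stream cannot unconditionally drown out environmental effects.
        return speech + g * speech_ctx, sfx + (1.0 - g) * sfx_ctx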
What carries the argument
Semantic-guided harmonization strategy with bidirectional audio cross-attention and semantic-conditioned gating, together with bidirectional cross-modal forcing through decoupled denoising schedules.
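Read literally, the forcing strategy can be sketched as two denoisers running on offset noise schedules, where whichever latent currently sits at the lower noise level conditions the other's update. The denoiser interface, the linear schedule, and the fixed offset below are assumptions for illustration; the abstract does not specify these details.

def bidirectional_forcing(audio, motion, denoise_audio, denoise_motion,
                          steps: int = 50, offset: int = 5):
    # Decoupled schedules: one modality runs `offset` steps ahead, so its latent
    # is cleaner and can guide the other; a negative offset flips the direction.
    # Noise levels t are in [0, 1], with 1 meaning pure noise.
    for step in range(steps):
        t_audio = (steps - 1 - step) / (steps - 1)
        t_motion = max(steps - 1 - step - offset, 0) / (steps - 1)
        if t_motion <= t_audio:
            # Motion is currently cleaner: update it first, then let it force audio.
            motion = denoise_motion(motion, t_motion, cond=audio)
            audio = denoise_audio(audio, t_audio, cond=motion)
        else:
            # Audio is cleaner: the guidance direction reverses.
            audio = denoise_audio(audio, t_audio, cond=motion)
            motion = denoise_motion(motion, t_motion, cond=audio)
    return audio, motion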
If this is right
- Reduces mismatches between human motion, spoken words, and background sounds in the final video.
- Decouples and recomposes audio components to lessen speech dominance while preserving environmental effects.
- Enables the cleaner signal to steer the noisier one during generation for tighter cross-modal timing.
- Delivers measurable gains in perceptual audio quality and synchronization accuracy over prior unified models.
- Shows that progressive stabilization during denoising helps maintain overall coherence (a minimal sketch of one such schedule follows this list).
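The "progressive stabilization" mentioned above can be read as an annealed cross-modal guidance weight: early denoising steps permit strong mutual forcing, while later steps taper it so each stream settles without being perturbed. The cosine taper and the weight range below are assumptions; the abstract does not specify the schedule's form.

import math

def stabilization_weight(step: int, total_steps: int,
                         w_start: float = 1.0, w_end: float = 0.1) -> float:
    # Cosine anneal from w_start down to w_end as denoising progresses.
    progress = step / max(total_steps - 1, 1)
    return w_end + 0.5 * (w_start - w_end) * (1.0 + math.cos(math.pi * progress))

# The weight would scale how strongly the cleaner modality's latent is mixed
# into the noisier one's conditioning at each step.
print([round(stabilization_weight(s, 50), 3) for s in (0, 25, 49)])
# [1.0, 0.536, 0.1]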
Where Pith is reading between the lines
- The decoupling tactic could transfer to other generation tasks where one signal type tends to overpower others, such as text-to-image with added audio.
- If the forcing strategy scales, it might reduce the need for separate post-processing stages in long-form video pipelines.
- Applying the same bidirectional guidance principle to additional inputs like camera motion or lighting could yield even tighter scene consistency.
- The emphasis on semantic gating suggests a general route for making diffusion-based multimodal models more controllable without extra labels.
Load-bearing premise
The assumption that the semantic-guided harmonization and bidirectional cross-modal forcing will reliably improve coherence without introducing new artifacts or requiring extensive post-hoc tuning on specific datasets.
What would settle it
A side-by-side human evaluation on held-out human-centric video clips: if viewers consistently rate Unison outputs as having worse speech-sound balance or motion-audio timing than outputs from a strong baseline model, the coherence claim fails.
Original abstract
Motion, speech, and sound effects are fundamental elements of human-centric videos, yet their heterogeneous temporal characteristics make joint generation highly challenging. Existing audio-video generation models often fail to maintain consistent alignment across these modalities, leading to noticeable mismatches between motion, speech, and environmental sounds. We present Unison, a unified framework that explicitly promotes coherence across the motion, speech, and sound modalities. Within the audio stream, Unison employs a semantic-guided harmonization strategy that decouples the generation of speech and sound-effect components. Leveraging bidirectional audio cross-attention and semantic-conditioned gating for semantic-driven adaptive recomposition, this approach effectively mitigates speech dominance and enhances acoustic clarity. For audio-motion synchronization, we propose a bidirectional cross-modal forcing strategy where the cleaner modality guides the noisier one through decoupled denoising schedules, reinforced by a progressive stabilization strategy. Extensive experiments demonstrate that Unison achieves state-of-the-art performance in both audio perceptual quality and cross-modal synchronization, highlighting the importance of explicit multimodal harmonization in human-centric video generation.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents Unison, a unified framework for generating motion, speech, and sound effects in human-centric videos. It employs a semantic-guided harmonization strategy in the audio stream using bidirectional audio cross-attention and semantic-conditioned gating to decouple speech and sound-effect generation, mitigating speech dominance. For audio-motion synchronization, it proposes bidirectional cross-modal forcing with decoupled denoising schedules and a progressive stabilization strategy. The authors claim that extensive experiments show state-of-the-art performance in audio perceptual quality and cross-modal synchronization.
Significance. If the empirical claims hold, this contribution would be significant in the field of audio-video generation by providing explicit mechanisms for multimodal coherence, which could improve the quality of generated human-centric content and influence subsequent research on handling heterogeneous modalities.
Major comments (1)
- Abstract: The abstract asserts state-of-the-art results in audio perceptual quality and cross-modal synchronization but provides no quantitative metrics, baselines, ablation studies, or error analysis to support this claim. The soundness of the central empirical claim cannot be verified without the full results and methods sections.
Simulated Author's Rebuttal
We thank the referee for their review and for highlighting the need to substantiate the empirical claims. We address the major comment point by point below.
Point-by-point responses
- Referee: Abstract: The abstract asserts state-of-the-art results in audio perceptual quality and cross-modal synchronization but provides no quantitative metrics, baselines, ablation studies, or error analysis to support this claim. The soundness of the central empirical claim cannot be verified without the full results and methods sections.
- Authors: We agree that the abstract itself contains no quantitative metrics, baselines, or detailed analysis, which is standard given length constraints. The full manuscript provides these in the Experiments section, including perceptual quality metrics for audio, synchronization metrics across modalities, comparisons against multiple baselines, ablation studies on the semantic-guided harmonization and bidirectional cross-modal forcing components, and supporting analysis. These sections directly support the state-of-the-art claims summarized in the abstract, so the referee's concern about verifiability is addressed by the complete paper. Revision: no.
Circularity Check
No significant circularity in derivation chain
Full rationale
The paper describes a high-level unified framework consisting of independent design choices: semantic-guided harmonization (with bidirectional audio cross-attention and semantic-conditioned gating) inside the audio stream, plus bidirectional cross-modal forcing with decoupled denoising schedules for audio-motion sync. No equations, first-principles derivations, or quantitative predictions appear in the provided text. Claims of SOTA performance are tied directly to 'extensive experiments' rather than any internal reduction to fitted parameters or self-referential definitions. The strategies target stated modality heterogeneity without reducing to their own inputs by construction. This is the common case of a non-circular engineering paper.