MAVIN: Multi-Shot Audio-Visual Generation with Narrative Control

Kaiqi Liu, Yunyao Mao, Ziqi Cai, Zheng Geng, Jing Wang, Qiulin Wang, Xintao Wang, Pengfei Wan, Kun Gai, Shuchen Weng, Boxin Shi

arxiv: 2606.29473 · v1 · pith:TZGGVWPFnew · submitted 2026-06-28 · 💻 cs.CV

Pith reviewed 2026-06-30 07:15 UTC · model grok-4.3

classification 💻 cs.CV

keywords multi-shot audio-visual generation · narrative control · boundary-aware attention · ID-aware propagation · multi-agent scripting pipeline · MAVINSet dataset · generative models · filmmaking workflows

The pith

MAVIN provides the first framework for multi-shot audio-visual generation with customized narrative control.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces MAVIN to enable coherent generation of multiple video shots with audio that follow a user-provided narrative. Current models have trouble keeping timing consistent across shots, maintaining the same characters, and fully scripting the content. MAVIN uses boundary-aware attention to respect shot timings, ID-aware propagation to keep identities fixed, and a multi-agent system to create detailed scripts from simple inputs. It also releases the MAVINSet dataset and reports superior results compared to prior methods.

Core claim

MAVIN is the first framework for multi-shot audio-visual generation with customized narrative control. To resolve temporal misalignment, we propose boundary-aware attention, which leverages hierarchical captions and boundary-aware token routing to render audio-visual elements within their respective temporal boundaries. To improve the controllability for multi-subject scenarios, we propose ID-aware propagation, utilizing identity embeddings and an identity-aware mask to bind specific identities to consistent visual appearances and vocal timbres. To provide comprehensive audio-visual narratives, we present a multi-agent scripting pipeline to transform free-form user inputs into hierarchical c

What carries the argument

Boundary-aware attention for temporal alignment combined with ID-aware propagation for identity consistency across shots.

If this is right

Resolves temporal misalignment using boundary-aware attention with hierarchical captions.
Improves controllability in multi-subject cases via ID-aware propagation with embeddings and masks.
Generates comprehensive narratives from free-form inputs using multi-agent scripting.
Achieves state-of-the-art performance on multi-shot audio-visual tasks.
Opens integration of generative models into professional filmmaking workflows.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The techniques could be applied to single-shot or longer-form video generation to improve consistency.
MAVINSet may become a standard benchmark for evaluating narrative coherence in generated media.
Combining this with real-time user feedback could lead to interactive storytelling tools.
Similar modular control strategies might address consistency issues in other generative domains like music or 3D animation.

Load-bearing premise

The boundary-aware attention, ID-aware propagation, and multi-agent scripting pipeline will successfully address temporal misalignment, limited controllability, and incomplete scripting.

What would settle it

A test on MAVINSet where generated sequences still exhibit audio-visual misalignment across shot boundaries or inconsistent character identities and timbres would show the proposed mechanisms fail to resolve the problems.
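
One way such a test could be scored is sketched below. It is an illustration, not the paper's evaluation protocol: the function names, the event tuples, and the use of cosine similarity over embeddings from some face or speaker encoder are all assumptions introduced here.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def identity_drift(shot_embeddings, reference):
    """Lowest similarity between the reference identity and any shot.

    shot_embeddings: one embedding per shot for the same character
    (hypothetical output of a face or speaker encoder).
    A low value means the character changed somewhere in the sequence.
    """
    return min(cosine(e, reference) for e in shot_embeddings)


def boundary_violations(events, shots):
    """Return the events that spill outside their assigned shot.

    events: list of (shot_index, start_sec, end_sec), e.g. detected speech.
    shots:  list of (start_sec, end_sec) taken from the script.
    """
    bad = []
    for shot_index, start, end in events:
        shot_start, shot_end = shots[shot_index]
        if start < shot_start or end > shot_end:
            bad.append((shot_index, start, end))
    return bad


if __name__ == "__main__":
    shots = [(0.0, 4.0), (4.0, 9.0)]
    events = [(0, 1.0, 3.5), (1, 3.8, 6.0)]  # second line starts too early
    print(boundary_violations(events, shots))
    print(identity_drift([[1.0, 0.0], [0.0, 1.0]], [1.0, 0.0]))
```

A sequence would count against the claims if `boundary_violations` is non-empty or `identity_drift` falls below whatever threshold the evaluator picks; the threshold itself is a free choice not given in the source.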

Figures

Figures reproduced from arXiv: 2606.29473 by Boxin Shi, Jing Wang, Kaiqi Liu, Kun Gai, Pengfei Wan, Qiulin Wang, Shuchen Weng, Xintao Wang, Yunyao Mao, Zheng Geng, Ziqi Cai.

**Figure 1.** Illustration of our MAVIN framework. First row: Leveraging boundary-aware attention, MAVIN enables precise temporal alignment for shot transitions, dialogue intervals, and narrative events. Second row: Through ID-aware propagation, MAVIN allows users to customize multiple subjects via reference images and audio, maintaining identity consistency across complex cinematic narratives. Third row: With the mult…

**Figure 2.** Overview of our automatic data curation pipeline. (a) Shot grouping and filtering. We collect videos from diverse open-source datasets and online platforms. Raw videos are segmented into shots, grouped into multi-shot cinematic clips, and filtered for audio-visual aesthetics and clarity. (b) Hierarchical caption annotation. We generate multi-level captions to capture global storylines, shot-level events, a…

**Figure 3.** Given free-form user inputs (purple block), we first employ a multi-agent scripting pipeline (red block) to transform them into hierarchical captions (blue block), detailing complex narrative timelines with precise shot boundaries and dialogue intervals. Within the ID-aware propagation, input video clips and optional image references are compressed by a Video VAE into latent tokens (orange block). The aud…

**Figure 4.** Qualitative comparison with state-of-the-art relevant generation models.

5.4 Ablation Study. As presented in Tab. 2, we ablate key components of our framework to evaluate their individual contributions and report these quantitative results. Since some ablations target customized generation capabilities, we conduct these experiments with explicit video and audio references. W/o BTR (Boundary-aware Token Ro…
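
The Figure 3 caption describes hierarchical captions that carry a narrative timeline with shot boundaries and dialogue intervals. A minimal sketch of what such a structure might look like follows. The field names, the three-level layout, and the `validate` helper are hypothetical; the paper's actual caption format is not shown on this page.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Dialogue:
    speaker_id: str   # hypothetical identity label, e.g. "ID_1"
    start: float      # seconds from the start of the clip
    end: float
    text: str


@dataclass
class Shot:
    start: float
    end: float
    caption: str
    dialogues: List[Dialogue] = field(default_factory=list)


@dataclass
class Script:
    storyline: str    # global, clip-level description
    shots: List[Shot]


def validate(script: Script) -> List[str]:
    """Check that shots are ordered and dialogue stays inside its shot."""
    errors = []
    prev_end = None
    for i, shot in enumerate(script.shots):
        if shot.end <= shot.start:
            errors.append(f"shot {i}: empty or negative duration")
        if prev_end is not None and shot.start < prev_end:
            errors.append(f"shot {i}: overlaps previous shot")
        for d in shot.dialogues:
            if d.start < shot.start or d.end > shot.end:
                errors.append(f"shot {i}: dialogue by {d.speaker_id} leaves the shot")
        prev_end = shot.end
    return errors


if __name__ == "__main__":
    script = Script(
        storyline="Two friends argue, then reconcile.",
        shots=[
            Shot(0.0, 4.0, "Wide shot of a kitchen.",
                 [Dialogue("ID_1", 1.0, 3.0, "You forgot again.")]),
            Shot(4.0, 9.0, "Close-up on the second friend.",
                 [Dialogue("ID_2", 4.5, 7.0, "I know. I'm sorry.")]),
        ],
    )
    print(validate(script))
```

A check like `validate` is the kind of constraint a scripting pipeline could run before generation, so that any misalignment seen later can be attributed to the generator rather than to the script.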

Original abstract

While recent generative models produce high-fidelity videos, they struggle with the complex narrative control required for coherent multi-shot audio-visual generation. Existing methods suffer from temporal misalignment, limited controllability, and incomplete scripting. In this paper, we propose MAVIN, the first framework for multi-shot audio-visual generation with customized narrative control. To resolve temporal misalignment, we propose boundary-aware attention, which leverages hierarchical captions and boundary-aware token routing to render audio-visual elements within their respective temporal boundaries. To improve the controllability for multi-subject scenarios, we propose ID-aware propagation, utilizing identity embeddings and an identity-aware mask to bind specific identities to consistent visual appearances and vocal timbres. To provide comprehensive audio-visual narratives, we present a multi-agent scripting pipeline to transform free-form user inputs into hierarchical captions. Furthermore, we construct MAVINSet, a multi-shot audio-visual dataset for robust training and evaluation. Extensive experiments demonstrate that MAVIN achieves state-of-the-art performance, opening up a new avenue for integrating generative models into professional filmmaking workflows.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

MAVIN names plausible fixes for multi-shot AV coherence but the abstract supplies zero data or comparisons, so the SOTA claim and mechanism efficacy remain untestable.

The main thing here is that the paper describes a framework called MAVIN for multi-shot audio-visual generation with narrative control, but the abstract alone gives no numbers, baselines, or results to evaluate whether the proposed pieces actually work.

What is new is the explicit combination of boundary-aware attention that uses hierarchical captions and token routing to keep elements inside shot boundaries, ID-aware propagation that carries identity embeddings and masks to hold visual and vocal consistency across shots, a multi-agent scripting pipeline that turns free-form text into structured captions, and the release of the MAVINSet dataset for training and testing. These target documented problems in current video generators: shots that drift out of sync, subjects that change appearance, and scripts that stay too loose.

The paper does a reasonable job laying out the problem and sketching an architecture that tries to solve it in one system. The components are described at a level that lets a reader see how they might fit together.

The soft spots are large and central. The abstract states that the new attention and propagation methods resolve misalignment and controllability issues and that MAVIN reaches state-of-the-art, yet it contains no metrics, no qualitative examples, no ablation studies, and no comparison to prior multi-shot or controllable video work. The stress-test note is right on target: the descriptions of token routing and identity masks do not specify hard constraints or drift-prevention steps, so there is no way to judge whether they actually isolate boundaries or maintain identity. The multi-agent pipeline's success at producing complete narratives is also asserted without any reported measure. Because the full text was not supplied for this review, these gaps cannot be checked.

This paper is for researchers working on controllable generative video and audio for longer-form content. Someone hunting for architectural ideas might skim it for the component names. Anyone needing validated methods or reproducible results will find nothing usable.

I would not bring it to a reading group. I would not cite it. It does not deserve peer review until the experiments and comparisons are present and the mechanism claims can be inspected.

Referee Report

3 major / 1 minor

Summary. The manuscript proposes MAVIN as the first framework for multi-shot audio-visual generation with customized narrative control. It introduces boundary-aware attention (leveraging hierarchical captions and boundary-aware token routing) to address temporal misalignment, ID-aware propagation (using identity embeddings and an identity-aware mask) to improve controllability for multi-subject scenarios, a multi-agent scripting pipeline to generate hierarchical captions from free-form inputs, and the MAVINSet dataset for training and evaluation. The paper asserts that extensive experiments demonstrate state-of-the-art performance.

Significance. If the proposed components are shown to deliver the claimed improvements in temporal consistency, identity binding, and narrative completeness, the work could open practical pathways for controllable generative models in professional audio-visual production workflows.

major comments (3)

[Abstract] The claim that boundary-aware attention resolves temporal misalignment rests on the unverified assertion that 'boundary-aware token routing' enforces per-boundary isolation; no mechanism (e.g., hard masking, explicit boundary tokens, or leakage-prevention analysis) is described that would guarantee no cross-boundary leakage, which is load-bearing for the central misalignment-resolution claim.
[Abstract] The ID-aware propagation is asserted to 'bind specific identities to consistent visual appearances and vocal timbres' via an 'identity-aware mask,' yet no details are given on how the mask prevents identity drift across shots (soft weighting vs. hard enforcement), undermining the controllability claim for multi-subject scenarios.
[Abstract] The assertion of 'state-of-the-art performance' and 'extensive experiments' is made without reference to any metrics, baselines, quantitative results, error bars, or dataset statistics, making it impossible to evaluate whether the proposed methods actually outperform prior work.
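
The distinction the first two comments draw between hard enforcement and soft weighting can be made concrete with a toy attention row. This is a generic sketch and not MAVIN's boundary-aware token routing, which the abstract does not specify; `shot_ids`, `masked_softmax`, and `leakage` are hypothetical names, and the penalty value is arbitrary.

```python
import math


def shot_ids(num_tokens, boundaries):
    """Label each token with the index of the shot it belongs to.

    boundaries: sorted token indices at which a new shot begins.
    """
    ids, shot = [], 0
    for t in range(num_tokens):
        while shot < len(boundaries) and t >= boundaries[shot]:
            shot += 1
        ids.append(shot)
    return ids


def masked_softmax(scores, query_shot, key_shots, penalty):
    """Softmax over keys, adding `penalty` to keys from other shots.

    penalty = -inf is a hard mask; a finite negative value is a soft bias.
    """
    adjusted = [s if k == query_shot else s + penalty
                for s, k in zip(scores, key_shots)]
    peak = max(adjusted)
    exps = [math.exp(a - peak) for a in adjusted]
    total = sum(exps)
    return [e / total for e in exps]


def leakage(weights, query_shot, key_shots):
    """Total attention mass the query places on other shots."""
    return sum(w for w, k in zip(weights, key_shots) if k != query_shot)


if __name__ == "__main__":
    ids = shot_ids(6, [3])          # tokens 0-2 in shot 0, tokens 3-5 in shot 1
    scores = [0.0] * 6              # uniform raw scores for a shot-0 query
    hard = masked_softmax(scores, 0, ids, float("-inf"))
    soft = masked_softmax(scores, 0, ids, -2.0)
    print(leakage(hard, 0, ids))    # exactly zero under a hard mask
    print(leakage(soft, 0, ids))    # small but positive under a soft bias
```

Only the hard mask gives a guarantee of zero cross-boundary attention; a soft bias reduces leakage without eliminating it, which is why the referee asks which of the two the paper uses.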

minor comments (1)

[Abstract] The abstract uses several technical terms (boundary-aware token routing, identity-aware mask, hierarchical captions) without initial definitions or forward references to where they are formalized in the manuscript.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback focused on the abstract. We agree that the abstract would benefit from additional detail on the mechanisms and results to better support its claims. We will revise the abstract accordingly while ensuring the changes remain concise. Below we respond to each major comment.

Referee: [Abstract] The claim that boundary-aware attention resolves temporal misalignment rests on the unverified assertion that 'boundary-aware token routing' enforces per-boundary isolation; no mechanism (e.g., hard masking, explicit boundary tokens, or leakage-prevention analysis) is described that would guarantee no cross-boundary leakage, which is load-bearing for the central misalignment-resolution claim.

Authors: We agree that the abstract, being concise, does not describe the implementation details of boundary-aware token routing or how it enforces isolation. The full manuscript provides these details in the methods section. To address the concern, we will revise the abstract to include a brief reference to the isolation mechanism as presented in the paper body. Revision: yes.

Referee: [Abstract] The ID-aware propagation is asserted to 'bind specific identities to consistent visual appearances and vocal timbres' via an 'identity-aware mask,' yet no details are given on how the mask prevents identity drift across shots (soft weighting vs. hard enforcement), undermining the controllability claim for multi-subject scenarios.

Authors: We agree that the abstract does not specify how the identity-aware mask operates to prevent drift. The manuscript describes the identity embeddings and mask in the technical sections. We will revise the abstract to briefly clarify the mask's role in maintaining consistency across shots. Revision: yes.

Referee: [Abstract] The assertion of 'state-of-the-art performance' and 'extensive experiments' is made without reference to any metrics, baselines, quantitative results, error bars, or dataset statistics, making it impossible to evaluate whether the proposed methods actually outperform prior work.

Authors: We agree that the abstract summarizes without citing specific metrics or results. The full manuscript presents the quantitative evaluation, baselines, metrics, error bars, and dataset statistics in the experiments section. We will revise the abstract to include a short reference to the key performance outcomes. Revision: yes.

Circularity Check

0 steps flagged

No circularity: framework components and dataset are independently specified

The paper introduces boundary-aware attention, ID-aware propagation, a multi-agent scripting pipeline, and MAVINSet dataset as novel contributions without any equations, fitted parameters, or derivations that reduce to their own inputs. Claims of resolving misalignment and improving controllability are presented as design choices evaluated via experiments on the new dataset, with no self-citation load-bearing steps or renamings of prior results. The derivation chain is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No technical details, equations, or methods are available from the abstract, so the ledger is empty.

Reference graph

Works this paper leans on

90 extracted references · 33 canonical work pages · 16 internal anchors

[1]

In: Asian Conference on Computer Vision (2020)

Bain, M., Nagrani, A., Brown, A., Zisserman, A.: Condensed movies: Story based retrieval with contextual embeddings. In: Asian Conference on Computer Vision (2020)

2020
[2]

In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (2024)

Cai, S., Ceylan, D., Gadelha, M., Huang, C.H.P., Wang, T.Y., Wetzstein, G.: Generative rendering: Controllable 4D-guided video generation with 2D diffusion models. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (2024)

2024
[3]

In: European Conference on Computer Vision (2020)

Chatterjee, M., Cherian, A.: Sound2Sight: Generating visual dynamics from sound and context. In: European Conference on Computer Vision (2020)

2020
[4]

IEEE International Conference on Acoustics, Speech and Signal Processing (2020)

Chen, H., Xie, W., Vedaldi, A., Zisserman, A.: VGGSound: A large-scale audio- visual dataset. IEEE International Conference on Acoustics, Speech and Signal Processing (2020)

2020
[5]

Advances in Neural Information Processing Systems (2024)

Chen, K., Li, X., Li, Y., Tong, Y., Wu, J., Zeng, Y., Zhang, J., Zhou, Q.: Mo- tionBooth: Motion-aware customized text-to-video generation. Advances in Neural Information Processing Systems (2024)

2024
[6]

IEEE Journal of Selected Topics in Signal Processing (2022)

Chen,S.,Wang,C.,Chen,Z.,Wu,Y.,Liu,S.,Chen,Z.,Li,J.,Kanda,N.,Yoshioka, T., Xiao, X., et al.: WavLM: Large-scale self-supervised pre-training for full stack speech processing. IEEE Journal of Selected Topics in Signal Processing (2022)

2022
[7]

In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (2025)

Chen, T.S., Siarohin, A., Menapace, W., Fang, Y., Lee, K.S., Skorokhodov, I., Aberman, K., Zhu, J.Y., Yang, M.H., Tulyakov, S.: Multi-subject open-set person- alization in video generation. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (2025)

2025
[8]

In: Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (2025)

Chen, Y., Niu, Z., Ma, Z., Deng, K., Wang, C., JianZhao, J., Yu, K., Chen, X.: F5-TTS: A fairytaler that fakes fluent and faithful speech with flow matching. In: Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (2025)

2025
[9]

Cheng, H.K., Ishii, M., Hayakawa, A., Shibuya, T., Schwing, A., Mitsufuji, Y.: MMAudio: Taming multimodal joint training for high-quality video-to-audio syn- thesis.IEEE/CVFConferenceonComputerVisionandPatternRecognition(2025)

2025
[10]

In: Asian Conference on Computer Vision (2016)

Chung, J.S., Zisserman, A.: Out of time: automated lip sync in the wild. In: Asian Conference on Computer Vision (2016)

2016
[11]

Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

Comanici, G., Bieber, E., Schaekermann, M., Pasupat, I., Sachdeva, N., Dhillon, I., Blistein, M., Ram, O., Zhang, D., Rosen, E., et al.: Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. arXiv preprint arXiv:2507.06261 (2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025
[12]

In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (2019)

Deng, J., Guo, J., Xue, N., Zafeiriou, S.: ArcFace: Additive angular margin loss for deep face recognition. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (2019)

2019
[13]

CosyVoice: A Scalable Multilingual Zero-shot Text-to-speech Synthesizer based on Supervised Semantic Tokens

Du, Z., Chen, Q., Zhang, S., Hu, K., Lu, H., Yang, Y., Hu, H., Zheng, S., Gu, Y., Ma, Z., et al.: CosyVoice: A scalable multilingual zero-shot text-to-speech synthe- sizer based on supervised semantic tokens. arXiv preprint arXiv:2407.05407 (2024)

work page internal anchor Pith review Pith/arXiv arXiv 2024
[14]

ElevenLabs: Introducing Scribe v2.https://elevenlabs.io/blog/introducing- scribe-v2(2026), accessed: 2026-03-05

2026
[15]

In: IEEE International Conference on Acoustics, Speech and Signal Processing (2023)

Elizalde, B., Deshmukh, S., Al Ismail, M., Wang, H.: CLAP: learning audio con- cepts from natural language supervision. In: IEEE International Conference on Acoustics, Speech and Signal Processing (2023)

2023
[16]

Seedance 1.0: Exploring the Boundaries of Video Generation Models

Gao, Y., Guo, H., Hoang, T., Huang, W., Jiang, L., Kong, F., Li, H., Li, J., Li, L., Li, X., et al.: Seedance 1.0: Exploring the boundaries of video generation models. arXiv preprint arXiv:2506.09113 (2025) 26 K. Liu et al

work page internal anchor Pith review Pith/arXiv arXiv 2025
[17]

Gemmeke, J.F., Ellis, D.P., Freedman, D., Jansen, A., Lawrence, W., Moore, R.C., Plakal, M., Ritter, M.: Audio set: An ontology and human-labeled dataset for audio events.In:IEEEinternationalconferenceonacoustics,speechandsignalprocessing (2017)

2017
[18]

IEEE/CVF Conference on Computer Vision and Pattern Recognition (2025)

Geng, D., Herrmann, C., Hur, J., Cole, F., Zhang, S., Pfaff, T., Lopez-Guevara, T., Aytar, Y., Rubinstein, M., Sun, C., Wang, O., Owens, A., Sun, D.: Motion Prompt- ing: Controlling video generation with motion trajectories. IEEE/CVF Conference on Computer Vision and Pattern Recognition (2025)

2025
[19]

arXiv preprint arXiv:2406.10221 (2025)

Ghermi, R., Wang, X., Kalogeiton, V., Laptev, I.: Long Story Short: Story-level video understanding from 20k short films. arXiv preprint arXiv:2406.10221 (2025)

work page arXiv 2025
[20]

In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (2023)

Girdhar, R., El-Nouby, A., Liu, Z., Singh, M., Alwala, K.V., Joulin, A., Misra, I.: Imagebind: One embedding space to bind them all. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (2023)

2023
[21]

LTX-2: Efficient Joint Audio-Visual Foundation Model

HaCohen, Y., Brazowski, B., Chiprut, N., Bitterman, Y., Kvochko, A., Berkowitz, A., Shalem, D., Lifschitz, D., Moshe, D., Porat, E., et al.: LTX-2: Efficient joint audio-visual foundation model. arXiv preprint arXiv:2601.03233 (2026)

work page internal anchor Pith review Pith/arXiv arXiv 2026
[22]

In: International Conference on Computer Vision (2025)

Haji-Ali, M., Menapace, W., Siarohin, A., Skorokhodov, I., Canberk, A., Lee, K.S., Ordonez, V., Tulyakov, S.: AV-Link: Temporally-aligned diffusion features for cross-modal audio-video generation. In: International Conference on Computer Vision (2025)

2025
[23]

In: ACM SIGGRAPH Asia Conference Papers (2025)

He,J.,Liu,H.,Li,J.,Huang,Z.,Yu,Q.,Ouyang,W.,Liu,Z.:Cut2next:Generating next shot via in-context tuning. In: ACM SIGGRAPH Asia Conference Papers (2025)

2025
[24]

IEEE/CVF Conference on Computer Vision and Pattern Recognition (2025)

Henschel, R., Khachatryan, L., Poghosyan, H., Hayrapetyan, D., Tadevosyan, V., Wang, Z., Navasardyan, S., Shi, H.: Streamingt2v: Consistent, dynamic, and ex- tendable long video generation from text. IEEE/CVF Conference on Computer Vision and Pattern Recognition (2025)

2025
[25]

In: International Conference on Learning Representations (2023)

Hong, W., Ding, M., Zheng, W., Liu, X., Tang, J.: CogVideo: Large-scale pretrain- ing for text-to-video generation via transformers. In: International Conference on Learning Representations (2023)

2023
[26]

Qwen3-TTS Technical Report

Hu, H., Zhu, X., He, T., Guo, D., Zhang, B., Wang, X., Guo, Z., Jiang, Z., Hao, H., Guo, Z., et al.: Qwen3-TTS technical report. arXiv preprint arXiv:2601.15621 (2026)

work page internal anchor Pith review Pith/arXiv arXiv 2026
[27]

arXiv preprint arXiv:2410.23775 (2024)

Huang,L.,Wang,W.,Wu,Z.F.,Shi,Y.,Dou,H.,Liang,C.,Feng,Y.,Liu,Y.,Zhou, J.: In-context lora for diffusion transformers. arXiv preprint arXiv:2410.23775 (2024)

work page arXiv 2024
[28]

In: IEEE/CVF Conference on Computer Vision and Pattern Recog- nition (2024)

Huang, Z., He, Y., Yu, J., Zhang, F., Si, C., Jiang, Y., Zhang, Y., Wu, T., Jin, Q., Chanpaisit, N., et al.: VBench: Comprehensive benchmark suite for video genera- tive models. In: IEEE/CVF Conference on Computer Vision and Pattern Recog- nition (2024)

2024
[29]

In: Proceedings of the IEEE/CVF International Conference on Computer Vision (2025)

Jiang, Z., Han, Z., Mao, C., Zhang, J., Pan, Y., Liu, Y.: VACE: All-in-one video creation and editing. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (2025)

2025
[30]

arXiv preprint arXiv:2403.03100 (2024)

Ju, Z., Wang, Y., Shen, K., Tan, X., Xin, D., Yang, D., Liu, Y., Leng, Y., Song, K., Tang, S., et al.: NaturalSpeech 3: Zero-shot speech synthesis with factorized codec and diffusion models. arXiv preprint arXiv:2403.03100 (2024)

work page arXiv 2024
[31]

IEEE/CVF Conference on Computer Vision and Pattern Recognition (2025) MAVIN: Multi-Shot Audio-Visual Generation 27

Kara, O., Singh, K.K., Liu, F., Ceylan, D., Rehg, J.M., Hinz, T.: Shotadapter: Text-to-multi-shot video generation with diffusion models. IEEE/CVF Conference on Computer Vision and Pattern Recognition (2025) MAVIN: Multi-Shot Audio-Visual Generation 27

2025
[32]

Fr\'echet Audio Distance: A Metric for Evaluating Music Enhancement Algorithms

Kilgour, K., Zuluaga, M., Roblek, D., Sharifi, M.: Fr\’echet audio distance: A met- ric for evaluating music enhancement algorithms. arXiv preprint arXiv:1812.08466 (2018)

work page internal anchor Pith review Pith/arXiv arXiv 2018
[33]

IEEE/CVF Conference on Computer Vision and Pattern Recog- nition (2025)

Kushwaha, S.S., Tian, Y.: VinTAGe: Joint video and text conditioning for holistic audio generation. IEEE/CVF Conference on Computer Vision and Pattern Recog- nition (2025)

2025
[34]

IEEE/CVF Conference on Computer Vision and Pattern Recognition (2025)

Lei, G., Wang, C., Zhang, R., Wang, Y., Li, H., Xu, W.: AnimateAnything: Con- sistent and controllable animation for video generation. IEEE/CVF Conference on Computer Vision and Pattern Recognition (2025)

2025
[35]

IEEE/CVF Conference on Computer Vision and Pattern Recognition (2025)

Li, H., Xu, M., Zhan, Y., Mu, S., Li, J., Cheng, K., Chen, Y., Chen, T., Ye, M., Wang, J., Zhu, S.: Openhumanvid: A large-scale high-quality dataset for enhancing human-centric video generation. IEEE/CVF Conference on Computer Vision and Pattern Recognition (2025)

2025
[36]

In: International Conference on Computer Vision (2025)

Li, Q., Xing, Z., Wang, R., Zhang, H., Dai, Q., Wu, Z.: MagicMotion: Control- lable video generation with dense-to-sparse trajectory guidance. In: International Conference on Computer Vision (2025)

2025
[37]

In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (2025)

Li, Z., Rahmani, H., Ke, Q., Liu, J.: LongDiff: Training-free long video generation in one go. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (2025)

2025
[38]

In: Conference on Language Modeling (2024)

Lin, H., Zala, A., Cho, J., Bansal, M.: VideoDirectorGPT: Consistent multi-scene video generation via llm-guided planning. In: Conference on Language Modeling (2024)

2024
[39]

Flow Matching for Generative Modeling

Lipman, Y., Chen, R.T., Ben-Hamu, H., Nickel, M., Le, M.: Flow matching for generative modeling. arXiv preprint arXiv:2210.02747 (2022)

work page internal anchor Pith review Pith/arXiv arXiv 2022
[40]

arXiv preprint arXiv:2301.12503 (2023)

Liu, H., Chen, Z., Yuan, Y., Mei, X., Liu, X., Mandic, D., Wang, W., Plumbley, M.D.: AudioLDM: Text-to-audio generation with latent diffusion models. arXiv preprint arXiv:2301.12503 (2023)

work page arXiv 2023
[41]

arXiv preprint arXiv:2503.23377 (2025)

Liu, K., Li, W., Chen, L., Wu, S., Zheng, Y., Ji, J., Zhou, F., Luo, J., Liu, Z., Fei, H., Chua, T.S.: JavisDiT: Joint audio-video diffusion transformer with hierarchical spatio-temporal prior synchronization. arXiv preprint arXiv:2503.23377 (2025)

work page arXiv 2025
[42]

arXiv preprint arXiv:2602.19163 (2026)

Liu, K., Zheng, Y., Wang, K., Wu, S., Zhang, R., Luo, J., Hatzinakos, D., Liu, Z., Fei, H., Chua, T.S.: JavisDiT++: Unified modeling and optimization for joint audio-video generation. arXiv preprint arXiv:2602.19163 (2026)

work page arXiv 2026
[43]

In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (2025)

Liu, L., Ma, T., Li, B., Chen, Z., Liu, J., Li, G., Zhou, S., He, Q., Wu, X.: Phantom: Subject-consistent video generation via cross-modal alignment. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (2025)

2025
[44]

European Conference on Computer Vision (2024)

Long, F., Qiu, Z., Yao, T., Mei, T.: VideoStudio: Generating consistent-content and multi-scene videos. European Conference on Computer Vision (2024)

2024
[45]

Decoupled Weight Decay Regularization

Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017)

work page internal anchor Pith review Pith/arXiv arXiv 2017
[46]

arXiv (2025)

Low, C., Wang, W., Katyal, C.: Ovi: Twin backbone cross-modal fusion for audio- video generation. arXiv (2025)

2025
[47]

Advances in Neural Information Processing Systems (2023)

Luo, S., Yan, C., Hu, C., Zhao, H.: Diff-Foley: Synchronized video-to-audio syn- thesis with latent diffusion models. Advances in Neural Information Processing Systems (2023)

2023
[48]

Advances in Neural Information Pro- cessing Systems (2024)

Montanaro, A., Savant Aira, L., Aiello, E., Valsesia, D., Magli, E.: Motioncraft: Physics-based zero-shot video generation. Advances in Neural Information Pro- cessing Systems (2024)

2024
[49]

arXiv preprint arXiv:2601.01568 (2026) 28 K

Qiang, C., Wang, J., Wang, X., Yin, K., Guo, Y.: MM-Sonate: Multimodal controllable audio-video generation with zero-shot voice cloning. arXiv preprint arXiv:2601.01568 (2026) 28 K. Liu et al

work page arXiv 2026
[50]

In: International Conference on Machine Learning (2023)

Radford,A.,Kim,J.W.,Xu,T.,Brockman,G.,McLeavey,C.,Sutskever,I.:Robust speech recognition via large-scale weak supervision. In: International Conference on Machine Learning (2023)

2023
[51]

SAM 2: Segment Anything in Images and Videos

Ravi, N., Gabeur, V., Hu, Y.T., Hu, R., Ryali, C., Ma, T., Khedr, H., Rädle, R., Rolland, C., Gustafson, L., et al.: SAM 2: Segment anything in images and videos. arXiv preprint arXiv:2408.00714 (2024)

work page internal anchor Pith review Pith/arXiv arXiv 2024
[52]

IEEE/CVF Conference on Computer Vision and Pattern Recognition (2023)

Ruan, L., Ma, Y., Yang, H., He, H., Liu, B., Fu, J., Yuan, N.J., Jin, Q., Guo, B.: MM-Diffusion: Learning multi-modal diffusion models for joint audio and video generation. IEEE/CVF Conference on Computer Vision and Pattern Recognition (2023)

2023
[53]

com / christophschuhmann/improved-aesthetic-predictor(2022)

Schuhmann, C.: Improved aesthetic predictor.https : / / github . com / christophschuhmann/improved-aesthetic-predictor(2022)

2022
[54]

Hunyuanvideo- foley: Multimodal diffusion with representation alignment for high-fidelity foley audio generation.arXiv preprint arXiv:2508.16930,

Shan, S., Li, Q., Cui, Y., Yang, M., Wang, Y., Yang, Q., Zhou, J., Zhong, Z.: HunyuanVideo-foley: Multimodal diffusion with representation alignment for high- fidelity foley audio generation. arXiv preprint arXiv:2508.16930 (2025)

work page arXiv 2025
[55]

arXiv preprint arXiv:2512.18099 (2025)

Shi, B., Tjandra, A., Hoffman, J., Wang, H., Wu, Y.C., Gao, L., Richter, J., Le, M., Vyas, A., Chen, S., et al.: SAM Audio: Segment anything in audio. arXiv preprint arXiv:2512.18099 (2025)

work page arXiv 2025
[56]

In: Cai, J., Kankanhalli, M.S., Prabhakaran, B., Boll, S., Subramanian, R., Zheng, L., Singh, V.K., César, P., Xie, L., Xu, D

Soucek, T., Lokoc, J.: Transnet V2: an effective deep network architecture for fast shot transition detection. In: Cai, J., Kankanhalli, M.S., Prabhakaran, B., Boll, S., Subramanian, R., Zheng, L., Singh, V.K., César, P., Xie, L., Xu, D. (eds.) ACM International Conference on Multimedia (2024)

2024
[57] Sun, M., Wang, W., Qiao, Y., Sun, J., Qin, Z., Guo, L., Zhu, X., Liu, J.: Mm-ldm: Multi-modal latent diffusion model for sounding video generation. ACM International Conference on Multimedia (2024)
[58] Tan, Z., Yang, X., Liu, S., Wang, X.: Video-infinity: Distributed long video generation. arXiv preprint arXiv:2406.16260 (2024)
[59] Team, O., Yu, D., Chen, M., Chen, Q., Luo, Q., Wu, Q., Cheng, Q., Li, R., Liang, T., Zhang, W., et al.: MOVA: Towards scalable and synchronized video-audio generation. arXiv preprint arXiv:2602.08794 (2026)
[60] Teed, Z., Deng, J.: RAFT: Recurrent all-pairs field transforms for optical flow. In: European Conference on Computer Vision (2020)
[61] Tjandra, A., Wu, Y.C., Guo, B., Hoffman, J., Ellis, B., Vyas, A., Shi, B., Chen, S., Le, M., Zacharov, N., Wood, C., Lee, A., Hsu, W.N.: Meta Audiobox Aesthetics: Unified automatic quality assessment for speech, music, and sound. arXiv preprint arXiv:2502.05139 (2025)
[62] Unterthiner, T., Van Steenkiste, S., Kurach, K., Marinier, R., Michalski, M., Gelly, S.: Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:1812.01717 (2018)
[63] Wan, T., Wang, A., Ai, B., Wen, B., Mao, C., Xie, C.W., Chen, D., Yu, F., Zhao, H., Yang, J., et al.: Wan: Open and advanced large-scale video generative models. arXiv preprint arXiv:2503.20314 (2025)
[64] Wang, D., Zuo, W., Li, A., Chen, L.H., Liao, X., Zhou, D., Yin, Z., Dai, X., Jiang, D., Yu, G.: Universe-1: Unified audio-video generation via stitching of experts. arXiv preprint arXiv:2509.06155 (2025)
[65] Wang, J., Sheng, H., Cai, S., Zhang, W., Yan, C., Feng, Y., Deng, B., Ye, J.: EchoShot: Multi-shot portrait video generation. In: Advances in Neural Information Processing Systems (2025)
[66] Wang, J., Qiang, C., Guo, Y., Wang, Y., Zeng, X., Deng, F.: Apollo: Unified multi-task audio-video joint generation. arXiv e-prints pp. arXiv–2601 (2026)
[67] Wang, K., Deng, S., Shi, J., Hatzinakos, D., Tian, Y.: Av-dit: Taming image diffusion transformers for efficient joint audio and video generation. ACM International Conference on Multimedia (2025)
[68] Wang, Y., He, Y., Li, Y., Li, K., Yu, J., Ma, X., Li, X., Chen, G., Chen, X., Wang, Y., et al.: InternVid: A large-scale video-text dataset for multimodal understanding and generation. arXiv preprint arXiv:2307.06942 (2023)
[69] Wang, Z., Yuan, Z., Wang, X., Li, Y., Chen, T., Xia, M., Luo, P., Shan, Y.: MotionCtrl: A unified and flexible motion controller for video generation. ACM SIGGRAPH Conference Papers (2024)
[70] Weng, S., Zheng, H., Chang, Z., Li, S., Shi, B., Wang, X.: Audio-sync video generation with multi-stream temporal control. In: Advances in Neural Information Processing Systems (2025)
[71] wiseman: py-webrtcvad. https://github.com/wiseman/py-webrtcvad (2016)
[72] Wu, B., Zou, C., Li, C., Huang, D., Yang, F., Tan, H., Peng, J., Wu, J., Xiong, J., Jiang, J., et al.: HunyuanVideo 1.5 technical report. arXiv preprint arXiv:2511.18870 (2025)
[73] Wu, C., Li, J., Zhou, J., Lin, J., Gao, K., Yan, K., ming Yin, S., Bai, S., Xu, X., Chen, Y., Chen, Y., Tang, Z., Zhang, Z., Wang, Z., Yang, A., Yu, B., Cheng, C., Liu, D., Li, D., Zhang, H., Meng, H., Wei, H., Ni, J., Chen, K., Cao, K., Peng, L., Qu, L., Wu, M., Wang, P., Yu, S., Wen, T., Feng, W., Xu, X., Wang, Y., Zhang, Y., Zhu, Y., Wu, Y., Cai, Y., L... arXiv (2025)
[74] Wu, H., Chen, K., Luo, Y., Qiao, R., Ren, B., Liu, H., Xie, W., Shen, L.: Scene consistency representation learning for video scene segmentation. IEEE/CVF Conference on Computer Vision and Pattern Recognition (2022)
[75] Wu, W., Liu, M., Zhu, Z., Xia, X., Feng, H., Wang, W., Lin, K.Q., Shen, C., Shou, M.Z.: Moviebench: A hierarchical movie level dataset for long video generation. IEEE/CVF Conference on Computer Vision and Pattern Recognition (2025)
[76] Wu, W., Zhu, Z., Shou, M.Z.: Automated movie generation via multi-agent CoT planning. arXiv preprint arXiv:2503.07314 (2025)
[77] Wu, X., Gao, B., Qiao, Y., Wang, Y., Chen, X.: CineTrans: Learning to generate videos with cinematic transitions via masked diffusion models. arXiv preprint arXiv:2508.11484 (2025)
[78] Wu, Z., Siarohin, A., Menapace, W., Skorokhodov, I., Fang, Y., Chordia, V., Gilitschenski, I., Tulyakov, S.: Mind the time: Temporally-controlled multi-event video generation. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (2025)
[79] Xing, Y., He, Y., Tian, Z., Wang, X., Chen, Q.: Seeing and hearing: Open-domain visual-audio generation with diffusion latent aligners. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (2024)
[80] Xu, J., Guo, Z., Hu, H., Chu, Y., Wang, X., He, J., Wang, Y., Shi, X., He, T., Zhu, X., et al.: Qwen3-Omni technical report. arXiv preprint arXiv:2509.17765 (2025)

