pith. sign in

arxiv: 2606.29473 · v1 · pith:TZGGVWPFnew · submitted 2026-06-28 · 💻 cs.CV

MAVIN: Multi-Shot Audio-Visual Generation with Narrative Control

Pith reviewed 2026-06-30 07:15 UTC · model grok-4.3

classification 💻 cs.CV
keywords multi-shot audio-visual generationnarrative controlboundary-aware attentionID-aware propagationmulti-agent scripting pipelineMAVINSet datasetgenerative modelsfilmmaking workflows
0
0 comments X

The pith

MAVIN provides the first framework for multi-shot audio-visual generation with customized narrative control.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces MAVIN to enable coherent generation of multiple video shots with audio that follow a user-provided narrative. Current models have trouble keeping timing consistent across shots, maintaining the same characters, and fully scripting the content. MAVIN uses boundary-aware attention to respect shot timings, ID-aware propagation to keep identities fixed, and a multi-agent system to create detailed scripts from simple inputs. It also releases the MAVINSet dataset and reports superior results compared to prior methods.

Core claim

MAVIN is the first framework for multi-shot audio-visual generation with customized narrative control. To resolve temporal misalignment, we propose boundary-aware attention, which leverages hierarchical captions and boundary-aware token routing to render audio-visual elements within their respective temporal boundaries. To improve the controllability for multi-subject scenarios, we propose ID-aware propagation, utilizing identity embeddings and an identity-aware mask to bind specific identities to consistent visual appearances and vocal timbres. To provide comprehensive audio-visual narratives, we present a multi-agent scripting pipeline to transform free-form user inputs into hierarchical c

What carries the argument

Boundary-aware attention for temporal alignment combined with ID-aware propagation for identity consistency across shots.

If this is right

  • Resolves temporal misalignment using boundary-aware attention with hierarchical captions.
  • Improves controllability in multi-subject cases via ID-aware propagation with embeddings and masks.
  • Generates comprehensive narratives from free-form inputs using multi-agent scripting.
  • Achieves state-of-the-art performance on multi-shot audio-visual tasks.
  • Opens integration of generative models into professional filmmaking workflows.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The techniques could be applied to single-shot or longer-form video generation to improve consistency.
  • MAVINSet may become a standard benchmark for evaluating narrative coherence in generated media.
  • Combining this with real-time user feedback could lead to interactive storytelling tools.
  • Similar modular control strategies might address consistency issues in other generative domains like music or 3D animation.

Load-bearing premise

The boundary-aware attention, ID-aware propagation, and multi-agent scripting pipeline will successfully address temporal misalignment, limited controllability, and incomplete scripting.

What would settle it

A test on MAVINSet where generated sequences still exhibit audio-visual misalignment across shot boundaries or inconsistent character identities and timbres would show the proposed mechanisms fail to resolve the problems.

Figures

Figures reproduced from arXiv: 2606.29473 by Boxin Shi, Jing Wang, Kaiqi Liu, Kun Gai, Pengfei Wan, Qiulin Wang, Shuchen Weng, Xintao Wang, Yunyao Mao, Zheng Geng, Ziqi Cai.

Figure 1
Figure 1. Figure 1: Illustration of our MAVIN framework. First row: Leveraging boundary-aware attention, MAVIN enables precise temporal alignment for shot transitions, dialogue intervals, and narrative events. Second row: Through ID-aware propagation, MAVIN allows users to customize multiple subjects via reference images and audio, maintain￾ing identity consistency across complex cinematic narratives. Third row: With the mult… view at source ↗
Figure 2
Figure 2. Figure 2: Overview of our automatic data curation pipeline. (a) Shot grouping and filtering. We collect videos from diverse open-source datasets and online platforms. Raw videos are segmented into shots, grouped into multi-shot cinematic clips, and filtered for audio-visual aesthetics and clarity. (b) Hierarchical caption annotation. We generate multi-level captions to capture global storylines, shot-level events, a… view at source ↗
Figure 3
Figure 3. Figure 3: Given free-form user inputs (purple block), we first employ a multi-agent scripting pipeline (red block) to transform them into hierarchical captions (blue block), detailing complex narrative timelines with precise shot boundaries and dia￾logue intervals. Within the ID-aware propagation, input video clips and optional image references are compressed by a Video VAE into latent tokens (orange block). The aud… view at source ↗
Figure 4
Figure 4. Figure 4: Qualitative comparison with state-of-the-art relevant generation models. 5.4 Ablation Study As presented in Tab. 2, we ablate key components of our framework to eval￾uate their individual contributions and report these quantitative results. Since some ablations target customized generation capabilities, we conduct these ex￾periments with explicit video and audio references. W/o BTR (Boundary-aware Token Ro… view at source ↗
read the original abstract

While recent generative models produce high-fidelity videos, they struggle with the complex narrative control required for coherent multi-shot audio-visual generation. Existing methods suffer from temporal misalignment, limited controllability, and incomplete scripting. In this paper, we propose MAVIN, the first framework for multi-shot audio-visual generation with customized narrative control. To resolve temporal misalignment, we propose boundary-aware attention, which leverages hierarchical captions and boundary-aware token routing to render audio-visual elements within their respective temporal boundaries. To improve the controllability for multi-subject scenarios, we propose ID-aware propagation, utilizing identity embeddings and an identity-aware mask to bind specific identities to consistent visual appearances and vocal timbres. To provide comprehensive audio-visual narratives, we present a multi-agent scripting pipeline to transform free-form user inputs into hierarchical captions. Furthermore, we construct MAVINSet, a multi-shot audio-visual dataset for robust training and evaluation. Extensive experiments demonstrate that MAVIN achieves state-of-the-art performance, opening up a new avenue for integrating generative models into professional filmmaking workflows.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The manuscript proposes MAVIN as the first framework for multi-shot audio-visual generation with customized narrative control. It introduces boundary-aware attention (leveraging hierarchical captions and boundary-aware token routing) to address temporal misalignment, ID-aware propagation (using identity embeddings and an identity-aware mask) to improve controllability for multi-subject scenarios, a multi-agent scripting pipeline to generate hierarchical captions from free-form inputs, and the MAVINSet dataset for training and evaluation. The paper asserts that extensive experiments demonstrate state-of-the-art performance.

Significance. If the proposed components are shown to deliver the claimed improvements in temporal consistency, identity binding, and narrative completeness, the work could open practical pathways for controllable generative models in professional audio-visual production workflows.

major comments (3)
  1. [Abstract] Abstract: The claim that boundary-aware attention resolves temporal misalignment rests on the unverified assertion that 'boundary-aware token routing' enforces per-boundary isolation; no mechanism (e.g., hard masking, explicit boundary tokens, or leakage-prevention analysis) is described that would guarantee no cross-boundary leakage, which is load-bearing for the central misalignment-resolution claim.
  2. [Abstract] Abstract: The ID-aware propagation is asserted to 'bind specific identities to consistent visual appearances and vocal timbres' via an 'identity-aware mask,' yet no details are given on how the mask prevents identity drift across shots (soft weighting vs. hard enforcement), undermining the controllability claim for multi-subject scenarios.
  3. [Abstract] Abstract: The assertion of 'state-of-the-art performance' and 'extensive experiments' is made without reference to any metrics, baselines, quantitative results, error bars, or dataset statistics, making it impossible to evaluate whether the proposed methods actually outperform prior work.
minor comments (1)
  1. [Abstract] The abstract uses several technical terms (boundary-aware token routing, identity-aware mask, hierarchical captions) without initial definitions or forward references to where they are formalized in the manuscript.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback focused on the abstract. We agree that the abstract would benefit from additional detail on the mechanisms and results to better support its claims. We will revise the abstract accordingly while ensuring the changes remain concise. Below we respond to each major comment.

read point-by-point responses
  1. Referee: [Abstract] Abstract: The claim that boundary-aware attention resolves temporal misalignment rests on the unverified assertion that 'boundary-aware token routing' enforces per-boundary isolation; no mechanism (e.g., hard masking, explicit boundary tokens, or leakage-prevention analysis) is described that would guarantee no cross-boundary leakage, which is load-bearing for the central misalignment-resolution claim.

    Authors: We agree that the abstract, being concise, does not describe the implementation details of boundary-aware token routing or how it enforces isolation. The full manuscript provides these details in the methods section. To address the concern, we will revise the abstract to include a brief reference to the isolation mechanism as presented in the paper body. revision: yes

  2. Referee: [Abstract] Abstract: The ID-aware propagation is asserted to 'bind specific identities to consistent visual appearances and vocal timbres' via an 'identity-aware mask,' yet no details are given on how the mask prevents identity drift across shots (soft weighting vs. hard enforcement), undermining the controllability claim for multi-subject scenarios.

    Authors: We agree that the abstract does not specify how the identity-aware mask operates to prevent drift. The manuscript describes the identity embeddings and mask in the technical sections. We will revise the abstract to briefly clarify the mask's role in maintaining consistency across shots. revision: yes

  3. Referee: [Abstract] Abstract: The assertion of 'state-of-the-art performance' and 'extensive experiments' is made without reference to any metrics, baselines, quantitative results, error bars, or dataset statistics, making it impossible to evaluate whether the proposed methods actually outperform prior work.

    Authors: We agree that the abstract summarizes without citing specific metrics or results. The full manuscript presents the quantitative evaluation, baselines, metrics, error bars, and dataset statistics in the experiments section. We will revise the abstract to include a short reference to the key performance outcomes. revision: yes

Circularity Check

0 steps flagged

No circularity: framework components and dataset are independently specified

full rationale

The paper introduces boundary-aware attention, ID-aware propagation, a multi-agent scripting pipeline, and MAVINSet dataset as novel contributions without any equations, fitted parameters, or derivations that reduce to their own inputs. Claims of resolving misalignment and improving controllability are presented as design choices evaluated via experiments on the new dataset, with no self-citation load-bearing steps or renamings of prior results. The derivation chain is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No technical details, equations, or methods are available from the abstract, so the ledger is empty.

pith-pipeline@v0.9.1-grok · 5738 in / 1163 out tokens · 44359 ms · 2026-06-30T07:15:20.928261+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

90 extracted references · 33 canonical work pages · 16 internal anchors

  1. [1]

    In: Asian Conference on Computer Vision (2020)

    Bain, M., Nagrani, A., Brown, A., Zisserman, A.: Condensed movies: Story based retrieval with contextual embeddings. In: Asian Conference on Computer Vision (2020)

  2. [2]

    In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (2024)

    Cai, S., Ceylan, D., Gadelha, M., Huang, C.H.P., Wang, T.Y., Wetzstein, G.: Generative rendering: Controllable 4D-guided video generation with 2D diffusion models. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (2024)

  3. [3]

    In: European Conference on Computer Vision (2020)

    Chatterjee, M., Cherian, A.: Sound2Sight: Generating visual dynamics from sound and context. In: European Conference on Computer Vision (2020)

  4. [4]

    IEEE International Conference on Acoustics, Speech and Signal Processing (2020)

    Chen, H., Xie, W., Vedaldi, A., Zisserman, A.: VGGSound: A large-scale audio- visual dataset. IEEE International Conference on Acoustics, Speech and Signal Processing (2020)

  5. [5]

    Advances in Neural Information Processing Systems (2024)

    Chen, K., Li, X., Li, Y., Tong, Y., Wu, J., Zeng, Y., Zhang, J., Zhou, Q.: Mo- tionBooth: Motion-aware customized text-to-video generation. Advances in Neural Information Processing Systems (2024)

  6. [6]

    IEEE Journal of Selected Topics in Signal Processing (2022)

    Chen,S.,Wang,C.,Chen,Z.,Wu,Y.,Liu,S.,Chen,Z.,Li,J.,Kanda,N.,Yoshioka, T., Xiao, X., et al.: WavLM: Large-scale self-supervised pre-training for full stack speech processing. IEEE Journal of Selected Topics in Signal Processing (2022)

  7. [7]

    In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (2025)

    Chen, T.S., Siarohin, A., Menapace, W., Fang, Y., Lee, K.S., Skorokhodov, I., Aberman, K., Zhu, J.Y., Yang, M.H., Tulyakov, S.: Multi-subject open-set person- alization in video generation. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (2025)

  8. [8]

    In: Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (2025)

    Chen, Y., Niu, Z., Ma, Z., Deng, K., Wang, C., JianZhao, J., Yu, K., Chen, X.: F5-TTS: A fairytaler that fakes fluent and faithful speech with flow matching. In: Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (2025)

  9. [9]

    Cheng, H.K., Ishii, M., Hayakawa, A., Shibuya, T., Schwing, A., Mitsufuji, Y.: MMAudio: Taming multimodal joint training for high-quality video-to-audio syn- thesis.IEEE/CVFConferenceonComputerVisionandPatternRecognition(2025)

  10. [10]

    In: Asian Conference on Computer Vision (2016)

    Chung, J.S., Zisserman, A.: Out of time: automated lip sync in the wild. In: Asian Conference on Computer Vision (2016)

  11. [11]

    Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

    Comanici, G., Bieber, E., Schaekermann, M., Pasupat, I., Sachdeva, N., Dhillon, I., Blistein, M., Ram, O., Zhang, D., Rosen, E., et al.: Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. arXiv preprint arXiv:2507.06261 (2025)

  12. [12]

    In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (2019)

    Deng, J., Guo, J., Xue, N., Zafeiriou, S.: ArcFace: Additive angular margin loss for deep face recognition. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (2019)

  13. [13]

    CosyVoice: A Scalable Multilingual Zero-shot Text-to-speech Synthesizer based on Supervised Semantic Tokens

    Du, Z., Chen, Q., Zhang, S., Hu, K., Lu, H., Yang, Y., Hu, H., Zheng, S., Gu, Y., Ma, Z., et al.: CosyVoice: A scalable multilingual zero-shot text-to-speech synthe- sizer based on supervised semantic tokens. arXiv preprint arXiv:2407.05407 (2024)

  14. [14]

    ElevenLabs: Introducing Scribe v2.https://elevenlabs.io/blog/introducing- scribe-v2(2026), accessed: 2026-03-05

  15. [15]

    In: IEEE International Conference on Acoustics, Speech and Signal Processing (2023)

    Elizalde, B., Deshmukh, S., Al Ismail, M., Wang, H.: CLAP: learning audio con- cepts from natural language supervision. In: IEEE International Conference on Acoustics, Speech and Signal Processing (2023)

  16. [16]

    Seedance 1.0: Exploring the Boundaries of Video Generation Models

    Gao, Y., Guo, H., Hoang, T., Huang, W., Jiang, L., Kong, F., Li, H., Li, J., Li, L., Li, X., et al.: Seedance 1.0: Exploring the boundaries of video generation models. arXiv preprint arXiv:2506.09113 (2025) 26 K. Liu et al

  17. [17]

    Gemmeke, J.F., Ellis, D.P., Freedman, D., Jansen, A., Lawrence, W., Moore, R.C., Plakal, M., Ritter, M.: Audio set: An ontology and human-labeled dataset for audio events.In:IEEEinternationalconferenceonacoustics,speechandsignalprocessing (2017)

  18. [18]

    IEEE/CVF Conference on Computer Vision and Pattern Recognition (2025)

    Geng, D., Herrmann, C., Hur, J., Cole, F., Zhang, S., Pfaff, T., Lopez-Guevara, T., Aytar, Y., Rubinstein, M., Sun, C., Wang, O., Owens, A., Sun, D.: Motion Prompt- ing: Controlling video generation with motion trajectories. IEEE/CVF Conference on Computer Vision and Pattern Recognition (2025)

  19. [19]

    arXiv preprint arXiv:2406.10221 (2025)

    Ghermi, R., Wang, X., Kalogeiton, V., Laptev, I.: Long Story Short: Story-level video understanding from 20k short films. arXiv preprint arXiv:2406.10221 (2025)

  20. [20]

    In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (2023)

    Girdhar, R., El-Nouby, A., Liu, Z., Singh, M., Alwala, K.V., Joulin, A., Misra, I.: Imagebind: One embedding space to bind them all. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (2023)

  21. [21]

    LTX-2: Efficient Joint Audio-Visual Foundation Model

    HaCohen, Y., Brazowski, B., Chiprut, N., Bitterman, Y., Kvochko, A., Berkowitz, A., Shalem, D., Lifschitz, D., Moshe, D., Porat, E., et al.: LTX-2: Efficient joint audio-visual foundation model. arXiv preprint arXiv:2601.03233 (2026)

  22. [22]

    In: International Conference on Computer Vision (2025)

    Haji-Ali, M., Menapace, W., Siarohin, A., Skorokhodov, I., Canberk, A., Lee, K.S., Ordonez, V., Tulyakov, S.: AV-Link: Temporally-aligned diffusion features for cross-modal audio-video generation. In: International Conference on Computer Vision (2025)

  23. [23]

    In: ACM SIGGRAPH Asia Conference Papers (2025)

    He,J.,Liu,H.,Li,J.,Huang,Z.,Yu,Q.,Ouyang,W.,Liu,Z.:Cut2next:Generating next shot via in-context tuning. In: ACM SIGGRAPH Asia Conference Papers (2025)

  24. [24]

    IEEE/CVF Conference on Computer Vision and Pattern Recognition (2025)

    Henschel, R., Khachatryan, L., Poghosyan, H., Hayrapetyan, D., Tadevosyan, V., Wang, Z., Navasardyan, S., Shi, H.: Streamingt2v: Consistent, dynamic, and ex- tendable long video generation from text. IEEE/CVF Conference on Computer Vision and Pattern Recognition (2025)

  25. [25]

    In: International Conference on Learning Representations (2023)

    Hong, W., Ding, M., Zheng, W., Liu, X., Tang, J.: CogVideo: Large-scale pretrain- ing for text-to-video generation via transformers. In: International Conference on Learning Representations (2023)

  26. [26]

    Qwen3-TTS Technical Report

    Hu, H., Zhu, X., He, T., Guo, D., Zhang, B., Wang, X., Guo, Z., Jiang, Z., Hao, H., Guo, Z., et al.: Qwen3-TTS technical report. arXiv preprint arXiv:2601.15621 (2026)

  27. [27]

    arXiv preprint arXiv:2410.23775 (2024)

    Huang,L.,Wang,W.,Wu,Z.F.,Shi,Y.,Dou,H.,Liang,C.,Feng,Y.,Liu,Y.,Zhou, J.: In-context lora for diffusion transformers. arXiv preprint arXiv:2410.23775 (2024)

  28. [28]

    In: IEEE/CVF Conference on Computer Vision and Pattern Recog- nition (2024)

    Huang, Z., He, Y., Yu, J., Zhang, F., Si, C., Jiang, Y., Zhang, Y., Wu, T., Jin, Q., Chanpaisit, N., et al.: VBench: Comprehensive benchmark suite for video genera- tive models. In: IEEE/CVF Conference on Computer Vision and Pattern Recog- nition (2024)

  29. [29]

    In: Proceedings of the IEEE/CVF International Conference on Computer Vision (2025)

    Jiang, Z., Han, Z., Mao, C., Zhang, J., Pan, Y., Liu, Y.: VACE: All-in-one video creation and editing. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (2025)

  30. [30]

    arXiv preprint arXiv:2403.03100 (2024)

    Ju, Z., Wang, Y., Shen, K., Tan, X., Xin, D., Yang, D., Liu, Y., Leng, Y., Song, K., Tang, S., et al.: NaturalSpeech 3: Zero-shot speech synthesis with factorized codec and diffusion models. arXiv preprint arXiv:2403.03100 (2024)

  31. [31]

    IEEE/CVF Conference on Computer Vision and Pattern Recognition (2025) MAVIN: Multi-Shot Audio-Visual Generation 27

    Kara, O., Singh, K.K., Liu, F., Ceylan, D., Rehg, J.M., Hinz, T.: Shotadapter: Text-to-multi-shot video generation with diffusion models. IEEE/CVF Conference on Computer Vision and Pattern Recognition (2025) MAVIN: Multi-Shot Audio-Visual Generation 27

  32. [32]

    Fr\'echet Audio Distance: A Metric for Evaluating Music Enhancement Algorithms

    Kilgour, K., Zuluaga, M., Roblek, D., Sharifi, M.: Fr\’echet audio distance: A met- ric for evaluating music enhancement algorithms. arXiv preprint arXiv:1812.08466 (2018)

  33. [33]

    IEEE/CVF Conference on Computer Vision and Pattern Recog- nition (2025)

    Kushwaha, S.S., Tian, Y.: VinTAGe: Joint video and text conditioning for holistic audio generation. IEEE/CVF Conference on Computer Vision and Pattern Recog- nition (2025)

  34. [34]

    IEEE/CVF Conference on Computer Vision and Pattern Recognition (2025)

    Lei, G., Wang, C., Zhang, R., Wang, Y., Li, H., Xu, W.: AnimateAnything: Con- sistent and controllable animation for video generation. IEEE/CVF Conference on Computer Vision and Pattern Recognition (2025)

  35. [35]

    IEEE/CVF Conference on Computer Vision and Pattern Recognition (2025)

    Li, H., Xu, M., Zhan, Y., Mu, S., Li, J., Cheng, K., Chen, Y., Chen, T., Ye, M., Wang, J., Zhu, S.: Openhumanvid: A large-scale high-quality dataset for enhancing human-centric video generation. IEEE/CVF Conference on Computer Vision and Pattern Recognition (2025)

  36. [36]

    In: International Conference on Computer Vision (2025)

    Li, Q., Xing, Z., Wang, R., Zhang, H., Dai, Q., Wu, Z.: MagicMotion: Control- lable video generation with dense-to-sparse trajectory guidance. In: International Conference on Computer Vision (2025)

  37. [37]

    In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (2025)

    Li, Z., Rahmani, H., Ke, Q., Liu, J.: LongDiff: Training-free long video generation in one go. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (2025)

  38. [38]

    In: Conference on Language Modeling (2024)

    Lin, H., Zala, A., Cho, J., Bansal, M.: VideoDirectorGPT: Consistent multi-scene video generation via llm-guided planning. In: Conference on Language Modeling (2024)

  39. [39]

    Flow Matching for Generative Modeling

    Lipman, Y., Chen, R.T., Ben-Hamu, H., Nickel, M., Le, M.: Flow matching for generative modeling. arXiv preprint arXiv:2210.02747 (2022)

  40. [40]

    arXiv preprint arXiv:2301.12503 (2023)

    Liu, H., Chen, Z., Yuan, Y., Mei, X., Liu, X., Mandic, D., Wang, W., Plumbley, M.D.: AudioLDM: Text-to-audio generation with latent diffusion models. arXiv preprint arXiv:2301.12503 (2023)

  41. [41]

    arXiv preprint arXiv:2503.23377 (2025)

    Liu, K., Li, W., Chen, L., Wu, S., Zheng, Y., Ji, J., Zhou, F., Luo, J., Liu, Z., Fei, H., Chua, T.S.: JavisDiT: Joint audio-video diffusion transformer with hierarchical spatio-temporal prior synchronization. arXiv preprint arXiv:2503.23377 (2025)

  42. [42]

    arXiv preprint arXiv:2602.19163 (2026)

    Liu, K., Zheng, Y., Wang, K., Wu, S., Zhang, R., Luo, J., Hatzinakos, D., Liu, Z., Fei, H., Chua, T.S.: JavisDiT++: Unified modeling and optimization for joint audio-video generation. arXiv preprint arXiv:2602.19163 (2026)

  43. [43]

    In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (2025)

    Liu, L., Ma, T., Li, B., Chen, Z., Liu, J., Li, G., Zhou, S., He, Q., Wu, X.: Phantom: Subject-consistent video generation via cross-modal alignment. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (2025)

  44. [44]

    European Conference on Computer Vision (2024)

    Long, F., Qiu, Z., Yao, T., Mei, T.: VideoStudio: Generating consistent-content and multi-scene videos. European Conference on Computer Vision (2024)

  45. [45]

    Decoupled Weight Decay Regularization

    Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017)

  46. [46]

    arXiv (2025)

    Low, C., Wang, W., Katyal, C.: Ovi: Twin backbone cross-modal fusion for audio- video generation. arXiv (2025)

  47. [47]

    Advances in Neural Information Processing Systems (2023)

    Luo, S., Yan, C., Hu, C., Zhao, H.: Diff-Foley: Synchronized video-to-audio syn- thesis with latent diffusion models. Advances in Neural Information Processing Systems (2023)

  48. [48]

    Advances in Neural Information Pro- cessing Systems (2024)

    Montanaro, A., Savant Aira, L., Aiello, E., Valsesia, D., Magli, E.: Motioncraft: Physics-based zero-shot video generation. Advances in Neural Information Pro- cessing Systems (2024)

  49. [49]

    arXiv preprint arXiv:2601.01568 (2026) 28 K

    Qiang, C., Wang, J., Wang, X., Yin, K., Guo, Y.: MM-Sonate: Multimodal controllable audio-video generation with zero-shot voice cloning. arXiv preprint arXiv:2601.01568 (2026) 28 K. Liu et al

  50. [50]

    In: International Conference on Machine Learning (2023)

    Radford,A.,Kim,J.W.,Xu,T.,Brockman,G.,McLeavey,C.,Sutskever,I.:Robust speech recognition via large-scale weak supervision. In: International Conference on Machine Learning (2023)

  51. [51]

    SAM 2: Segment Anything in Images and Videos

    Ravi, N., Gabeur, V., Hu, Y.T., Hu, R., Ryali, C., Ma, T., Khedr, H., Rädle, R., Rolland, C., Gustafson, L., et al.: SAM 2: Segment anything in images and videos. arXiv preprint arXiv:2408.00714 (2024)

  52. [52]

    IEEE/CVF Conference on Computer Vision and Pattern Recognition (2023)

    Ruan, L., Ma, Y., Yang, H., He, H., Liu, B., Fu, J., Yuan, N.J., Jin, Q., Guo, B.: MM-Diffusion: Learning multi-modal diffusion models for joint audio and video generation. IEEE/CVF Conference on Computer Vision and Pattern Recognition (2023)

  53. [53]

    com / christophschuhmann/improved-aesthetic-predictor(2022)

    Schuhmann, C.: Improved aesthetic predictor.https : / / github . com / christophschuhmann/improved-aesthetic-predictor(2022)

  54. [54]

    Hunyuanvideo- foley: Multimodal diffusion with representation alignment for high-fidelity foley audio generation.arXiv preprint arXiv:2508.16930,

    Shan, S., Li, Q., Cui, Y., Yang, M., Wang, Y., Yang, Q., Zhou, J., Zhong, Z.: HunyuanVideo-foley: Multimodal diffusion with representation alignment for high- fidelity foley audio generation. arXiv preprint arXiv:2508.16930 (2025)

  55. [55]

    arXiv preprint arXiv:2512.18099 (2025)

    Shi, B., Tjandra, A., Hoffman, J., Wang, H., Wu, Y.C., Gao, L., Richter, J., Le, M., Vyas, A., Chen, S., et al.: SAM Audio: Segment anything in audio. arXiv preprint arXiv:2512.18099 (2025)

  56. [56]

    In: Cai, J., Kankanhalli, M.S., Prabhakaran, B., Boll, S., Subramanian, R., Zheng, L., Singh, V.K., César, P., Xie, L., Xu, D

    Soucek, T., Lokoc, J.: Transnet V2: an effective deep network architecture for fast shot transition detection. In: Cai, J., Kankanhalli, M.S., Prabhakaran, B., Boll, S., Subramanian, R., Zheng, L., Singh, V.K., César, P., Xie, L., Xu, D. (eds.) ACM International Conference on Multimedia (2024)

  57. [57]

    ACM Interna- tional Conference on Multimedia (2024)

    Sun, M., Wang, W., Qiao, Y., Sun, J., Qin, Z., Guo, L., Zhu, X., Liu, J.: Mm-ldm: Multi-modal latent diffusion model for sounding video generation. ACM Interna- tional Conference on Multimedia (2024)

  58. [58]

    arXiv preprint arXiv:2406.16260 (2024)

    Tan, Z., Yang, X., Liu, S., Wang, X.: Video-infinity: Distributed long video gener- ation. arXiv preprint arXiv:2406.16260 (2024)

  59. [59]

    arXiv preprint arXiv:2602.08794 (2026)

    Team, O., Yu, D., Chen, M., Chen, Q., Luo, Q., Wu, Q., Cheng, Q., Li, R., Liang, T., Zhang, W., et al.: MOVA: Towards scalable and synchronized video-audio gen- eration. arXiv preprint arXiv:2602.08794 (2026)

  60. [60]

    In: European Conference on Computer Vision (2020)

    Teed, Z., Deng, J.: RAFT: Recurrent all-pairs field transforms for optical flow. In: European Conference on Computer Vision (2020)

  61. [61]

    Meta Audiobox Aesthetics: Unified Automatic Quality Assessment for Speech, Music, and Sound

    Tjandra, A., Wu, Y.C., Guo, B., Hoffman, J., Ellis, B., Vyas, A., Shi, B., Chen, S., Le, M., Zacharov, N., Wood, C., Lee, A., Hsu, W.N.: Meta Audiobox Aesthetics: Unified automatic quality assessment for speech, music, and sound. arXiv preprint arXiv:2502.05139 (2025)

  62. [62]

    Towards Accurate Generative Models of Video: A New Metric & Challenges

    Unterthiner, T., Van Steenkiste, S., Kurach, K., Marinier, R., Michalski, M., Gelly, S.: Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:1812.01717 (2018)

  63. [63]

    Wan: Open and Advanced Large-Scale Video Generative Models

    Wan, T., Wang, A., Ai, B., Wen, B., Mao, C., Xie, C.W., Chen, D., Yu, F., Zhao, H., Yang, J., et al.: Wan: Open and advanced large-scale video generative models. arXiv preprint arXiv:2503.20314 (2025)

  64. [64]

    arXiv preprint arXiv:2509.06155 (2025)

    Wang, D., Zuo, W., Li, A., Chen, L.H., Liao, X., Zhou, D., Yin, Z., Dai, X., Jiang, D., Yu, G.: Universe-1: Unified audio-video generation via stitching of experts. arXiv preprint arXiv:2509.06155 (2025)

  65. [65]

    In: Advances in Neural Informa- tion Processing Systems (2025)

    Wang, J., Sheng, H., Cai, S., Zhang, W., Yan, C., Feng, Y., Deng, B., Ye, J.: EchoShot: Multi-shot portrait video generation. In: Advances in Neural Informa- tion Processing Systems (2025)

  66. [66]

    arXiv e-prints pp

    Wang, J., Qiang, C., Guo, Y., Wang, Y., Zeng, X., Deng, F.: Apollo: Unified multi- task audio-video joint generation. arXiv e-prints pp. arXiv–2601 (2026) MAVIN: Multi-Shot Audio-Visual Generation 29

  67. [67]

    ACM International Conference on Multimedia (2025)

    Wang, K., Deng, S., Shi, J., Hatzinakos, D., Tian, Y.: Av-dit: Taming image diffu- sion transformers for efficient joint audio and video generation. ACM International Conference on Multimedia (2025)

  68. [68]

    InternVid: A Large-scale Video-Text Dataset for Multimodal Understanding and Generation

    Wang, Y., He, Y., Li, Y., Li, K., Yu, J., Ma, X., Li, X., Chen, G., Chen, X., Wang, Y., et al.: InternVid: A large-scale video-text dataset for multimodal understanding and generation. arXiv preprint arXiv:2307.06942 (2023)

  69. [69]

    ACM SIG- GRAPH Conference Papers (2024)

    Wang, Z., Yuan, Z., Wang, X., Li, Y., Chen, T., Xia, M., Luo, P., Shan, Y.: Mo- tionCtrl: A unified and flexible motion controller for video generation. ACM SIG- GRAPH Conference Papers (2024)

  70. [70]

    In: Advances in Neural Information Processing Systems (2025)

    Weng, S., Zheng, H., Chang, Z., Li, S., Shi, B., Wang, X.: Audio-sync video gen- eration with multi-stream temporal control. In: Advances in Neural Information Processing Systems (2025)

  71. [71]

    wiseman: py-webrtcvad.https://github.com/wiseman/py-webrtcvad(2016)

  72. [72]

    HunyuanVideo 1.5 Technical Report

    Wu, B., Zou, C., Li, C., Huang, D., Yang, F., Tan, H., Peng, J., Wu, J., Xiong, J., Jiang, J., et al.: HunyuanVideo 1.5 technical report. arXiv preprint arXiv:2511.18870 (2025)

  73. [73]

    Wu, C., Li, J., Zhou, J., Lin, J., Gao, K., Yan, K., ming Yin, S., Bai, S., Xu, X., Chen, Y., Chen, Y., Tang, Z., Zhang, Z., Wang, Z., Yang, A., Yu, B., Cheng, C., Liu, D., Li, D., Zhang, H., Meng, H., Wei, H., Ni, J., Chen, K., Cao, K., Peng, L., Qu, L., Wu, M., Wang, P., Yu, S., Wen, T., Feng, W., Xu, X., Wang, Y., Zhang, Y., Zhu, Y., Wu, Y., Cai, Y., L...

  74. [74]

    IEEE/CVF Con- ference on Computer Vision and Pattern Recognition (2022)

    Wu, H., Chen, K., Luo, Y., Qiao, R., Ren, B., Liu, H., Xie, W., Shen, L.: Scene consistency representation learning for video scene segmentation. IEEE/CVF Con- ference on Computer Vision and Pattern Recognition (2022)

  75. [75]

    IEEE/CVF Conference on Computer Vision and Pattern Recognition (2025)

    Wu, W., Liu, M., Zhu, Z., Xia, X., Feng, H., Wang, W., Lin, K.Q., Shen, C., Shou, M.Z.: Moviebench: A hierarchical movie level dataset for long video generation. IEEE/CVF Conference on Computer Vision and Pattern Recognition (2025)

  76. [76]

    arXiv preprint arXiv:2503.07314 (2025)

    Wu, W., Zhu, Z., Shou, M.Z.: Automated movie generation via multi-agent CoT planning. arXiv preprint arXiv:2503.07314 (2025)

  77. [77]

    arXiv preprint arXiv:2508.11484 (2025)

    Wu, X., Gao, B., Qiao, Y., Wang, Y., Chen, X.: CineTrans: Learning to gener- ate videos with cinematic transitions via masked diffusion models. arXiv preprint arXiv:2508.11484 (2025)

  78. [78]

    In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (2025)

    Wu, Z., Siarohin, A., Menapace, W., Skorokhodov, I., Fang, Y., Chordia, V., Gilitschenski, I., Tulyakov, S.: Mind the time: Temporally-controlled multi-event video generation. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (2025)

  79. [79]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (2024)

    Xing, Y., He, Y., Tian, Z., Wang, X., Chen, Q.: Seeing and hearing: Open- domain visual-audio generation with diffusion latent aligners. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (2024)

  80. [80]

    Qwen3-Omni Technical Report

    Xu, J., Guo, Z., Hu, H., Chu, Y., Wang, X., He, J., Wang, Y., Shi, X., He, T., Zhu, X., et al.: Qwen3-Omni technical report. arXiv preprint arXiv:2509.17765 (2025)

Showing first 80 references.