pith. machine review for the scientific record.

arxiv: 2605.07061 · v1 · submitted 2026-05-08 · 💻 cs.SD · cs.AI · cs.CV · cs.MM

Recognition: 2 theorem links · Lean Theorem

Do Joint Audio-Video Generation Models Understand Physics?

Authors on Pith: no claims yet

Pith reviewed 2026-05-11 02:08 UTC · model grok-4.3

classification 💻 cs.SD · cs.AI · cs.CV · cs.MM
keywords audio-video generation · physical commonsense · benchmark · joint multimodal models · scene transitions · cross-modal consistency · physics violations

The pith

Joint audio-video generation models produce plausible content but fail to maintain real-world physical consistency across transitions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces AV-Phys Bench to test whether joint audio-video models understand physics or merely produce statistically likely outputs. It assesses models on steady-state scenes, event transitions, environment changes, and prompts that deliberately request physically impossible audio-video behavior. Even the strongest models show sharp performance drops on dynamic transitions and break down entirely on anti-physics requests, exposing gaps in cross-modal consistency. This evaluation matters because these systems are approaching professional production quality, where violations of physical rules would limit their reliability in real applications. The authors also provide an automated ReAct-style agent that combines a multimodal language model with acoustic measurement tools and produces rankings close to human judgments.
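
The AV-Phys Agent itself is not reproduced in this review, so the loop below is only a minimal sketch of what a ReAct-style evaluator combining a multimodal language model with deterministic acoustic tools could look like; the call_mllm helper, the tool names, and the stopping condition are illustrative assumptions, not the authors' implementation.

```python
# Sketch of a ReAct-style evaluation loop (illustrative; not the authors' code).
from typing import Callable, Dict

def call_mllm(transcript: str) -> str:
    """Placeholder for a multimodal LLM call that returns the next reasoning step."""
    raise NotImplementedError

TOOLS: Dict[str, Callable[[str], str]] = {
    # Deterministic acoustic measurements the agent can request by name.
    "measure_loudness": lambda clip: "loudness profile (dB over time)",
    "estimate_pitch_trend": lambda clip: "per-onset fundamental frequency estimates",
}

def evaluate_rubric_item(clip_path: str, rubric_question: str, max_steps: int = 6) -> str:
    """Alternate Thought / Action / Observation until the agent emits a verdict."""
    transcript = f"Question: {rubric_question}\nClip: {clip_path}\n"
    for _ in range(max_steps):
        step = call_mllm(transcript)        # e.g. "Thought: ... Action: estimate_pitch_trend"
        transcript += step + "\n"
        if "Final Answer:" in step:         # agent has scored this rubric item
            return step.split("Final Answer:")[-1].strip()
        for name, tool in TOOLS.items():    # run any acoustic tool the agent requested
            if f"Action: {name}" in step:
                transcript += f"Observation: {tool(clip_path)}\n"
    return "unresolved"
```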

Core claim

Joint audio-video generation models do not possess robust understanding of audio-visual physics. They perform adequately on steady-state scenes but degrade markedly on event-driven and environment-driven transitions. Leading proprietary and open-source systems all collapse when presented with Anti-AV-Physics prompts that request physically inconsistent audio-video pairings, indicating reliance on pattern matching rather than physical principles.

What carries the argument

AV-Phys Bench, a benchmark built from three scene categories (Steady State, Event Transition, Environment Transition), Anti-AV-Physics prompts, and five evaluation dimensions covering semantic adherence and physical commonsense.
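
As a reading aid, one benchmark item can be pictured as a small record carrying its category, an anti-physics flag, and the five rubric scores; the field names below are assumptions for illustration, since the benchmark's actual schema is not reproduced here.

```python
# Illustrative schema for one AV-Phys Bench item; field names are assumed, not official.
from dataclasses import dataclass, field
from typing import Dict, Literal

Category = Literal["steady_state", "event_transition", "environment_transition"]

@dataclass
class BenchItem:
    prompt: str                      # text prompt given to the generation model
    category: Category               # one of the three scene categories
    anti_physics: bool = False       # True for Anti-AV-Physics prompts
    # Five rubric dimensions: semantic adherence (SA) and physical commonsense (PC)
    # over video (V), audio (A), and cross-modal alignment (AV), as in Figure 4.
    scores: Dict[str, float] = field(default_factory=lambda: {
        "V-SA": 0.0, "A-SA": 0.0, "V-PC": 0.0, "A-PC": 0.0, "AV-PC": 0.0,
    })

item = BenchItem(
    prompt="A mallet strikes a xylophone from the longest bar to the shortest.",
    category="event_transition",
)
```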

Load-bearing premise

The five evaluation dimensions together give a sufficient and unbiased measure of physical understanding without needing further human validation or external physics simulators.

What would settle it

Running the top models on the Anti-AV-Physics prompts and finding that their generated audio and video both remain internally consistent with real physics, contrary to the benchmark scores, would challenge the conclusion that they lack robust physical understanding.

Figures

Figures reproduced from arXiv: 2605.07061 by Chenming Ge, Feiyu Du, Hao Fang, Jiageng Liu, Mingwei Xu, Shijian Deng, Weiguo Pian, Xiulong Liu, Yapeng Tian, Zexin Xu, Zijun Cui.

Figure 1: In the physical world, vision and sound are two observations of the same physical event.
Figure 2: AV-Phys Bench construction and evaluation pipeline. (a) A physics-grounded taxonomy organizes prompts by how the underlying physics evolve within a clip. Human-in-the-loop curation produces prompts that encode specific, verifiable acoustic outcomes, and each prompt is paired with a five-dimension evaluation rubric covering semantic adherence (SA) and physical commonsense (PC) across video (V), audio (A), a…
Figure 3: AV-Phys Bench’s three scene categories of physics-following prompts, with a per-category …
Figure 4: Worked rubric example for a Category-3 Environment Transition prompt. Top: four frames from the Seedance 2.0 generation showing the outdoor-to-indoor transition through a heavy door. Bottom: the prompt-specific rubric used for evaluation. V-SA and A-SA check whether the described visual entities and sounds are present. V-PC, A-PC, and AV-PC check whether the generated video, audio, and their alignment obey…
Figure 5: Seedance 2.0, Event Transition. The clip strikes the xylophone bars from the longest to the …
Figure 6: Kling 3.0 Omni, Environment Transition. The clip captures the speaker submerging mid …
Figure 7: Veo 3.1, Event Transition. The clip places a large dog on the left and a small dog on the …
Figure 8: LTX-2.3, Environment Transition. The clip cuts from a packed stadium interior to an …
Figure 9: Ovi, Environment Transition. The fire truck is visible only in the first frame and then …
Figure 10: JavisDiT++, Event Transition. The clip shows a person knocking gently three times and …
Figure 11: MagiHuman, Event Transition. The clip shows a guitarist plucking a string but never …
Figure 12: Seedance 2.0, Steady State Anti-Physics. The model paints the gym floor with a thin layer …
Figure 13: Seedance 2.0, Event Transition Anti-Physics. With the handheld microphone held close …
Figure 14: Seedance 2.0, Environment Transition Anti-Physics. Before the helmet, the voice is clear …
Figure 15: The human evaluation interface used to collect the labels in Section 3.3. The header …
read the original abstract

Joint audio-video generation models are rapidly approaching professional production quality, raising a central question: do they understand audio-visual physics, or merely generate plausible sounds and frames that violate real-world consistency? We introduce AV-Phys Bench, a benchmark for evaluating physical commonsense in joint audio-video generation. AV-Phys Bench tests models across three scene categories: Steady State, Event Transition, and Environment Transition. It covers physics-grounded subcategories drawn from real-world scenes, plus Anti-AV-Physics prompts that deliberately request physically inconsistent audio-video behavior. Each generation is evaluated along five dimensions: visual semantic adherence, audio semantic adherence, visual physical commonsense, audio physical commonsense, and cross-modal physical commonsense. Across three proprietary and four open-source models, we find that Seedance 2.0 performs best overall, but all models remain far from robust physical understanding. Performance drops sharply on event-driven and environment-driven transitions, and even strong proprietary systems collapse on Anti-AV-Physics prompts. We further introduce AV-Phys Agent, a ReAct-style evaluator that combines a multimodal language model with deterministic acoustic measurement tools, producing rankings that closely align with human ratings. Our results identify cross-modal physical consistency and transition-driven scene dynamics as key open challenges for joint audio-video generation.
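
The "sharp drop" the abstract reports is a comparison of aggregate scores by scene category. A minimal sketch of that aggregation, assuming per-prompt scores in the five dimensions and a plain unweighted mean (the paper's exact weighting is not given here; the numbers below are made up purely for illustration):

```python
# Mean score per scene category and the drop relative to Steady State.
from collections import defaultdict
from statistics import mean

DIMS = ["V-SA", "A-SA", "V-PC", "A-PC", "AV-PC"]

def category_means(rows):
    """rows: iterable of (category, {dimension: score in [0, 1]}) pairs."""
    buckets = defaultdict(list)
    for category, scores in rows:
        buckets[category].append(mean(scores[d] for d in DIMS))
    return {cat: mean(vals) for cat, vals in buckets.items()}

def transition_drop(per_category, baseline="steady_state"):
    base = per_category[baseline]
    return {cat: base - score for cat, score in per_category.items() if cat != baseline}

rows = [  # made-up example scores, only to show the computation
    ("steady_state", {"V-SA": 0.9, "A-SA": 0.8, "V-PC": 0.85, "A-PC": 0.8, "AV-PC": 0.75}),
    ("event_transition", {"V-SA": 0.8, "A-SA": 0.6, "V-PC": 0.6, "A-PC": 0.5, "AV-PC": 0.45}),
]
print(transition_drop(category_means(rows)))  # {'event_transition': ~0.23}
```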

Editorial analysis

A structured set of objections, weighed in public.

A referee report, simulated author's rebuttal, circularity audit, and axiom ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

4 major / 2 minor

Summary. The paper introduces AV-Phys Bench, a benchmark for evaluating physical commonsense in joint audio-video generation models. It categorizes scenes into Steady State, Event Transition, and Environment Transition, includes Anti-AV-Physics prompts, and scores generations along five dimensions (visual/audio semantic adherence, visual/audio physical commonsense, cross-modal physical commonsense) using the AV-Phys Agent (a ReAct-style multimodal LLM augmented with deterministic acoustic tools). Experiments on three proprietary and four open-source models show Seedance 2.0 performing best overall, with sharp drops on transition scenes and Anti-AV-Physics prompts, and report that agent rankings align with human ratings. The central claim is that current models lack robust audio-visual physical understanding.

Significance. If the benchmark and agent are shown to be reliable, the work would usefully identify cross-modal consistency and transition dynamics as open challenges for joint AV generation, providing a concrete evaluation framework that could steer future model training and architecture choices in multimodal synthesis.

major comments (4)
  1. [Section 3] Section 3 (Benchmark Construction): The description of prompt sourcing, subcategory selection from real-world scenes, and exact criteria for Anti-AV-Physics prompts lacks sufficient detail on curation process, diversity metrics, or potential biases, which directly affects the validity of the reported performance drops on Event/Environment Transitions.
  2. [Section 4.2] Section 4.2 (Evaluation Dimensions): The five dimensions are presented as jointly measuring physical understanding, yet no quantitative grounding (e.g., comparison of generated clips against physics simulators for trajectory conservation or wave propagation) or ablation is provided to show they are necessary and sufficient; this is load-bearing for the claim that all models 'remain far from robust physical understanding.'
  3. [Section 5] Section 5 (Results and Agent Validation): While agent-human alignment is claimed, no inter-rater agreement statistics, statistical significance tests on model differences, or confidence intervals are reported; the assertion that Seedance 2.0 is best overall and that proprietary models collapse on Anti-AV-Physics prompts therefore rests on unquantified differences.
  4. [Section 4.3] Section 4.3 (AV-Phys Agent): The ReAct-style agent relies on LLM judgments for visual/audio physical commonsense and cross-modal consistency without external physics-based verification (e.g., extracting 3D motion or acoustic spectra and checking against conservation equations), leaving open whether low scores reflect genuine violations or evaluator heuristics.
minor comments (2)
  1. [Section 4] Notation for the five evaluation dimensions is introduced without a clear summary table, making it hard to track which dimension drives the reported failures on transitions.
  2. [Abstract and Section 1] The abstract and introduction use 'robust physical understanding' without a precise operational definition tied to the benchmark, which should be clarified early.
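
The referee's third major comment asks for agreement and significance statistics that the paper does not report. A minimal sketch of how such numbers are conventionally computed, using standard libraries and made-up arrays (the actual human labels and model scores are not available in this review):

```python
# Inter-rater agreement, paired significance test, and a bootstrap CI (illustrative inputs).
import numpy as np
from scipy.stats import wilcoxon
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# Human labels: rows = clips, columns = raters, values = ordinal ratings (made up).
labels = np.array([[4, 4, 5], [2, 3, 2], [5, 5, 4], [1, 1, 2]])
counts, _ = aggregate_raters(labels)          # per-clip counts of each rating category
print("Fleiss' kappa:", fleiss_kappa(counts))

# Paired per-prompt scores for two models on the same prompts (made up).
model_a = np.array([0.81, 0.64, 0.72, 0.55, 0.90, 0.47])
model_b = np.array([0.70, 0.60, 0.69, 0.50, 0.88, 0.42])
stat, p = wilcoxon(model_a, model_b)
print(f"Wilcoxon signed-rank: W={stat:.2f}, p={p:.3f}")

# 95% bootstrap confidence interval on the mean paired difference.
rng = np.random.default_rng(0)
diffs = model_a - model_b
boot = [rng.choice(diffs, size=len(diffs), replace=True).mean() for _ in range(10_000)]
print("95% CI on mean difference:", np.percentile(boot, [2.5, 97.5]))
```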

Simulated Author's Rebuttal

4 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. We address each major comment point by point below, providing clarifications and outlining planned revisions to enhance the rigor, transparency, and validity of the benchmark and evaluations.

read point-by-point responses
  1. Referee: [Section 3] Section 3 (Benchmark Construction): The description of prompt sourcing, subcategory selection from real-world scenes, and exact criteria for Anti-AV-Physics prompts lacks sufficient detail on curation process, diversity metrics, or potential biases, which directly affects the validity of the reported performance drops on Event/Environment Transitions.

    Authors: We agree that expanded details on curation will strengthen the paper's validity. In the revised manuscript, we will augment Section 3 with: (1) explicit sources for real-world scene subcategories (e.g., referenced physics video corpora and textbooks), (2) quantitative diversity metrics including counts of unique physics phenomena, scene complexity distributions, and subcategory balance, and (3) precise criteria for Anti-AV-Physics prompts (e.g., deliberate violation of conservation laws while preserving semantic coherence). We will also add a paragraph on bias mitigation strategies, such as prompt length balancing and expert review for over-representation of object categories. These additions will directly support the validity of the reported performance drops on transition scenes. revision: yes

  2. Referee: [Section 4.2] Section 4.2 (Evaluation Dimensions): The five dimensions are presented as jointly measuring physical understanding, yet no quantitative grounding (e.g., comparison of generated clips against physics simulators for trajectory conservation or wave propagation) or ablation is provided to show they are necessary and sufficient; this is load-bearing for the claim that all models 'remain far from robust physical understanding.'

    Authors: We acknowledge the value of additional grounding. The dimensions were logically decomposed from physical commonsense literature into visual/audio/cross-modal components to enable fine-grained diagnosis. Full simulator comparisons are impractical for the benchmark's open-ended, diverse scenes. In revision, we will add to Section 4.2: explicit rationale with illustrative examples for each dimension, an ablation study on a representative subset demonstrating their individual contributions to aggregate scores, and an explicit limitations paragraph noting the absence of simulator-based verification. This will better substantiate the central claim while highlighting avenues for future work. revision: partial

  3. Referee: [Section 5] Section 5 (Results and Agent Validation): While agent-human alignment is claimed, no inter-rater agreement statistics, statistical significance tests on model differences, or confidence intervals are reported; the assertion that Seedance 2.0 is best overall and that proprietary models collapse on Anti-AV-Physics prompts therefore rests on unquantified differences.

    Authors: We agree that statistical quantification is essential for robust claims. In the revised Section 5, we will report: inter-rater agreement (e.g., Fleiss' kappa across human evaluators), statistical significance tests (e.g., Wilcoxon signed-rank tests with p-values) for all model pairwise differences, and 95% confidence intervals on the per-dimension and aggregate scores. These will be incorporated into the text, tables, and figures. This will provide quantitative backing for Seedance 2.0's ranking and the observed collapses on Anti-AV-Physics prompts. revision: yes

  4. Referee: [Section 4.3] Section 4.3 (AV-Phys Agent): The ReAct-style agent relies on LLM judgments for visual/audio physical commonsense and cross-modal consistency without external physics-based verification (e.g., extracting 3D motion or acoustic spectra and checking against conservation equations), leaving open whether low scores reflect genuine violations or evaluator heuristics.

    Authors: We appreciate this point on verification. The agent hybridizes multimodal LLM reasoning with deterministic acoustic tools (e.g., spectral analysis for wave propagation checks) to ground audio evaluations. Comprehensive external physics simulation (3D trajectories, conservation equations) is not scalable across the benchmark's varied creative scenes without per-scene custom simulators. In revision, we will expand Section 4.3 to: detail the acoustic tool implementations with examples, describe prompting strategies that explicitly reference physical principles, and include a dedicated limitations discussion on potential LLM heuristics alongside suggestions for simulator integration in future extensions. The human alignment results provide empirical support for the current approach. revision: partial
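
Response 4 leans on deterministic acoustic tools such as spectral analysis. As one concrete illustration of what such a tool could look like (the function, segmentation choices, and thresholds below are assumptions, not the authors' tooling), here is a pitch-trend check of the kind the xylophone prompt in Figure 5 would exercise:

```python
# Deterministic acoustic check: does pitch rise across successive strikes?
import numpy as np
import librosa

def rising_pitch_fraction(audio_path: str, sr: int = 16000) -> dict:
    y, sr = librosa.load(audio_path, sr=sr, mono=True)
    # Segment the clip at note onsets (e.g. individual xylophone strikes).
    onsets = librosa.onset.onset_detect(y=y, sr=sr, units="samples")
    bounds = list(onsets) + [len(y)]
    per_note = []
    for start, end in zip(bounds[:-1], bounds[1:]):
        seg = y[start:end]
        if len(seg) < sr // 20:      # ignore segments shorter than 50 ms
            continue
        # Median spectral centroid as a cheap proxy for perceived pitch height.
        centroid = librosa.feature.spectral_centroid(y=seg, sr=sr)[0]
        per_note.append(float(np.median(centroid)))
    diffs = np.diff(per_note)
    return {
        "num_notes": len(per_note),
        "fraction_rising": float(np.mean(diffs > 0)) if diffs.size else float("nan"),
    }

# An evaluator could compare fraction_rising with what the prompt implies:
# close to 1.0 for "longest bar to shortest", close to 0.0 for the reverse.
```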

Circularity Check

0 steps flagged

No significant circularity in empirical benchmark evaluation

full rationale

The paper is an empirical benchmark study that introduces AV-Phys Bench and AV-Phys Agent to evaluate joint audio-video generation models on physical commonsense. There are no mathematical derivations, equations, fitted parameters, or first-principles predictions that reduce to self-defined inputs. Central claims rest on observed performance differences across models, scene categories, and prompt types, with the agent explicitly validated by alignment to human ratings. No load-bearing steps rely on self-citation chains, ansatzes, or renaming of known results; the work is self-contained as an experimental evaluation.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 2 invented entities

The central claim rests on the domain assumption that the chosen scene categories and evaluation dimensions capture physical commonsense; no free parameters or invented physical entities are introduced.

axioms (1)
  • domain assumption Physical commonsense in audio-video pairs can be decomposed into the five listed adherence and consistency dimensions
    Invoked when defining the evaluation protocol in the abstract.
invented entities (2)
  • AV-Phys Bench no independent evidence
    purpose: Benchmark for physical commonsense in joint AV generation
    Newly introduced test suite; no independent evidence outside the paper.
  • AV-Phys Agent no independent evidence
    purpose: ReAct-style automated evaluator combining MLLM with acoustic tools
    Newly introduced tool; no independent evidence outside the paper.

pith-pipeline@v0.9.0 · 5562 in / 1293 out tokens · 28820 ms · 2026-05-11T02:08:27.608016+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.


Reference graph

Works this paper leans on

49 extracted references · 49 canonical work pages · 10 internal anchors

  1. [1]

    Cosmos World Foundation Model Platform for Physical AI

    Niket Agarwal, Arslan Ali, Maciej Bala, Yogesh Balaji, Erik Barker, Tiffany Cai, Prithvijit Chattopadhyay, Yongxin Chen, Yin Cui, Yifan Ding, et al. Cosmos world foundation model platform for physical ai.arXiv preprint arXiv:2501.03575, 2025

  2. [2]

    VideoPhy: Evaluating Physical Commonsense for Video Generation

    Hritik Bansal, Zongyu Lin, Tianyi Xie, Zeshun Zong, Michal Yarom, Yonatan Bitton, Chenfanfu Jiang, Yizhou Sun, Kai-Wei Chang, and Aditya Grover. Videophy: Evaluating physical commonsense for video generation.arXiv preprint arXiv:2406.03520, 2024

  3. [3]

    Videophy-2: A challenging action-centric physical commonsense evaluation in video generation.arXiv preprint arXiv:2503.06800, 2025

    Hritik Bansal, Clark Peng, Yonatan Bitton, Roman Goldenberg, Aditya Grover, and Kai-Wei Chang. Videophy-2: A challenging action-centric physical commonsense evaluation in video generation.arXiv preprint arXiv:2503.06800, 2025

  4. [4]

    Video generation models as world simulators

    Tim Brooks, Bill Peebles, Connor Holmes, Will DePue, Yufei Guo, Leo Jing, David Schnurr, Joe Taylor, Troy Luhman, Eric Luhman, et al. Video generation models as world simulators. OpenAI Blog, 1(8):1, 2024

  5. [5]

    Genie: Generative interactive environments

    Jake Bruce, Michael D Dennis, Ashley Edwards, Jack Parker-Holder, Yuge Shi, Edward Hughes, Matthew Lai, Aditi Mavalankar, Richie Steigerwald, Chris Apps, et al. Genie: Generative interactive environments. InForty-first International Conference on Machine Learning, 2024

  6. [6]

    T2av-compass: Towards unified evaluation for text-to-audio-video generation.arXiv preprint arXiv:2512.21094, 2025

    Zhe Cao, Tao Wang, Jiaming Wang, Yanghai Wang, Yuanxing Zhang, Jialu Chen, Miao Deng, Jiahao Wang, Yubin Guo, Chenxi Liao, et al. T2av-compass: Towards unified evaluation for text-to-audio-video generation.arXiv preprint arXiv:2512.21094, 2025

  7. [7]

    Soundspaces 2.0: A simulation platform for visual-acoustic learning.Advances in Neural Information Processing Systems, 35:8896– 8911, 2022

    Changan Chen, Carl Schissler, Sanchit Garg, Philip Kobernik, Alexander Clegg, Paul Calamia, Dhruv Batra, Philip Robinson, and Kristen Grauman. Soundspaces 2.0: A simulation platform for visual-acoustic learning.Advances in Neural Information Processing Systems, 35:8896– 8911, 2022

  8. [8]

    Speed by simplicity: A single-stream architecture for fast audio-video generative foundation model.arXiv preprint arXiv:2603.21986, 2026

    Ethan Chern, Hansi Teng, Hanwen Sun, Hao Wang, Hong Pan, Hongyu Jia, Jiadi Su, Jin Li, Junjie Yu, Lijie Liu, et al. Speed by simplicity: A single-stream architecture for fast audio-video generative foundation model.arXiv preprint arXiv:2603.21986, 2026

  9. [9]

    Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

    Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities.arXiv preprint arXiv:2507.06261, 2025

  10. [10]

    Introducing Veo 3.1 and advanced capabilities in Flow

    Jess Gallegos, Thomas Iljic, and Google DeepMind. Introducing Veo 3.1 and advanced capabilities in Flow. https://blog.google/technology/ai/veo-updates-flow/, October

  11. [11]

    Technical details inherited from the Veo 3 Tech Re- port,https://storage.googleapis.com/deepmind-media/veo/Veo-3-Tech-Report

    Google Blog, October 15, 2025. Technical details inherited from the Veo 3 Tech Re- port,https://storage.googleapis.com/deepmind-media/veo/Veo-3-Tech-Report. pdf

  12. [12]

    Look, listen, and act: Towards audio-visual embodied navigation

    Chuang Gan, Yiwei Zhang, Jiajun Wu, Boqing Gong, and Joshua B Tenenbaum. Look, listen, and act: Towards audio-visual embodied navigation. In2020 IEEE International Conference on Robotics and Automation (ICRA), pages 9701–9707. IEEE, 2020

  13. [13]

    PhyWorldBench: A Comprehensive Evaluation of Physical Realism in Text-to-Video Models

    Jing Gu, Xian Liu, Yu Zeng, Ashwin Nagarajan, Fangrui Zhu, Daniel Hong, Yue Fan, Qianqi Yan, Kaiwen Zhou, Ming-Yu Liu, et al. PhyWorldBench: A comprehensive evaluation of physical realism in text-to-video models. arXiv preprint arXiv:2507.13428, 2025

  14. [14]

    T2vphysbench: A first-principles benchmark for physical consistency in text-to-video generation.arXiv preprint arXiv:2505.00337, 2025

    Xuyang Guo, Jiayan Huo, Zhenmei Shi, Zhao Song, Jiahao Zhang, and Jiale Zhao. T2vphysbench: A first-principles benchmark for physical consistency in text-to-video generation. arXiv preprint arXiv:2505.00337, 2025

  15. [15]

    LTX-2: Efficient Joint Audio-Visual Foundation Model

    Yoav HaCohen, Benny Brazowski, Nisan Chiprut, Yaki Bitterman, Andrew Kvochko, Avishai Berkowitz, Daniel Shalem, Daphna Lifschitz, Dudu Moshe, Eitan Porat, et al. Ltx-2: Efficient joint audio-visual foundation model. arXiv preprint arXiv:2601.03233, 2026

  16. [16]

    Videoscore: Building automatic metrics to simulate fine-grained human feedback for video generation

    Xuan He, Dongfu Jiang, Ge Zhang, Max Ku, Achint Soni, Sherman Siu, Haonan Chen, Abhranil Chandra, Ziyan Jiang, Aaran Arulraj, et al. Videoscore: Building automatic metrics to simulate fine-grained human feedback for video generation. InProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 2105–2123, 2024

  17. [17]

    Clipscore: A reference-free evaluation metric for image captioning

    Jack Hessel, Ari Holtzman, Maxwell Forbes, Ronan Le Bras, and Yejin Choi. Clipscore: A reference-free evaluation metric for image captioning. InProceedings of the 2021 conference on empirical methods in natural language processing, pages 7514–7528, 2021

  18. [18]

    VABench: A Comprehensive Benchmark for Audio-Video Generation

    Daili Hua, Xizhi Wang, Bohan Zeng, Xinyi Huang, Hao Liang, Junbo Niu, Xinlong Chen, Quanqing Xu, and Wentao Zhang. Vabench: A comprehensive benchmark for audio-video generation.arXiv preprint arXiv:2512.09299, 2025

  19. [19]

    Vbench: Comprehensive benchmark suite for video generative models

    Ziqi Huang, Yinan He, Jiashuo Yu, Fan Zhang, Chenyang Si, Yuming Jiang, Yuanhan Zhang, Tianxing Wu, Qingyang Jin, Nattapol Chanpaisit, et al. Vbench: Comprehensive benchmark suite for video generative models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 21807–21818, 2024

  20. [20]

    A reference-free metric for evaluating music enhancement algorithms

    K Kilgour, M Zuluaga, D Roblek, and M Sharifi. A reference-free metric for evaluating music enhancement algorithms. Interspeech, 2019

  21. [21]

    The measurement of observer agreement for categorical data.biometrics, pages 159–174, 1977

    J Richard Landis and Gary G Koch. The measurement of observer agreement for categorical data.biometrics, pages 159–174, 1977

  22. [22]

    Video generation models: A survey of post-training and alignment

    Chaoyu Li, Xiaoyi Gu, Yogesh Kulkarni, Eun Woo Im, Mohammadmahdi Honarmand, Zeyu Wang, Juntong Song, Fei Du, Xilin Jiang, Kexin Zheng, et al. Video generation models: A survey of post-training and alignment. 2026

  23. [23]

    Javisdit: Joint audio-video diffusion transformer with hierarchical spatio-temporal prior synchronization.arXiv preprint arXiv:2503.23377, 2025

    Kai Liu, Wei Li, Lai Chen, Shengqiong Wu, Yanhao Zheng, Jiayi Ji, Fan Zhou, Jiebo Luo, Ziwei Liu, Hao Fei, et al. Javisdit: Joint audio-video diffusion transformer with hierarchical spatio-temporal prior synchronization.arXiv preprint arXiv:2503.23377, 2025

  24. [24]

    Javisdit++: Unified modeling and optimization for joint audio-video generation.arXiv preprint arXiv:2602.19163, 2026

    Kai Liu, Yanhao Zheng, Kai Wang, Shengqiong Wu, Rongjunchen Zhang, Jiebo Luo, Dimitrios Hatzinakos, Ziwei Liu, Hao Fei, and Tat-Seng Chua. Javisdit++: Unified modeling and optimization for joint audio-video generation.arXiv preprint arXiv:2602.19163, 2026

  25. [25]

    Caven: An embodied conversational agent for efficient audio-visual navigation in noisy environments

    Xiulong Liu, Sudipta Paul, Moitreya Chatterjee, and Anoop Cherian. Caven: An embodied conversational agent for efficient audio-visual navigation in noisy environments. InProceedings of the AAAI conference on artificial intelligence, volume 38, pages 3765–3773, 2024

  26. [26]

    Ovi: Twin backbone cross-modal fusion for audio-video generation

    Chetwin Low, Weimin Wang, and Calder Katyal. Ovi: Twin backbone cross-modal fusion for audio-video generation.arXiv preprint arXiv:2510.01284, 2025

  27. [27]

    Tavgbench: Benchmarking text to audible-video generation

    Yuxin Mao, Xuyang Shen, Jing Zhang, Zhen Qin, Jinxing Zhou, Mochu Xiang, Yiran Zhong, and Yuchao Dai. Tavgbench: Benchmarking text to audible-video generation. InProceedings of the 32nd ACM International Conference on Multimedia, pages 6607–6616, 2024

  28. [28]

    Towards World Simulator: Crafting Physical Commonsense-Based Benchmark for Video Generation

    Fanqing Meng, Jiaqi Liao, Xinyu Tan, Wenqi Shao, Quanfeng Lu, Kaipeng Zhang, Yu Cheng, Dianqi Li, Yu Qiao, and Ping Luo. Towards world simulator: Crafting physical commonsense-based benchmark for video generation. arXiv preprint arXiv:2410.05363, 2024

  29. [29]

    Do Generative Video Models Understand Physical Principles?

    Saman Motamed, Laura Culp, Kevin Swersky, Priyank Jaini, and Robert Geirhos. Do generative video models understand physical principles? In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 948–958, 2026

  30. [30]

    Sora 2, 2025

    OpenAI. Sora 2, 2025. Accessed: 2026-05-04

  31. [31]

    Seedance 2.0: Advancing Video Generation for World Complexity

    Team Seedance, De Chen, Liyang Chen, Xin Chen, Ying Chen, Zhuo Chen, Zhuowei Chen, Feng Cheng, Tianheng Cheng, Yufeng Cheng, et al. Seedance 2.0: Advancing video generation for world complexity.arXiv preprint arXiv:2604.14148, 2026

  32. [32]

    Savgbench: Benchmarking spatially aligned audio-video generation

    Kazuki Shimada, Christian Simon, Takashi Shibuya, Shusuke Takahashi, and Yuki Mitsufuji. Savgbench: Benchmarking spatially aligned audio-video generation. In ICASSP 2026-2026 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 11977–11981. IEEE, 2026

  33. [33]

    OpenAI GPT-5 System Card

    Aaditya Singh, Adam Fry, Adam Perelman, Adam Tart, Adi Ganesh, Ahmed El-Kishky, Aidan McLaughlin, Aiden Low, AJ Ostrow, Akhila Ananthram, et al. Openai gpt-5 system card.arXiv preprint arXiv:2601.03267, 2025

  34. [34]

    T2V-CompBench: A Comprehensive Benchmark for Compositional Text-to-Video Generation

    Kaiyue Sun, Kaiyi Huang, Xian Liu, Yue Wu, Zihan Xu, Zhenguo Li, and Xihui Liu. T2v-compbench: A comprehensive benchmark for compositional text-to-video generation. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 8406–8416, 2025

  35. [35]

    Sonicbench: Dissecting the physical perception bottleneck in large audio language models.arXiv preprint arXiv:2601.11039, 2026

    Yirong Sun, Yanjun Chen, Xin Qiu, Gang Zhang, Hongyu Chen, Daokuan Wu, Chengming Li, Min Yang, Dawei Zhu, Wei Zhang, et al. Sonicbench: Dissecting the physical perception bottleneck in large audio language models.arXiv preprint arXiv:2601.11039, 2026

  36. [36]

    Kling-Omni Technical Report

    Kling Team, Jialu Chen, Yuanzheng Ci, Xiangyu Du, Zipeng Feng, Kun Gai, Sainan Guo, Feng Han, Jingbin He, Kang He, et al. Kling-omni technical report.arXiv preprint arXiv:2512.16776, 2025

  37. [37]

    Qwen3.5-Omni Technical Report

    Qwen Team. Qwen3.5-omni technical report. arXiv preprint arXiv:2604.15804, 2026

  38. [38]

    Towards Accurate Generative Models of Video: A New Metric & Challenges

    Thomas Unterthiner, Sjoerd Van Steenkiste, Karol Kurach, Raphael Marinier, Marcin Michalski, and Sylvain Gelly. Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:1812.01717, 2018

  39. [39]

    Universe-1: Unified audio-video generation via stitching of experts.arXiv preprint arXiv:2509.06155, 2025

    Duomin Wang, Wei Zuo, Aojie Li, Ling-Hao Chen, Xinyao Liao, Deyu Zhou, Zixin Yin, Xili Dai, Daxin Jiang, and Gang Yu. Universe-1: Unified audio-video generation via stitching of experts.arXiv preprint arXiv:2509.06155, 2025

  40. [40]

    PhyAVBench: A Challenging Audio Physics-Sensitivity Benchmark for Physically Grounded Text-to-Audio-Video Generation

    Tianxin Xie, Wentao Lei, Kai Jiang, Guanjie Huang, Pengfei Zhang, Chunhui Zhang, Fengji Ma, Haoyu He, Han Zhang, Jiangshan He, et al. Phyavbench: A challenging audio physics-sensitivity benchmark for physically grounded text-to-audio-video generation. arXiv preprint arXiv:2512.23994, 2025

  41. [41]

    A survey on video diffusion models.ACM Computing Surveys, 57(2):1–42, 2024

    Zhen Xing, Qijun Feng, Haoran Chen, Qi Dai, Han Hu, Hang Xu, Zuxuan Wu, and Yu-Gang Jiang. A survey on video diffusion models.ACM Computing Surveys, 57(2):1–42, 2024

  42. [42]

    A Systematic Post-Train Framework for Video Generation

    Zeyue Xue, Siming Fu, Jie Huang, Shuai Lu, Haoran Li, Yijun Liu, Yuming Li, Xiaoxuan He, Mengzhao Chen, Haoyang Huang, et al. A systematic post-train framework for video generation.arXiv preprint arXiv:2604.25427, 2026

  43. [43]

    ReAct: Synergizing Reasoning and Acting in Language Models

    Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. React: Synergizing reasoning and acting in language models.arXiv preprint arXiv:2210.03629, 2022

  44. [44]

    Diverse and aligned audio-to-video generation via text-to-video model adaptation

    Guy Yariv, Itai Gat, Sagie Benaim, Lior Wolf, Idan Schwartz, and Yossi Adi. Diverse and aligned audio-to-video generation via text-to-video model adaptation. InProceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 6639–6647, 2024

  45. [45]

    Virbo: Multimodal Multilingual Avatar Video Generation in Digital Marketing

    Juan Zhang, Jiahao Chen, Cheng Wang, Zhiwang Yu, Tangquan Qi, Can Liu, and Di Wu. Virbo: Multimodal multilingual avatar video generation in digital marketing. arXiv preprint arXiv:2403.11700, 2024
