arxiv: 2406.03520 · v2 · pith:QZHGDGY5new · submitted 2024-06-05 · 💻 cs.CV · cs.AI · cs.LG

VideoPhy: Evaluating Physical Commonsense for Video Generation

Pith reviewed 2026-05-20 11:30 UTC · model grok-4.3

classification 💻 cs.CV cs.AI cs.LG
keywords text-to-video generation, physical commonsense, benchmark, video evaluation, generative models, human evaluation

The pith

Even the best text-to-video model tested generates videos that follow both the caption and physical laws in fewer than 40 percent of cases.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces VideoPhy, a benchmark of text prompts that describe everyday physical interactions between solids, fluids, and other materials. Researchers generate videos from multiple current models and ask human evaluators to check whether each video matches the prompt and also obeys real-world physics such as rolling, pouring, or breaking. Even the strongest model tested, CogVideoX-5B, succeeds on both criteria only 39.6 percent of the time. The results indicate that present generators are still far from functioning as accurate simulators of the physical world. The authors also release an automated scorer, VideoCon-Physics, to evaluate future models more quickly.

Core claim

VideoPhy reveals that existing text-to-video generative models often fail both to follow the given text prompt and to respect physical commonsense: the best model, CogVideoX-5B, satisfies both criteria on only 39.6 percent of instances.

What carries the argument

The VideoPhy benchmark, which supplies diverse prompts involving material-type interactions and measures success via human judgment of caption adherence plus physical-law compliance.
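
As a rough illustration of how a joint score of this kind could be tallied, the sketch below assumes each video receives two binary human labels, one for caption adherence and one for physical-law compliance. The `Rating` class, its field names, and the toy data are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class Rating:
    follows_caption: bool  # hypothetical label: video matches the prompt
    follows_physics: bool  # hypothetical label: video obeys physical commonsense


def joint_success_rate(ratings):
    """Fraction of videos judged to satisfy both criteria at once."""
    if not ratings:
        raise ValueError("need at least one rating")
    both = sum(r.follows_caption and r.follows_physics for r in ratings)
    return both / len(ratings)


# Toy data, invented for illustration only.
toy = [
    Rating(True, True),
    Rating(True, False),
    Rating(False, True),
    Rating(False, False),
    Rating(True, True),
]
print(f"joint success: {joint_success_rate(toy):.1%}")
```

A video counts as a success only when both labels are true, so the joint rate can never exceed either single-criterion rate.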

If this is right

  • Video generative models remain far from accurately simulating the physical world.
  • Progress on future models can be tracked with the released VideoPhy prompts and protocol.
  • The automated VideoCon-Physics evaluator can be applied to newly released models without repeated human studies.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Improved performance on VideoPhy could make generated videos more usable for planning tasks that require realistic motion.
  • Weak results on fluid-solid interactions may indicate specific gaps that targeted training data or loss terms could address.
  • The same curation approach could be extended to create benchmarks for other forms of commonsense such as object permanence or causal chains.

Load-bearing premise

Human evaluators can reliably and consistently judge whether a generated video follows physical commonsense for the curated prompts.

What would settle it

A new model that produces videos judged by humans to follow both the prompt and physical laws on more than 70 percent of VideoPhy instances would weaken the claim that current generators lack physical commonsense.

Original abstract

Recent advances in internet-scale video data pretraining have led to the development of text-to-video generative models that can create high-quality videos across a broad range of visual concepts, synthesize realistic motions and render complex objects. Hence, these generative models have the potential to become general-purpose simulators of the physical world. However, it is unclear how far we are from this goal with the existing text-to-video generative models. To this end, we present VideoPhy, a benchmark designed to assess whether the generated videos follow physical commonsense for real-world activities (e.g. marbles will roll down when placed on a slanted surface). Specifically, we curate diverse prompts that involve interactions between various material types in the physical world (e.g., solid-solid, solid-fluid, fluid-fluid). We then generate videos conditioned on these captions from diverse state-of-the-art text-to-video generative models, including open models (e.g., CogVideoX) and closed models (e.g., Lumiere, Dream Machine). Our human evaluation reveals that the existing models severely lack the ability to generate videos adhering to the given text prompts, while also lack physical commonsense. Specifically, the best performing model, CogVideoX-5B, generates videos that adhere to the caption and physical laws for 39.6% of the instances. VideoPhy thus highlights that the video generative models are far from accurately simulating the physical world. Finally, we propose an auto-evaluator, VideoCon-Physics, to assess the performance reliably for the newly released models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 3 minor

Summary. The paper introduces VideoPhy, a benchmark for assessing physical commonsense in text-to-video generative models. It curates prompts involving material interactions (solid-solid, solid-fluid, fluid-fluid), generates videos from open and closed SOTA models (e.g., CogVideoX-5B, Lumiere), and reports human evaluation results showing that even the best model adheres to both the caption and physical laws in only 39.6% of cases. The work also proposes an automatic evaluator, VideoCon-Physics, for scalable assessment of future models.

Significance. If the human evaluation results hold, this benchmark provides concrete evidence of a substantial gap in current video generation models' ability to simulate real-world physics, which is important for their potential use as general-purpose simulators. The direct use of human judgments on curated physical interactions supplies falsifiable, model-agnostic evidence rather than relying on self-referential metrics. The proposal of VideoCon-Physics is a constructive addition for reproducibility and future work. The evaluation across both open and closed models and the focus on diverse material-type interactions are particular strengths.

major comments (2)
  1. [Human Evaluation] Human evaluation protocol: the central 39.6% figure for CogVideoX-5B (and all other reported percentages) is presented without inter-annotator agreement statistics (e.g., Fleiss' kappa or pairwise agreement) or error bars on the physical-laws label. Because the claim that models 'severely lack' physical commonsense rests directly on these human judgments, the absence of agreement data makes it difficult to separate model failure from annotator variance.
  2. [Benchmark Construction] Prompt curation and validation: the description of how prompts were selected and verified to test genuine physical commonsense (rather than ambiguous or underspecified cases) remains high-level. More detail on the curation process, including any expert review or pilot testing for physical accuracy, would be needed to establish that the benchmark instances are load-bearing tests of the claimed capability gap.
minor comments (3)
  1. [Abstract] The abstract and results sections use the phrase 'severely lack' for the 39.6% figure; a more precise statement of the quantitative gap would improve tone and clarity.
  2. [Results] A table or figure presenting a per-category breakdown (solid-solid vs. fluid-fluid, etc.) would help readers assess whether failures are uniform or concentrated in particular interaction types.
  3. [Auto-Evaluator] The auto-evaluator VideoCon-Physics is introduced but its correlation with human judgments and any ablation on its training data are not detailed enough for independent reproduction.
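
The agreement statistic requested in the first major comment can be computed directly from a table of rating counts. The following is a minimal sketch of Fleiss' kappa in plain Python. It assumes every video is rated by the same number of annotators, and the three-rater binary toy table is invented for illustration, not drawn from the paper's annotations.

```python
def fleiss_kappa(counts):
    """Fleiss' kappa for a table where counts[i][j] is the number of
    raters who assigned item i to category j."""
    if not counts:
        raise ValueError("need at least one item")
    n_raters = sum(counts[0])
    if n_raters < 2:
        raise ValueError("need at least two raters per item")
    if any(sum(row) != n_raters for row in counts):
        raise ValueError("every item needs the same number of raters")
    n_items = len(counts)
    n_categories = len(counts[0])

    # Per-item agreement: share of rater pairs that agree on item i.
    per_item = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    observed = sum(per_item) / n_items

    # Chance agreement from the overall share of each category.
    shares = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(n_categories)
    ]
    expected = sum(p * p for p in shares)
    if expected == 1.0:
        raise ValueError("kappa is undefined when all ratings fall in one category")
    return (observed - expected) / (1.0 - expected)


# Toy table: 4 videos, 3 hypothetical raters, columns = [violates, obeys].
toy = [[3, 0], [0, 3], [2, 1], [1, 2]]
print(f"Fleiss' kappa: {fleiss_kappa(toy):.3f}")
```

Values near 1 indicate strong agreement and values near 0 indicate agreement no better than chance, which is the distinction the referee asks the authors to document.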

Simulated Author's Rebuttal

2 responses · 0 unresolved

Thank you for the constructive feedback on our paper. We address the major comments below and plan to incorporate revisions to improve the clarity and rigor of our human evaluation and benchmark construction sections.

  1. Referee: [Human Evaluation] Human evaluation protocol: the central 39.6% figure for CogVideoX-5B (and all other reported percentages) is presented without inter-annotator agreement statistics (e.g., Fleiss' kappa or pairwise agreement) or error bars on the physical-laws label. Because the claim that models 'severely lack' physical commonsense rests directly on these human judgments, the absence of agreement data makes it difficult to separate model failure from annotator variance.

    Authors: We agree that reporting inter-annotator agreement is important for validating the reliability of our human evaluation results. In the revised manuscript, we will include Fleiss' kappa scores for the annotations on physical adherence and caption adherence. Additionally, we will provide error bars or confidence intervals for the reported percentages to better quantify the variability in the human judgments. This will help demonstrate that the observed low performance is indeed due to model limitations rather than annotator disagreement. revision: yes

  2. Referee: [Benchmark Construction] Prompt curation and validation: the description of how prompts were selected and verified to test genuine physical commonsense (rather than ambiguous or underspecified cases) remains high-level. More detail on the curation process, including any expert review or pilot testing for physical accuracy, would be needed to establish that the benchmark instances are load-bearing tests of the claimed capability gap.

    Authors: We thank the referee for this suggestion. In the original manuscript, we provided a high-level overview of the prompt curation to maintain focus on the evaluation results. However, we acknowledge that additional details would enhance the reproducibility and credibility of the benchmark. In the revised version, we will expand the section on benchmark construction to include more specifics on the prompt selection criteria, the process of verifying physical accuracy through pilot studies, and any expert consultations or reviews conducted to ensure the prompts test genuine physical commonsense without ambiguity. revision: yes
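
One common way to attach the confidence intervals mentioned in the first response to a reported percentage is the Wilson score interval. The sketch below is an assumption about how that could be done, not the authors' stated method, and the counts used are toy values rather than figures from the paper.

```python
import math


def wilson_interval(successes, total, z=1.96):
    """Wilson score interval for a binomial proportion.

    z=1.96 corresponds to an approximate 95% confidence level.
    """
    if total <= 0 or not 0 <= successes <= total:
        raise ValueError("need total > 0 and 0 <= successes <= total")
    p = successes / total
    z2 = z * z
    denom = 1.0 + z2 / total
    center = (p + z2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z2 / (4 * total * total)) / denom
    return center - half, center + half


# Toy counts, invented for illustration: 50 joint successes out of 100 videos.
low, high = wilson_interval(50, 100)
print(f"95% interval: [{low:.3f}, {high:.3f}]")
```

The Wilson interval behaves sensibly for proportions near 0 or 1 and for small samples, which makes it a reasonable default when some per-category subsets contain few videos.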

Circularity Check

0 steps flagged

No circularity in benchmark evaluation or auto-evaluator proposal

The paper curates a set of text prompts involving physical interactions across material types and evaluates outputs from existing text-to-video models via human judgment on adherence to both captions and physical laws. No equations, parameter fitting, or first-principles derivations are claimed; the 39.6% figure for CogVideoX-5B is a direct empirical count from external model generations and annotator labels. The proposed VideoCon-Physics auto-evaluator is introduced as a new tool without reducing to any self-citation chain or redefinition of inputs. All load-bearing steps rely on independent human evaluation protocols and publicly available generative models rather than internal consistency loops.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the assumption that the curated prompts adequately sample physical commonsense and that human raters apply consistent criteria; no free parameters are fitted, no new entities are postulated, and axioms are standard assumptions about physical laws and evaluation validity.

axioms (1)
  • domain assumption: Human raters can accurately detect violations of physical commonsense in short video clips.
    Invoked in the human evaluation section to interpret the 39.6% adherence rate.

pith-pipeline@v0.9.0 · 5843 in / 1222 out tokens · 36597 ms · 2026-05-20T11:30:01.354328+00:00 · methodology

Forward citations

Cited by 20 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. PhyGround: Benchmarking Physical Reasoning in Generative World Models

    cs.CV 2026-05 accept novelty 7.0

    PhyGround is a new benchmark with curated prompts, a 13-law taxonomy, large-scale human annotations, and an open physics-specialized VLM judge for evaluating physical reasoning in generative video models.

  2. Do Joint Audio-Video Generation Models Understand Physics?

    cs.SD 2026-05 unverdicted novelty 7.0

    Current joint audio-video generation models lack robust physical commonsense, especially during transitions and when prompted for impossible behaviors.

  3. CMTA: Leveraging Cross-Modal Temporal Artifacts for Generalizable AI-Generated Video Detection

    cs.CV 2026-05 unverdicted novelty 7.0

    CMTA detects AI-generated videos by capturing unnatural temporal stability in visual-textual semantic alignment via joint embeddings and multi-grained temporal modeling, outperforming prior methods in cross-generator tests.

  4. MoRight: Motion Control Done Right

    cs.CV 2026-04 unverdicted novelty 7.0

    MoRight disentangles object and camera motion via canonical-view specification and temporal cross-view attention, while decomposing motion into active user-driven and passive consequence components to learn and apply ...

  5. PlayWorld: Learning Robot World Models from Autonomous Play

    cs.RO 2026-03 unverdicted novelty 7.0

    PlayWorld learns high-fidelity robot world models from unsupervised self-play, producing physically consistent video predictions that outperform models trained on human data and enabling 65% better real-world policy p...

  6. VideoASMR-Bench: Can AI-Generated ASMR Videos Fool VLMs and Humans?

    cs.CV 2025-12 unverdicted novelty 7.0

    VideoASMR-Bench shows state-of-the-art VLMs fail to reliably detect AI-generated ASMR videos from real ones, though humans can still identify the fakes relatively easily.

  7. DreamGen: Unlocking Generalization in Robot Learning through Video World Models

    cs.RO 2025-05 unverdicted novelty 7.0

    DreamGen trains robot policies on synthetic trajectories from adapted video world models, enabling a humanoid robot to perform 22 new behaviors in seen and unseen environments from a single pick-and-place teleoperatio...

  8. NEWTON: Agentic Planning for Physically Grounded Video Generation

    cs.CV 2026-05 unverdicted novelty 6.0

    NEWTON improves physical accuracy in video generation by deploying a trainable planner that coordinates physics-aware tools and a verifier, raising joint accuracy on VideoPhy-2 without altering the base generators.

  9. Enhancing Train-Free Infinite-Frame Generation for Consistent Long Videos

    cs.CV 2026-05 unverdicted novelty 6.0

    MIGA introduces two-stage alignment to close train-inference gaps and dual consistency enhancement via self-reflection and long-range guidance to achieve SOTA temporal consistency in infinite-frame video generation on...

  10. PanoWorld: Geometry-Consistent Panoramic Video World Modeling

    cs.CV 2026-05 unverdicted novelty 6.0

    PanoWorld adds depth consistency and trajectory consistency losses plus spherical adaptations to a pre-trained video model, plus a new PanoGeo dataset, to produce geometry-consistent 360 video.

  11. Quantitative Video World Model Evaluation for Geometric-Consistency

    cs.CV 2026-05 unverdicted novelty 6.0

    PDI-Bench computes 3D projective residuals from segmented and tracked points to quantify geometric inconsistency in AI-generated videos.

  12. How Far Are Video Models from True Multimodal Reasoning?

    cs.CV 2026-04 unverdicted novelty 6.0

    Current video models succeed on basic understanding but achieve under 25% success on logically grounded generation and near 0% on interactive generation, exposing gaps in multimodal reasoning.

  13. ATSS: Detecting AI-Generated Videos via Anomalous Temporal Self-Similarity

    cs.CV 2026-04 unverdicted novelty 6.0

    ATSS detects AI-generated videos by measuring unnatural repetitive temporal correlations in triple similarity matrices derived from frame visuals and semantic descriptions.

  14. RAPO++: Cross-Stage Prompt Optimization for Text-to-Video Generation via Data Alignment and Test-Time Scaling

    cs.CV 2025-10 unverdicted novelty 6.0

    RAPO++ is a three-stage prompt optimization framework combining retrieval-augmented refinement, closed-loop test-time scaling, and LLM fine-tuning to enhance text-to-video generation quality.

  15. Robotic Manipulation by Imitating Generated Videos Without Physical Demonstrations

    cs.RO 2025-07 unverdicted novelty 6.0

    RIGVid shows that filtered AI-generated videos can serve as effective supervision for complex robotic manipulation tasks without any real demonstrations.

  16. MAGI-1: Autoregressive Video Generation at Scale

    cs.CV 2025-05 unverdicted novelty 6.0

    MAGI-1 is a 24B-parameter autoregressive video world model that predicts denoised frame chunks sequentially with increasing noise to enable causal, scalable, streaming generation up to 4M token contexts.

  17. Towards World Simulator: Crafting Physical Commonsense-Based Benchmark for Video Generation

    cs.CV 2024-10 unverdicted novelty 6.0

    PhyGenBench supplies 160 prompts across 27 physical laws and an automated LLM/VLM evaluation pipeline to measure physical commonsense compliance in current text-to-video models.

  18. Actionable World Representation

    cs.AI 2026-05 unverdicted novelty 4.0

    WorldString is a fully differentiable neural model for representing actionable object states learned from 3D sensor data.

  19. World Action Models: The Next Frontier in Embodied AI

    cs.RO 2026-05 unverdicted novelty 4.0

    The paper introduces World Action Models as a new paradigm unifying predictive world modeling with action generation in embodied foundation models and provides a taxonomy of existing approaches.

  20. World Simulation with Video Foundation Models for Physical AI

    cs.CV 2025-10 unverdicted novelty 4.0

    Cosmos-Predict2.5 unifies text-to-world, image-to-world, and video-to-world generation in one model trained on 200M clips with RL post-training, delivering improved quality and control for physical AI.

Reference graph

Works this paper leans on

135 extracted references · 135 canonical work pages · cited by 20 Pith papers · 20 internal anchors

  1. [1]

    Luma Dream Machine | AI Video Generator — lumalabs.ai

    Luma AI. Luma Dream Machine | AI Video Generator — lumalabs.ai. https://lumalabs. ai/dream-machine, 2024

  2. [2]

    Frozen in time: A joint video and image encoder for end-to-end retrieval

    Max Bain, Arsha Nagrani, Gül Varol, and Andrew Zisserman. Frozen in time: A joint video and image encoder for end-to-end retrieval. In IEEE International Conference on Computer Vision, 2021

  3. [3]

    Videocon: Robust video-language alignment via contrast captions

    Hritik Bansal, Yonatan Bitton, Idan Szpektor, Kai-Wei Chang, and Aditya Grover. Videocon: Robust video-language alignment via contrast captions. arXiv preprint arXiv:2311.10111, 2023

  4. [4]

    Talc: Time-aligned captions for multi-scene text-to-video generation

    Hritik Bansal, Yonatan Bitton, Michal Yarom, Idan Szpektor, Aditya Grover, and Kai-Wei Chang. Talc: Time-aligned captions for multi-scene text-to-video generation. arXiv preprint arXiv:2405.04682, 2024

  5. [5]

    Comparing bad apples to good oranges: Aligning large language models via joint preference optimization

    Hritik Bansal, Ashima Suvarna, Gantavya Bhatt, Nanyun Peng, Kai-Wei Chang, and Aditya Grover. Comparing bad apples to good oranges: Aligning large language models via joint preference optimization. arXiv preprint arXiv:2404.00530, 2024

  6. [6]

    How well can text-to- image generative models understand ethical natural language interventions? arXiv preprint arXiv:2210.15230, 2022

    Hritik Bansal, Da Yin, Masoud Monajatipoor, and Kai-Wei Chang. How well can text-to- image generative models understand ethical natural language interventions? arXiv preprint arXiv:2210.15230, 2022

  7. [7]

    Lumiere: A space-time diffusion model for video generation

    Omer Bar-Tal, Hila Chefer, Omer Tov, Charles Herrmann, Roni Paiss, Shiran Zada, Ariel Ephrat, Junhwa Hur, Yuanzhen Li, Tomer Michaeli, et al. Lumiere: A space-time diffusion model for video generation. arXiv preprint arXiv:2401.12945, 2024

  8. [8]

    An introduction to physically based modeling: rigid body simulation i—unconstrained rigid body dynamics

    David Baraff. An introduction to physically based modeling: rigid body simulation i—unconstrained rigid body dynamics. SIGGRAPH course notes, 82, 1997

  9. [9]

    A fast variational framework for accurate solid-fluid coupling

    Christopher Batty, Florence Bertails, and Robert Bridson. A fast variational framework for accurate solid-fluid coupling. ACM Transactions on Graphics (TOG), 26(3):100–es, 2007

  10. [10]

    Piqa: Reasoning about physical commonsense in natural language

    Yonatan Bisk, Rowan Zellers, Jianfeng Gao, Yejin Choi, et al. Piqa: Reasoning about physical commonsense in natural language. In Proceedings of the AAAI conference on artificial intelligence, volume 34, pages 7432–7439, 2020

  11. [11]

    Visit-bench: A benchmark for vision-language instruction following inspired by real-world use

    Yonatan Bitton, Hritik Bansal, Jack Hessel, Rulin Shao, Wanrong Zhu, Anas Awadalla, Josh Gardner, Rohan Taori, and Ludwig Schmidt. Visit-bench: A benchmark for vision-language instruction following inspired by real-world use. arXiv preprint arXiv:2308.06595, 2023

  12. [13]

    Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets

    Andreas Blattmann, Tim Dockhorn, Sumith Kulal, Daniel Mendelevitch, Maciej Kilian, Do- minik Lorenz, Yam Levi, Zion English, Vikram V oleti, Adam Letts, et al. Stable video diffusion: Scaling latent video diffusion models to large datasets. arXiv preprint arXiv:2311.15127, 2023

  13. [14]

    Align your latents: High-resolution video synthesis with latent diffusion models

    Andreas Blattmann, Robin Rombach, Huan Ling, Tim Dockhorn, Seung Wook Kim, Sanja Fidler, and Karsten Kreis. Align your latents: High-resolution video synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22563–22575, 2023

  14. [15]

    Fluid simulation for computer graphics

    Robert Bridson. Fluid simulation for computer graphics. AK Peters/CRC Press, 2015

  15. [16]

    Generating long videos of dynamic scenes

    Tim Brooks, Janne Hellsten, Miika Aittala, Ting-Chun Wang, Timo Aila, Jaakko Lehtinen, Ming-Yu Liu, Alexei Efros, and Tero Karras. Generating long videos of dynamic scenes. Advances in Neural Information Processing Systems, 35:31769–31781, 2022. 12

  16. [17]

    Video generation models as world simulators

    Tim Brooks, Bill Peebles, Connor Holmes, Will DePue, Yufei Guo, Li Jing, David Schnurr, Joe Taylor, Troy Luhman, Eric Luhman, Clarence Ng, Ricky Wang, and Aditya Ramesh. Video generation models as world simulators. 2024

  17. [18]

    Genie: Generative interactive environments

    Jake Bruce, Michael Dennis, Ashley Edwards, Jack Parker-Holder, Yuge Shi, Edward Hughes, Matthew Lai, Aditi Mavalankar, Richie Steigerwald, Chris Apps, et al. Genie: Generative interactive environments. arXiv preprint arXiv:2402.15391, 2024

  18. [19]

    Storybench: A multifaceted benchmark for continuous story visualization

    Emanuele Bugliarello, H Hernan Moraldo, Ruben Villegas, Mohammad Babaeizadeh, Moham- mad Taghi Saffar, Han Zhang, Dumitru Erhan, Vittorio Ferrari, Pieter-Jan Kindermans, and Paul V oigtlaender. Storybench: A multifaceted benchmark for continuous story visualization. Advances in Neural Information Processing Systems, 36, 2024

  19. [20]

    cerspense/zeroscope_v2_576w · Hugging Face — huggingface.co

    cerspense. cerspense/zeroscope_v2_576w · Hugging Face — huggingface.co. https:// huggingface.co/cerspense/zeroscope_v2_576w, 2023

  20. [21]

    Videocrafter2: Overcoming data limitations for high-quality video diffusion models

    Haoxin Chen, Yong Zhang, Xiaodong Cun, Menghan Xia, Xintao Wang, Chao Weng, and Ying Shan. Videocrafter2: Overcoming data limitations for high-quality video diffusion models. arXiv preprint arXiv:2401.09047, 2024

  21. [22]

    Physical simulation of environmentally induced thin shell deformation

    Hsiao-Yu Chen, Arnav Sastry, Wim M van Rees, and Etienne V ouga. Physical simulation of environmentally induced thin shell deformation. ACM Transactions on Graphics (TOG), 37(4):1–13, 2018

  22. [23]

    Panda-70m: Captioning 70m videos with multiple cross-modality teachers

    Tsai-Shien Chen, Aliaksandr Siarohin, Willi Menapace, Ekaterina Deyneka, Hsiang-wei Chao, Byung Eun Jeon, Yuwei Fang, Hsin-Ying Lee, Jian Ren, Ming-Hsuan Yang, et al. Panda-70m: Captioning 70m videos with multiple cross-modality teachers. arXiv preprint arXiv:2402.19479, 2024

  23. [24]

    Multi-layer thick shells

    Yunuo Chen, Tianyi Xie, Cem Yuksel, Danny Kaufman, Yin Yang, Chenfanfu Jiang, and Minchen Li. Multi-layer thick shells. In ACM SIGGRAPH 2023 Conference Proceedings, pages 1–9, 2023

  24. [25]

    Learning universal policies via text-guided video generation

    Yilun Du, Sherry Yang, Bo Dai, Hanjun Dai, Ofir Nachum, Josh Tenenbaum, Dale Schuurmans, and Pieter Abbeel. Learning universal policies via text-guided video generation. Advances in Neural Information Processing Systems, 36, 2024

  25. [26]

    A survey on machine learning approaches for modelling intuitive physics

    Jiafei Duan, Arijit Dasgupta, Jason Fischer, and Cheston Tan. A survey on machine learning approaches for modelling intuitive physics. arXiv preprint arXiv:2202.06481, 2022

  26. [27]

    Structure and content-guided video synthesis with diffusion models

    Patrick Esser, Johnathan Chiu, Parmida Atighehchian, Jonathan Granskog, and Anastasis Ger- manidis. Structure and content-guided video synthesis with diffusion models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 7346–7356, 2023

  27. [28]

    Scaling rectified flow transform- ers for high-resolution image synthesis

    Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling rectified flow transform- ers for high-resolution image synthesis. In Forty-first International Conference on Machine Learning, 2024

  28. [29]

    Taming transformers for high-resolution image synthesis

    Patrick Esser, Robin Rombach, and Bjorn Ommer. Taming transformers for high-resolution image synthesis. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 12873–12883, 2021

  29. [30]

    Iq-mpm: an interface quadrature material point method for non-sticky strongly two-way coupled nonlinear solids and fluids

    Yu Fang, Ziyin Qu, Minchen Li, Xinxin Zhang, Yixin Zhu, Mridul Aanjaneya, and Chenfanfu Jiang. Iq-mpm: an interface quadrature material point method for non-sticky strongly two-way coupled nonlinear solids and fluids. ACM Transactions on Graphics (TOG), 39(4):51–1, 2020

  30. [31]

    Datacomp: In search of the next generation of multimodal datasets

    Samir Yitzhak Gadre, Gabriel Ilharco, Alex Fang, Jonathan Hayase, Georgios Smyrnis, Thao Nguyen, Ryan Marten, Mitchell Wortsman, Dhruba Ghosh, Jieyu Zhang, et al. Datacomp: In search of the next generation of multimodal datasets. Advances in Neural Information Processing Systems, 36, 2024

  31. [32]

    genmo. Genmo. Create videos and images with AI. — genmo.ai. https://www.genmo.ai/. 13

  32. [33]

    Maniskill2: A unified benchmark for generalizable manipulation skills

    Jiayuan Gu, Fanbo Xiang, Xuanlin Li, Zhan Ling, Xiqiang Liu, Tongzhou Mu, Yihe Tang, Stone Tao, Xinyue Wei, Yunchao Yao, et al. Maniskill2: A unified benchmark for generalizable manipulation skills. arXiv preprint arXiv:2302.04659, 2023

  33. [34]

    A convex formulation of frictional contact between rigid and deformable bodies

    Xuchen Han, Joseph Masterjohn, and Alejandro Castro. A convex formulation of frictional contact between rigid and deformable bodies. IEEE Robotics and Automation Letters, 2023

  34. [35]

    Denoising diffusion probabilistic models.Advances in neural information processing systems, 33:6840–6851, 2020

    Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models.Advances in neural information processing systems, 33:6840–6851, 2020

  35. [36]

    CogVideo: Large-scale Pretraining for Text-to-Video Generation via Transformers

    Wenyi Hong, Ming Ding, Wendi Zheng, Xinghan Liu, and Jie Tang. Cogvideo: Large-scale pretraining for text-to-video generation via transformers. arXiv preprint arXiv:2205.15868, 2022

  36. [37]

    LoRA: Low-Rank Adaptation of Large Language Models

    Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. Lora: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685, 2021

  37. [38]

    Dreamphysics: Learning physical properties of dynamic 3d gaussians with video diffusion priors

    Tianyu Huang, Yihan Zeng, Hui Li, Wangmeng Zuo, and Rynson WH Lau. Dreamphysics: Learning physical properties of dynamic 3d gaussians with video diffusion priors. arXiv preprint arXiv:2406.01476, 2024

  38. [39]

    Plasticinelab: A soft-body manipulation benchmark with differentiable physics

    Zhiao Huang, Yuanming Hu, Tao Du, Siyuan Zhou, Hao Su, Joshua B Tenenbaum, and Chuang Gan. Plasticinelab: A soft-body manipulation benchmark with differentiable physics. arXiv preprint arXiv:2104.03311, 2021

  39. [40]

    Vbench: Comprehensive benchmark suite for video generative models

    Ziqi Huang, Yinan He, Jiashuo Yu, Fan Zhang, Chenyang Si, Yuming Jiang, Yuanhan Zhang, Tianxing Wu, Qingyang Jin, Nattapol Chanpaisit, et al. Vbench: Comprehensive benchmark suite for video generative models. arXiv preprint arXiv:2311.17982, 2023

  40. [41]

    EulerDiscreteScheduler — huggingface.co

    huggingfaceEulerDiscreteScheduler. EulerDiscreteScheduler — huggingface.co. https: //huggingface.co/docs/diffusers/en/api/schedulers/euler

  41. [42]

    Text2video-zero: Text-to-image diffusion mod- els are zero-shot video generators

    Levon Khachatryan, Andranik Movsisyan, Vahram Tadevosyan, Roberto Henschel, Zhangyang Wang, Shant Navasardyan, and Humphrey Shi. Text2video-zero: Text-to-image diffusion mod- els are zero-shot video generators. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 15954–15964, 2023

  42. [43]

    Adam: A Method for Stochastic Optimization

    Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014

  43. [44]

    Drucker-prager elastoplasticity for sand animation

    Gergely Klár, Theodore Gast, Andre Pradhana, Chuyuan Fu, Craig Schroeder, Chenfanfu Jiang, and Joseph Teran. Drucker-prager elastoplasticity for sand animation. ACM Transactions on Graphics (TOG), 35(4):1–12, 2016

  44. [45]

    KLING AI — klingai.com

    KlingAI. KLING AI — klingai.com. https://www.klingai.com/, 2024

  45. [46]

    VideoPoet: A Large Language Model for Zero-Shot Video Generation

    Dan Kondratyuk, Lijun Yu, Xiuye Gu, José Lezama, Jonathan Huang, Rachel Hornung, Hartwig Adam, Hassan Akbari, Yair Alon, Vighnesh Birodkar, et al. Videopoet: A large language model for zero-shot video generation. arXiv preprint arXiv:2312.14125, 2023

  46. [47]

    Smoothed particle hydrodynamics techniques for the physics based simulation of fluids and solids

    Dan Koschier, Jan Bender, Barbara Solenthaler, and Matthias Teschner. Smoothed particle hydrodynamics techniques for the physics based simulation of fluids and solids. arXiv preprint arXiv:2009.06944, 2020

  47. [48]

    Subjective-aligned dateset and metric for text-to-video quality assessment

    Tengchuan Kou, Xiaohong Liu, Zicheng Zhang, Chunyi Li, Haoning Wu, Xiongkuo Min, Guangtao Zhai, and Ning Liu. Subjective-aligned dateset and metric for text-to-video quality assessment. arXiv preprint arXiv:2403.11956, 2024

  48. [49]

    Viescore: Towards explainable metrics for conditional image synthesis evaluation

    Max Ku, Dongfu Jiang, Cong Wei, Xiang Yue, and Wenhu Chen. Viescore: Towards explainable metrics for conditional image synthesis evaluation. arXiv preprint arXiv:2312.14867, 2023

  49. [50]

    LAION-AI/aesthetic-predictor: A linear estimator on top of CLIP to predict the aesthetic quality of pictures

    LaionAI. LAION-AI/aesthetic-predictor: A linear estimator on top of CLIP to predict the aesthetic quality of pictures. https://github.com/LAION-AI/aesthetic-predictor, 2022

  50. [51]

    Variational stokes: a unified pressure-viscosity solver for accurate viscous liquids

    Egor Larionov, Christopher Batty, and Robert Bridson. Variational stokes: a unified pressure-viscosity solver for accurate viscous liquids. ACM Transactions on Graphics (TOG), 36(4):1–11, 2017

  51. [52]

    Aligning Text-to-Image Models using Human Feedback

    Kimin Lee, Hao Liu, Moonkyung Ryu, Olivia Watkins, Yuqing Du, Craig Boutilier, Pieter Abbeel, Mohammad Ghavamzadeh, and Shixiang Shane Gu. Aligning text-to-image models using human feedback. arXiv preprint arXiv:2302.12192, 2023

  52. [53]

    User experience rating scales with 7, 11, or 101 points: does it matter?

    James R Lewis and Oğuzhan Erdinç. User experience rating scales with 7, 11, or 101 points: does it matter? Journal of Usability Studies, 12(2), 2017

  53. [54]

    Incremental potential contact: intersection- and inversion-free, large-deformation dynamics

    Minchen Li, Zachary Ferguson, Teseo Schneider, Timothy R Langlois, Denis Zorin, Daniele Panozzo, Chenfanfu Jiang, and Danny M Kaufman. Incremental potential contact: intersection- and inversion-free, large-deformation dynamics. ACM Trans. Graph., 39(4):49, 2020

  54. [55]

    Codimensional incremental potential contact

    Minchen Li, Danny M Kaufman, and Chenfanfu Jiang. Codimensional incremental potential contact. arXiv preprint arXiv:2012.04457, 2020

  55. [56]

    Aligning diffusion models by optimizing human utility

    Shufan Li, Konstantinos Kallidromitis, Akash Gokul, Yusuke Kato, and Kazuki Kozuka. Aligning diffusion models by optimizing human utility. arXiv preprint arXiv:2404.04465, 2024

  56. [57]

    Energetically consistent inelasticity for optimization time integration

    Xuan Li, Minchen Li, and Chenfanfu Jiang. Energetically consistent inelasticity for optimization time integration. ACM Transactions on Graphics (TOG), 41(4):1–16, 2022

  57. [58]

    Gpu-accelerated robotic simulation for distributed reinforcement learning

    Jacky Liang, Viktor Makoviychuk, Ankur Handa, Nuttapong Chentanez, Miles Macklin, and Dieter Fox. Gpu-accelerated robotic simulation for distributed reinforcement learning. In Conference on Robot Learning, pages 270–282. PMLR, 2018

  58. [59]

    Evaluating text-to-visual generation with image-to-text generation

    Zhiqiu Lin, Deepak Pathak, Baiqi Li, Jiayao Li, Xide Xia, Graham Neubig, Pengchuan Zhang, and Deva Ramanan. Evaluating text-to-visual generation with image-to-text generation. arXiv preprint arXiv:2404.01291, 2024

  59. [60]

    Physics3d: Learning physical properties of 3d gaussians via video diffusion

    Fangfu Liu, Hanyang Wang, Shunyu Yao, Shengjun Zhang, Jie Zhou, and Yueqi Duan. Physics3d: Learning physical properties of 3d gaussians via video diffusion. arXiv preprint arXiv:2406.04338, 2024

  60. [61]

    Visual instruction tuning

    Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. Advances in neural information processing systems, 36, 2024

  61. [62]

    Physgen: Rigid-body physics-grounded image-to-video generation

    Shaowei Liu, Zhongzheng Ren, Saurabh Gupta, and Shenlong Wang. Physgen: Rigid-body physics-grounded image-to-video generation

  62. [63]

    Evalcrafter: Benchmarking and evaluating large video generation models

    Yaofang Liu, Xiaodong Cun, Xuebo Liu, Xintao Wang, Yong Zhang, Haoxin Chen, Yang Liu, Tieyong Zeng, Raymond Chan, and Ying Shan. Evalcrafter: Benchmarking and evaluating large video generation models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22139–22149, 2024

  63. [64]

    Sora: A Review on Background, Technology, Limitations, and Opportunities of Large Vision Models

    Yixin Liu, Kai Zhang, Yuan Li, Zhiling Yan, Chujie Gao, Ruoxi Chen, Zhengqing Yuan, Yue Huang, Hanchi Sun, Jianfeng Gao, et al. Sora: A review on background, technology, limitations, and opportunities of large vision models. arXiv preprint arXiv:2402.17177, 2024

  64. [65]

    Effect of the number of response categories on the reliability and validity of rating scales

    Luis M Lozano, Eduardo García-Cueto, and José Muñiz. Effect of the number of response categories on the reliability and validity of rating scales. Methodology, 4(2):73–79, 2008

  65. [66]

    DPM-Solver++: Fast Solver for Guided Sampling of Diffusion Probabilistic Models

    Cheng Lu, Yuhao Zhou, Fan Bao, Jianfei Chen, Chongxuan Li, and Jun Zhu. DPM-Solver++: Fast solver for guided sampling of diffusion probabilistic models. arXiv preprint arXiv:2211.01095, 2022

  66. [67]

    Physically-aware generative network for 3d shape modeling

    Mariem Mezghanni, Malika Boulkenafed, Andre Lieutier, and Maks Ovsjanikov. Physically-aware generative network for 3d shape modeling. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9330–9341, 2021

  67. [68]

    mplug-owl-video

    mplugowl. mplug-owl-video. https://github.com/X-PLUG/mPLUG-Owl/tree/main/mPLUG-Owl/mplug_owl_video

  68. [69]

    Particle-based fluid-fluid interaction

    Matthias Müller, Barbara Solenthaler, Richard Keiser, and Markus Gross. Particle-based fluid-fluid interaction. In Proceedings of the 2005 ACM SIGGRAPH/Eurographics symposium on Computer animation, pages 237–244, 2005

  69. [70]

    Phyrecon: Physically plausible neural scene reconstruction

    Junfeng Ni, Yixin Chen, Bohan Jing, Nan Jiang, Bin Wang, Bo Dai, Yixin Zhu, Song-Chun Zhu, and Siyuan Huang. Phyrecon: Physically plausible neural scene reconstruction. arXiv preprint arXiv:2404.16666, 2024

  70. [71]

    Improved denoising diffusion probabilistic models

    Alexander Quinn Nichol and Prafulla Dhariwal. Improved denoising diffusion probabilistic models. In International conference on machine learning, pages 8162–8171. PMLR, 2021

  71. [72]

    Graphical modeling and animation of ductile fracture

    James F O’Brien, Adam W Bargteil, and Jessica K Hodgins. Graphical modeling and animation of ductile fracture. In Proceedings of the 29th annual conference on Computer graphics and interactive techniques, pages 291–294, 2002

  72. [73]

    GPT-4 Technical Report

    OpenAI. GPT-4 technical report. arXiv preprint arXiv:2303.08774, 2023

  73. [74]

    GPT-4V(ision) System Card

    OpenAI. GPT-4V(ision) system card. https://openai.com/research/gpt-4v-system-card, 2023

  74. [75]

    Open-Sora: Democratizing Efficient Video Production for All

    OpenSora. Open-Sora: Democratizing efficient video production for all. https://github.com/hpcaitech/Open-Sora, 2024

  75. [76]

    Vibe-eval: A hard evaluation suite for measuring progress of multimodal language models

    Piotr Padlewski, Max Bain, Matthew Henderson, Zhongkai Zhu, Nishant Relan, Hai Pham, Donovan Ong, Kaloyan Aleksiev, Aitor Ormazabal, Samuel Phua, et al. Vibe-eval: A hard evaluation suite for measuring progress of multimodal language models. arXiv preprint arXiv:2405.02287, 2024

  76. [77]

    Scalable diffusion models with transformers

    William Peebles and Saining Xie. Scalable diffusion models with transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4195–4205, 2023

  77. [78]

    Pika — pika.art

    pika. Pika — pika.art. https://pika.art/

  78. [79]

    Intuitive physics learning in a deep-learning model inspired by developmental psychology

    Luis S Piloto, Ari Weinstein, Peter Battaglia, and Matthew Botvinick. Intuitive physics learning in a deep-learning model inspired by developmental psychology. Nature human behaviour, 6(9):1257–1267, 2022

  79. [80]

    SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis

    Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and Robin Rombach. Sdxl: Improving latent diffusion models for high-resolution image synthesis. arXiv preprint arXiv:2307.01952, 2023

  80. [81]

    Power plastics: A hybrid lagrangian/eulerian solver for mesoscale inelastic flows

    Ziyin Qu, Minchen Li, Yin Yang, Chenfanfu Jiang, and Fernando De Goes. Power plastics: A hybrid lagrangian/eulerian solver for mesoscale inelastic flows. ACM Transactions on Graphics (TOG), 42(6):1–11, 2023
