VideoPhy benchmark shows state-of-the-art text-to-video models follow physical commonsense and text prompts in only 39.6% of cases for the best model.
Advancing high-resolution video-language representation with large-scale video transcriptions
3 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
fields
cs.CV 3roles
background 2representative citing papers
Inferix provides an optimized inference engine for semi-autoregressive block-diffusion decoding to support high-quality, variable-length video generation in world simulation applications.
ModelScopeT2V is a 1.7-billion-parameter text-to-video model built on Stable Diffusion that adds temporal modeling and outperforms prior methods on three evaluation metrics.
citing papers explorer
-
VideoPhy: Evaluating Physical Commonsense for Video Generation
VideoPhy benchmark shows state-of-the-art text-to-video models follow physical commonsense and text prompts in only 39.6% of cases for the best model.
-
Inferix: A Block-Diffusion based Next-Generation Inference Engine for World Simulation
Inferix provides an optimized inference engine for semi-autoregressive block-diffusion decoding to support high-quality, variable-length video generation in world simulation applications.
-
ModelScope Text-to-Video Technical Report
ModelScopeT2V is a 1.7-billion-parameter text-to-video model built on Stable Diffusion that adds temporal modeling and outperforms prior methods on three evaluation metrics.