Grasp: A novel benchmark for evaluating language grounding and situated physics understanding in multimodal language models. arXiv preprint arXiv:2311.09048, 2023.
5 Pith papers cite this work. Polarity classification is still indexing.
verdicts: UNVERDICTED 5
roles: background 1
polarities: background 1
representative citing papers
LiveK12Bench is a growing multi-disciplinary benchmark showing LMMs like GPT-5 drop from 79 to 53 under realistic exam constraints including process rigor and efficiency.
VisPhyWorld evaluates MLLMs' physical reasoning via executable code generation for video reconstruction, with VisPhyBench showing strong semantics but weak parameter inference and dynamics simulation.
Generative video models exhibit emergent zero-shot capabilities across perception, manipulation, and basic reasoning tasks.
PhyGenBench supplies 160 prompts across 27 physical laws and an automated LLM/VLM evaluation pipeline to measure physical commonsense compliance in current text-to-video models.
citing papers explorer
- Do generative video models understand physical principles?
  Physics-IQ benchmark reveals that generative video models exhibit limited physical understanding unrelated to their visual quality.
- LiveK12Bench: Have Large Multimodal Models Truly Conquered High School-level Examinations?
  LiveK12Bench is a growing multi-disciplinary benchmark showing LMMs like GPT-5 drop from 79 to 53 under realistic exam constraints including process rigor and efficiency.
- VisPhyWorld: Probing Physical Reasoning via Code-Driven Video Reconstruction
  VisPhyWorld evaluates MLLMs' physical reasoning via executable code generation for video reconstruction, with VisPhyBench showing strong semantics but weak parameter inference and dynamics simulation.
- Video models are zero-shot learners and reasoners
  Generative video models exhibit emergent zero-shot capabilities across perception, manipulation, and basic reasoning tasks.
- Towards World Simulator: Crafting Physical Commonsense-Based Benchmark for Video Generation
  PhyGenBench supplies 160 prompts across 27 physical laws and an automated LLM/VLM evaluation pipeline to measure physical commonsense compliance in current text-to-video models.