pith. machine review for the scientific record.

arxiv: 2605.10434 · v1 · submitted 2026-05-11 · 💻 cs.CV

Recognition: 2 theorem links · Lean Theorem

WorldReasonBench: Human-Aligned Stress Testing of Video Generators as Future World-State Predictors

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 03:34 UTC · model grok-4.3

classification 💻 cs.CV
keywords: video generation · world simulation · benchmark · causality · temporal consistency · reasoning evaluation · human preference · future prediction

The pith

Video generators produce clips that look realistic but routinely violate physical dynamics, causality, and object permanence.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents WorldReasonBench as a way to test video generators on their ability to predict consistent future states from an initial scene and action. It supplies 436 test cases annotated with questions across dynamics, causality, social logic, and information preservation. A two-part human evaluation then checks whether generated videos maintain those properties over time. Sympathetic readers would care because the results show current models still treat generation as image synthesis rather than simulation, limiting their use for planning or forecasting. The work also releases a companion preference dataset to train models that close this gap.

Core claim

WorldReasonBench reframes video generation evaluation as world-state prediction and demonstrates that models such as Seedance2.0 and Veo3.1 can generate visually convincing future videos while failing to preserve physical consistency, causal relations, or information about objects across frames.

What carries the argument

The 436 curated test cases with structured QA annotations across four reasoning dimensions, evaluated via Process-aware Reasoning Verification, which diagnoses temporal and causal failures, and Multi-dimensional Quality Assessment, which scores reasoning quality separately from aesthetics.
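
To make the verification half concrete, here is a minimal sketch, assuming a simple QA schema and an `answer_fn` stand-in for whatever VLM or annotator produces the answers; the dimension names, field layout, and flow are illustrative assumptions, not the authors' released toolkit.

```python
# Minimal sketch of Process-aware Reasoning Verification as described above.
# The QA schema and `answer_fn` are illustrative placeholders.
from dataclasses import dataclass
from collections import defaultdict
from typing import Callable

@dataclass
class QAItem:
    dimension: str      # e.g. "dynamics", "causality", "social", "information"
    question: str       # structured question about the generated video
    ground_truth: str   # expected answer given a consistent world-state rollout

def verify_reasoning(video_path: str,
                     qa_items: list[QAItem],
                     answer_fn: Callable[[str, str], str]) -> dict[str, float]:
    """Answer each QA item from the generated video and report per-dimension
    accuracy. `answer_fn(video, question)` stands in for whatever VLM or
    human annotator produces the answers."""
    correct: dict[str, int] = defaultdict(int)
    total: dict[str, int] = defaultdict(int)
    for qa in qa_items:
        pred = answer_fn(video_path, qa.question).strip().lower()
        total[qa.dimension] += 1
        if pred == qa.ground_truth.strip().lower():
            correct[qa.dimension] += 1
    return {dim: correct[dim] / total[dim] for dim in total}

# Toy usage: a stub answerer that always says "yes".
items = [QAItem("causality", "Does the glass break after it hits the floor?", "yes"),
         QAItem("information", "Is the red ball still visible at the end?", "no")]
print(verify_reasoning("sample.mp4", lambda v, q: "yes", items))
# {'causality': 1.0, 'information': 0.0}
```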

If this is right

  • Training objectives must explicitly penalize violations of dynamics and causality rather than optimizing only for visual fidelity.
  • Generated videos can serve as a diagnostic for whether a system understands object permanence and action consequences.
  • Reward models trained on the released preference pairs can guide generation toward more consistent future states; a training sketch follows this list.
  • Persistent failures in information preservation indicate limits on using these systems for long-horizon prediction tasks.
  • The benchmark separates visual appeal from reasoning quality, allowing targeted improvements in each.
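
The reward-model bullet above invites a concrete picture. Below is a minimal sketch of fitting a point-wise scorer from pair-wise preferences with a Bradley-Terry objective, one plausible use of the ~6K released pairs; the embedding dimension, architecture, and data layout are placeholder assumptions, not the paper's training recipe.

```python
# Hedged sketch: training a point-wise reward model from pair-wise
# preferences with a Bradley-Terry objective. Video embeddings are
# stand-ins for whatever features a real pipeline would extract.
import torch
import torch.nn as nn

class RewardHead(nn.Module):
    """Maps a fixed-size video embedding to a scalar consistency score."""
    def __init__(self, dim: int = 512):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(dim, 128), nn.ReLU(), nn.Linear(128, 1))

    def forward(self, emb: torch.Tensor) -> torch.Tensor:
        return self.mlp(emb).squeeze(-1)

def bradley_terry_loss(score_win: torch.Tensor, score_lose: torch.Tensor) -> torch.Tensor:
    # P(win preferred over lose) = sigmoid(s_w - s_l); minimize the NLL.
    return -torch.nn.functional.logsigmoid(score_win - score_lose).mean()

# Toy training step on random embeddings standing in for video features.
model = RewardHead()
opt = torch.optim.Adam(model.parameters(), lr=1e-4)
emb_win, emb_lose = torch.randn(8, 512), torch.randn(8, 512)  # preferred / rejected
loss = bradley_terry_loss(model(emb_win), model(emb_lose))
opt.zero_grad()
loss.backward()
opt.step()
```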

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the gap holds, video generators may need hybrid architectures that combine rendering with explicit world models rather than pure diffusion or autoregressive approaches.
  • The same stress-testing logic could apply to other generative domains such as 3D scene synthesis or interactive environments.
  • High failure rates on causality questions suggest that scaling data alone will not suffice without new supervision signals for temporal logic.
  • The benchmark's release enables direct comparison of future models on the same held-out cases, reducing reliance on subjective visual inspection.

Load-bearing premise

The selected test cases and the two-part human scoring process accurately reflect genuine world-state prediction ability without systematic bias or overlooked failure modes.

What would settle it

A model that scores high on both reasoning verification and quality assessment yet produces videos that break basic physics or lose track of objects when tested on new, held-out real-world scenarios.

Figures

Figures reproduced from arXiv: 2605.10434 by Bin Wang, Haowei Zhu, Keming Wu, Ping Nie, Qijie Wang, Sicong Jiang, Sudong Wang, Wenhan Xue, Wenhu Chen, Xuan Luo, Yijing Cui, Zhiyuan Feng, Zihan Wang, Zuhao Yang.

Figure 1
Figure 1: Overview of WorldReasonBench. We evaluate video generators as world-state predictors: given an initial visual state and an action or instruction, the model must generate a future video whose state evolution remains physically, socially, logically, and informationally consistent. WorldReasonBench spans four reasoning dimensions organized into 22 concise, dimension-specific subcategories, and is paired with… view at source ↗
Figure 2
Figure 2: Benchmark construction pipeline. A: WorldReasonBench construction, including taxonomy-aware captioning, prompt generation, and QA generation. B: WorldRewardBench construction, including video sampling, expert scoring, preference-pair construction, and human-alignment evaluation. view at source ↗
Figure 3
Figure 3: Evaluation pipeline. A: Process-aware Reasoning Verification, which answers structured QA pairs from generated videos and converts them into reasoning-phase diagnostics. B: Multi-dimensional Quality Assessment, which scores each video along reasoning quality, temporal consistency, and visual aesthetics for ranking and reward-model evaluation. view at source ↗
Figure 4
Figure 4: Qualitative comparison on representative reasoning cases. Visually plausible generations can still fail process-level world reasoning, while higher-scoring models better preserve the intended state transition and temporal dynamics. view at source ↗
Figure 5
Figure 5: Human annotation interface for WorldRewardBench. Annotators see the input image, … view at source ↗
Original abstract

Commercial video generation systems such as Seedance2.0 and Veo3.1 have rapidly improved, strengthening the view that video generators may be evolving into "world simulators." Yet the community still lacks a benchmark that directly tests whether a model can reason about how an observed world should evolve over time. We introduce WorldReasonBench, which reframes video generation evaluation as world-state prediction: given an initial state and an action, can a model generate a future video whose state evolution remains physically, socially, logically, and informationally consistent? WorldReasonBench contains 436 curated test cases with structured ground-truth QA annotations spanning four reasoning dimensions and 22 subcategories. We evaluate generated videos with a human-aligned two-part methodology: Process-aware Reasoning Verification uses structured QA and reasoning-phase diagnostics to detect temporal and causal failures, while Multi-dimensional Quality Assessment scores reasoning quality, temporal consistency, and visual aesthetics for ranking and reward modeling. We further introduce WorldRewardBench, a preference benchmark with approximately 6K expert-annotated pairs over 1.4K videos, supporting pair-wise and point-wise reward-model evaluation. Across modern video generators, our results expose a persistent gap between visual plausibility and world reasoning: videos can look convincing while failing dynamics, causality, or information preservation. We will release our benchmarks and evaluation toolkit to support community research on genuinely world-aware video generation at https://github.com/UniX-AI-Lab/WorldReasonBench/.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated author's rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces WorldReasonBench, a benchmark with 436 author-curated test cases and structured ground-truth QA annotations spanning four reasoning dimensions and 22 subcategories, to evaluate video generators as world-state predictors given an initial state and action. It proposes a two-part human-aligned evaluation: Process-aware Reasoning Verification (using structured QA and phase diagnostics for temporal/causal failures) and Multi-dimensional Quality Assessment (scoring reasoning quality, temporal consistency, and aesthetics). The work also releases WorldRewardBench (~6K expert preference pairs over 1.4K videos) and reports a persistent gap between visual plausibility and failures in dynamics, causality, or information preservation across models like Seedance2.0 and Veo3.1. The benchmarks and toolkit are to be open-sourced.

Significance. If the benchmark validity holds, the work is significant for shifting video generation evaluation from visual fidelity toward verifiable world reasoning, which is increasingly relevant as models are positioned as simulators. Explicit strengths include the release of the full benchmark, evaluation toolkit, and preference data, which enables reproducibility, community extensions, and falsifiable testing of future claims. The structured dimensions and subcategories provide a concrete, extensible framework that could support reward modeling and model improvement beyond current visual metrics.

major comments (2)
  1. [Section 3] Benchmark construction: The central claim of a persistent gap between visual quality and world reasoning rests on the 436 test cases accurately isolating reasoning failures. The manuscript provides no quantitative coverage analysis, diversity metrics, or inter-curator agreement for the author-curated cases and QA annotations; without these, selection bias toward easily detectable failure modes cannot be ruled out, which directly undermines the robustness of the reported gap.
  2. [Section 4] Evaluation methodology: The Process-aware Reasoning Verification and Multi-dimensional Quality Assessment rely on human judgments to detect dynamics/causality/information failures. No inter-annotator agreement statistics, correlation with automated proxies, or external validation of the two-part protocol are reported; this is load-bearing because unquantified subjectivity in the reasoning-phase diagnostics could artifactually produce the claimed separation from visual plausibility.
minor comments (2)
  1. [Abstract and Section 5] Clarify the exact model versions evaluated (e.g., Seedance2.0, Veo3.1) and ensure they are consistently referenced with citations or links in the experimental setup.
  2. The GitHub release promise is welcome; adding a permanent archive link (e.g., Zenodo DOI) would strengthen long-term reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on benchmark construction and evaluation methodology. We address each major comment below and will revise the manuscript accordingly to strengthen the claims.

Point-by-point responses
  1. Referee: [Section 3] Benchmark construction: The central claim of a persistent gap between visual quality and world reasoning rests on the 436 test cases accurately isolating reasoning failures. The manuscript provides no quantitative coverage analysis, diversity metrics, or inter-curator agreement for the author-curated cases and QA annotations; without these, selection bias toward easily detectable failure modes cannot be ruled out, which directly undermines the robustness of the reported gap.

    Authors: We agree that quantitative metrics would strengthen the evidence against selection bias. In the revised manuscript, we will add: (1) a coverage analysis giving the distribution of the 436 test cases across the four reasoning dimensions and 22 subcategories; (2) diversity metrics covering scenario variety and action types; and (3) inter-curator agreement statistics for the QA annotations, computed via Fleiss' kappa on a double-annotated subset. These will demonstrate representativeness while preserving the expert curation process described in Section 3. (revision: yes)

  2. Referee: [Section 4] Evaluation methodology: The Process-aware Reasoning Verification and Multi-dimensional Quality Assessment rely on human judgments to detect dynamics/causality/information failures. No inter-annotator agreement statistics, correlation with automated proxies, or external validation of the two-part protocol are reported; this is load-bearing because unquantified subjectivity in the reasoning-phase diagnostics could artifactually produce the claimed separation from visual plausibility.

    Authors: We concur that reliability metrics are essential. The revised manuscript will report inter-annotator agreement (e.g., Cohen's kappa) for both Process-aware Reasoning Verification and Multi-dimensional Quality Assessment. We will also include correlations with automated proxies such as video consistency models and expand the discussion of protocol validation steps. The structured QA annotations and phase diagnostics were designed to reduce subjectivity, but these additions will further substantiate the separation between visual plausibility and reasoning failures. (revision: yes)
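
As a concrete anchor for the promised reliability statistics, here is a minimal sketch of Cohen's kappa for two raters over categorical verdicts; the pass/fail labels are illustrative, not the benchmark's actual annotation scheme.

```python
# Minimal sketch: Cohen's kappa for two raters over categorical labels.
from collections import Counter

def cohens_kappa(rater_a: list, rater_b: list) -> float:
    assert len(rater_a) == len(rater_b)
    n = len(rater_a)
    # Observed agreement: fraction of items where the raters match.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected agreement under independent marginal label distributions.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    return (observed - expected) / (1 - expected)

# Toy verdicts on six videos ("pass"/"fail" on a causality check).
a = ["pass", "fail", "pass", "pass", "fail", "pass"]
b = ["pass", "fail", "fail", "pass", "fail", "pass"]
print(round(cohens_kappa(a, b), 3))  # 0.667
```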

Circularity Check

0 steps flagged

No circularity: empirical benchmark creation without reductive derivations

Full rationale

The paper introduces WorldReasonBench (436 curated cases with QA annotations) and WorldRewardBench (~6K pairs) plus a two-part human evaluation protocol, then reports empirical gaps on existing video generators. No equations, first-principles derivations, fitted parameters, or predictions are claimed that could reduce to the inputs by construction. The contribution is the benchmark and its application; findings are self-contained empirical observations rather than any self-referential chain.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper rests on the assumption that world-state consistency can be decomposed into four measurable dimensions and assessed via curated QA pairs and human preference data; no free parameters or invented physical entities are introduced.

axioms (1)
  • Domain assumption: Structured QA annotations on physical, social, logical, and informational consistency can serve as reliable ground truth for evaluating world-state prediction.
    Invoked when defining the benchmark's test cases and evaluation methodology.

pith-pipeline@v0.9.0 · 5609 in / 1251 out tokens · 38929 ms · 2026-05-12T03:34:53.254430+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

31 extracted references · 31 canonical work pages · 8 internal anchors

  1. [1]

    Video generation models as world simulators

    Tim Brooks, Bill Peebles, Connor Holmes, Will DePue, Yufei Guo, Li Jing, David Schnurr, Joe Taylor, Troy Luhman, Eric Luhman, et al. Video generation models as world simulators. 2024. URL https://openai.com/research/video-generation-models-as-world-simulators

  2. [2]

    On extending the Bradley-Terry model to accommodate ties in paired comparison experiments

    Roger R Davidson. On extending the Bradley-Terry model to accommodate ties in paired comparison experiments. Journal of the American Statistical Association, 65(329):317–328, 1970

  3. [3]

    Seedance 1.0: Exploring the Boundaries of Video Generation Models

    Yu Gao, Haoyuan Guo, Tuyen Hoang, Weilin Huang, Lu Jiang, Fangyuan Kong, Huixia Li, Jiashi Li, Liang Li, Xiaojie Li, et al. Seedance 1.0: Exploring the boundaries of video generation models. arXiv preprint arXiv:2506.09113, 2025

  4. [4]

    VideoScore2: Think before you score in generative video evaluation

    Xuan He, Dongfu Jiang, Ping Nie, Minghao Liu, Zhengxuan Jiang, Mingyi Su, Wentao Ma, Junru Lin, Chun Ye, Yi Lu, et al. VideoScore2: Think before you score in generative video evaluation. arXiv preprint arXiv:2509.22799, 2025

  5. [5]

    Ruler-Bench: Probing rule-based reasoning abilities of next-level video generation models for vision foundation intelligence

    Xuming He, Zehao Fan, Hengjia Li, Fan Zhuo, Hankun Xu, Senlin Cheng, Di Weng, Haifeng Liu, Can Ye, and Boxi Wu. Ruler-Bench: Probing rule-based reasoning abilities of next-level video generation models for vision foundation intelligence. arXiv preprint arXiv:2512.02622, 2025

  6. [6]

    GANs trained by a two time-scale update rule converge to a local Nash equilibrium

    Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. Advances in Neural Information Processing Systems, 30, 2017

  7. [7]

    VBench: Comprehensive benchmark suite for video generative models

    Ziqi Huang, Yinan He, Jiashuo Yu, Fan Zhang, Chenyang Si, Yuming Jiang, Yuanhan Zhang, Tianxing Wu, Qingyang Jin, Nattapol Chanpaisit, et al. VBench: Comprehensive benchmark suite for video generative models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 21807–21818, 2024

  8. [8]

    How far is video generation from world model: A physical law perspective

    Bingyi Kang, Yang Yue, Rui Lu, Zhijie Lin, Yang Zhao, Kaixin Wang, Gao Huang, and Jiashi Feng. How far is video generation from world model: A physical law perspective. arXiv preprint arXiv:2411.02385, 2024

  9. [9]

    HunyuanVideo: A Systematic Framework For Large Video Generative Models

    Weijie Kong, Qi Tian, Zijian Zhang, Rox Min, Zuozhuo Dai, Jin Zhou, Jiangfeng Xiong, Xin Li, Bo Wu, Jianwei Zhang, et al. HunyuanVideo: A systematic framework for large video generative models. arXiv preprint arXiv:2412.03603, 2024

  10. [10]

    Beyond the last frame: Process-aware evaluation for generative video reasoning

    Yifan Li, Yukai Gu, Yingqian Min, Zikang Liu, Yifan Du, Kun Zhou, Min Yang, Wayne Xin Zhao, and Minghui Qiu. Viper: Process-aware evaluation for generative video reasoning. arXiv preprint arXiv:2512.24952, 2025

  11. [11]

    Can world simulators reason? Gen-ViRe: A generative visual reasoning benchmark

    Xinxin Liu, Zhaopan Xu, Ming Li, Kai Wang, Yong Jae Lee, and Yuzhang Shang. Can world simulators reason? Gen-ViRe: A generative visual reasoning benchmark. arXiv preprint arXiv:2511.13853, 2025

  12. [12]

    EvalCrafter: Benchmarking and evaluating large video generation models

    Yaofang Liu, Xiaodong Cun, Xuebo Liu, Xintao Wang, Yong Zhang, Haoxin Chen, Yang Liu, Tieyong Zeng, Raymond Chan, and Ying Shan. EvalCrafter: Benchmarking and evaluating large video generation models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22139–22149, 2024

  13. [13]

    FETV: A benchmark for fine-grained evaluation of open-domain text-to-video generation

    Yuanxin Liu, Lei Li, Shuhuai Ren, Rundong Gao, Shicheng Li, Sishuo Chen, Xu Sun, and Lu Hou. FETV: A benchmark for fine-grained evaluation of open-domain text-to-video generation. Advances in Neural Information Processing Systems, 36:62352–62387, 2023

  14. [14]

    V-ReasonBench: Toward unified reasoning benchmark suite for video generation models

    Yang Luo, Xuanlei Zhao, Baijiong Lin, Lingting Zhu, Liyao Tang, Yuqi Liu, Ying-Cong Chen, Shengju Qian, Xin Wang, and Yang You. V-ReasonBench: Toward unified reasoning benchmark suite for video generation models. arXiv preprint arXiv:2511.16668, 2025

  15. [15]

    VideoEval-Pro: Robust and realistic long video understanding evaluation

    Wentao Ma, Weiming Ren, Yiming Jia, Zhuofeng Li, Ping Nie, Ge Zhang, and Wenhu Chen. VideoEval-Pro: Robust and realistic long video understanding evaluation. arXiv preprint arXiv:2505.14640, 2025

  16. [16]

    Towards world simulator: Crafting physical commonsense-based benchmark for video generation

    Fanqing Meng, Jiaqi Liao, Xinyu Tan, Wenqi Shao, Quanfeng Lu, Kaipeng Zhang, Yu Cheng, Dianqi Li, Yu Qiao, and Ping Luo. Towards world simulator: Crafting physical commonsense-based benchmark for video generation. arXiv preprint arXiv:2410.05363, 2024

  17. [17]

    Movie Gen: A Cast of Media Foundation Models

    Adam Polyak, Amit Zohar, Andrew Brown, Andros Tjandra, Animesh Sinha, Ann Lee, Apoorv Vyas, Bowen Shi, Chih-Yao Ma, Ching-Yao Chuang, et al. Movie Gen: A cast of media foundation models. arXiv preprint arXiv:2410.13720, 2024

  18. [18]

    WorldSimBench: Towards video generation models as world simulators

    Yiran Qin, Zhelun Shi, Jiwen Yu, Xijun Wang, Enshen Zhou, Lijun Li, Zhenfei Yin, Xihui Liu, Lu Sheng, Jing Shao, et al. WorldSimBench: Towards video generation models as world simulators. arXiv preprint arXiv:2410.18072, 2024

  19. [19]

    T2V-CompBench: A comprehensive benchmark for compositional text-to-video generation

    Kaiyue Sun, Kaiyi Huang, Xian Liu, Yue Wu, Zihan Xu, Zhenguo Li, and Xihui Liu. T2V-CompBench: A comprehensive benchmark for compositional text-to-video generation. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 8406–8416, 2025

  20. [20]

    Qwen3.5: Accelerating productivity with native multimodal agents

    Qwen Team. Qwen3.5: Accelerating productivity with native multimodal agents, 2026

  21. [21]

    Qwen3.5: Towards native multimodal agents

    Qwen Team. Qwen3.5: Towards native multimodal agents. URL: https://qwen.ai/blog, 2026

  22. [22]

    FVD: A new metric for video generation

    Thomas Unterthiner, Sjoerd Van Steenkiste, Karol Kurach, Raphaël Marinier, Marcin Michalski, and Sylvain Gelly. FVD: A new metric for video generation. 2019

  23. [23]

    Wan: Open and Advanced Large-Scale Video Generative Models

    Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, et al. Wan: Open and advanced large-scale video generative models. arXiv preprint arXiv:2503.20314, 2025

  24. [24]

    A very big video reasoning suite

    Maijunxian Wang, Ruisi Wang, Juyi Lin, Ran Ji, Thaddäus Wiedemer, Qingying Gao, Dezhi Luo, Yaoyao Qian, Lianyu Huang, Zelong Hong, et al. A very big video reasoning suite. arXiv preprint arXiv:2602.20159, 2026

  25. [25]

    VideoVerse: How far is your T2V generator from a world model?

    Zeqing Wang, Xinyu Wei, Bairui Li, Zhen Guo, Jinrui Zhang, Hongyang Wei, Keze Wang, and Lei Zhang. VideoVerse: How far is your T2V generator from a world model? arXiv preprint arXiv:2510.08398, 2025

  26. [26]

    Video models are zero-shot learners and reasoners

    Thaddäus Wiedemer, Yuxuan Li, Paul Vicol, Shixiang Shane Gu, Nick Matarese, Kevin Swersky, Been Kim, Priyank Jaini, and Robert Geirhos. Video models are zero-shot learners and reasoners. arXiv preprint arXiv:2509.20328, 2025

  27. [27]

    Visual Generation in the New Era: An Evolution from Atomic Mapping to Agentic World Modeling

    Keming Wu, Zuhao Yang, Kaichen Zhang, Shizun Wang, Haowei Zhu, Sicong Leng, Zhongyu Yang, Qijie Wang, Sudong Wang, Ziting Wang, et al. Visual generation in the new era: An evolution from atomic mapping to agentic world modeling. arXiv preprint arXiv:2604.28185, 2026

  28. [28]

    CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer

    Zhuoyi Yang, Jiayan Teng, Wendi Zheng, Ming Ding, Shiyu Huang, Jiazheng Xu, Yuanming Yang, Wenyi Hong, Xiaohan Zhang, Guanyu Feng, et al. CogVideoX: Text-to-video diffusion models with an expert transformer. arXiv preprint arXiv:2408.06072, 2024

  29. [29]

    The unreasonable effectiveness of deep features as a perceptual metric

    Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 586–595, 2018

  30. [30]

    VBench-2.0: Advancing Video Generation Benchmark Suite for Intrinsic Faithfulness

    Dian Zheng, Ziqi Huang, Hongbo Liu, Kai Zou, Yinan He, Fan Zhang, Lulu Gu, Yuanhan Zhang, Jingwen He, Wei-Shi Zheng, et al. VBench-2.0: Advancing video generation benchmark suite for intrinsic faithfulness. arXiv preprint arXiv:2503.21755, 2025

  31. [31]

    Judging LLM-as-a-judge with MT-Bench and Chatbot Arena

    Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al. Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. Advances in Neural Information Processing Systems, 36:46595–46623, 2023
