Recognition: 2 theorem links · Lean Theorem
WorldReasonBench: Human-Aligned Stress Testing of Video Generators as Future World-State Predictors
Pith reviewed 2026-05-12 03:34 UTC · model grok-4.3
The pith
Video generators produce clips that look realistic but routinely violate physical dynamics, causality, and object permanence.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
WorldReasonBench reframes video generation evaluation as world-state prediction and demonstrates that models such as Seedance2.0 and Veo3.1 can generate visually convincing future videos while failing to preserve physical consistency, causal relations, or information about objects across frames.
What carries the argument
The 436 curated test cases with structured QA annotations across four reasoning dimensions, evaluated via Process-aware Reasoning Verification, which diagnoses temporal and causal failures, and Multi-dimensional Quality Assessment, which scores reasoning quality separately from aesthetics.
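To make the two-part protocol concrete, here is a minimal sketch of how a single test case might be verified, assuming a hypothetical case layout and exact-match ground-truth answers; the field names, the answer_question judging hook, and the scoring rule are illustrative assumptions, not the paper's released schema.

```python
from dataclasses import dataclass, field

@dataclass
class TestCase:
    """Hypothetical layout of one WorldReasonBench item; field names are illustrative."""
    case_id: str
    dimension: str            # e.g. "physical", "social", "logical", "informational"
    subcategory: str          # one of the 22 subcategories
    initial_frame: str        # path to the conditioning image
    action_prompt: str        # action applied to the initial state
    qa_pairs: list = field(default_factory=list)  # [(question, expected_answer), ...]

def verify_case(case: TestCase, generated_video, answer_question) -> float:
    """Process-aware verification sketch: pose each ground-truth question about the
    generated video and return the fraction answered as expected.
    answer_question(video, question) -> str is an assumed judging hook
    (a human annotator or a VLM judge), not the paper's actual interface."""
    if not case.qa_pairs:
        return 0.0
    correct = sum(
        answer_question(generated_video, q).strip().lower() == a.strip().lower()
        for q, a in case.qa_pairs
    )
    return correct / len(case.qa_pairs)
```

A full run would then average these per-case scores within each of the four reasoning dimensions, which is the shape of the reasoning-versus-aesthetics gap described above.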
If this is right
- Training objectives must explicitly penalize violations of dynamics and causality rather than optimizing only for visual fidelity.
- Generated videos can serve as a diagnostic for whether a system understands object permanence and action consequences.
- Reward models trained on the released preference pairs can guide generation toward more consistent future states.
- Persistent failures in information preservation indicate limits on using these systems for long-horizon prediction tasks.
- The benchmark separates visual appeal from reasoning quality, allowing targeted improvements in each.
Where Pith is reading between the lines
- If the gap holds, video generators may need hybrid architectures that combine rendering with explicit world models rather than pure diffusion or autoregressive approaches.
- The same stress-testing logic could apply to other generative domains such as 3D scene synthesis or interactive environments.
- High failure rates on causality questions suggest that scaling data alone will not suffice without new supervision signals for temporal logic.
- The benchmark's release enables direct comparison of future models on the same held-out cases, reducing reliance on subjective visual inspection.
Load-bearing premise
The selected test cases and the two-part human scoring process accurately reflect genuine world-state prediction ability without systematic bias or overlooked failure modes.
What would settle it
A model that scores high on both reasoning verification and quality assessment yet produces videos that break basic physics or lose track of objects when tested on new, held-out real-world scenarios.
Original abstract
Commercial video generation systems such as Seedance2.0 and Veo3.1 have rapidly improved, strengthening the view that video generators may be evolving into "world simulators." Yet the community still lacks a benchmark that directly tests whether a model can reason about how an observed world should evolve over time. We introduce WorldReasonBench, which reframes video generation evaluation as world-state prediction: given an initial state and an action, can a model generate a future video whose state evolution remains physically, socially, logically, and informationally consistent? WorldReasonBench contains 436 curated test cases with structured ground-truth QA annotations spanning four reasoning dimensions and 22 subcategories. We evaluate generated videos with a human-aligned two-part methodology: Process-aware Reasoning Verification uses structured QA and reasoning-phase diagnostics to detect temporal and causal failures, while Multi-dimensional Quality Assessment scores reasoning quality, temporal consistency, and visual aesthetics for ranking and reward modeling. We further introduce WorldRewardBench, a preference benchmark with approximately 6K expert-annotated pairs over 1.4K videos, supporting pair-wise and point-wise reward-model evaluation. Across modern video generators, our results expose a persistent gap between visual plausibility and world reasoning: videos can look convincing while failing dynamics, causality, or information preservation. We will release our benchmarks and evaluation toolkit to support community research on genuinely world-aware video generation at https://github.com/UniX-AI-Lab/WorldReasonBench/.
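As one way to read the abstract's pair-wise reward-model track, the sketch below scores a candidate reward model by its agreement with expert preference pairs; the (video_a, video_b, label) tuple format and the score_video callable are assumptions made for illustration, not the released WorldRewardBench interface.

```python
def pairwise_accuracy(preference_pairs, score_video):
    """preference_pairs: iterable of (video_a, video_b, label) with label in
    {"A", "B", "tie"} reflecting the expert verdict (format assumed for illustration).
    score_video(video) -> float is the reward model being evaluated."""
    hits, total = 0, 0
    for video_a, video_b, label in preference_pairs:
        if label == "tie":
            continue  # tie handling would need a separate rule; skipped in this sketch
        predicted = "A" if score_video(video_a) > score_video(video_b) else "B"
        hits += int(predicted == label)
        total += 1
    return hits / total if total else float("nan")
```

The point-wise track could analogously be checked with a rank correlation between reward-model scores and expert point-wise ratings.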
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces WorldReasonBench, a benchmark with 436 author-curated test cases and structured ground-truth QA annotations spanning four reasoning dimensions and 22 subcategories, to evaluate video generators as world-state predictors given an initial state and action. It proposes a two-part human-aligned evaluation: Process-aware Reasoning Verification (using structured QA and phase diagnostics for temporal/causal failures) and Multi-dimensional Quality Assessment (scoring reasoning quality, temporal consistency, and aesthetics). The work also releases WorldRewardBench (~6K expert preference pairs over 1.4K videos) and reports a persistent gap between visual plausibility and failures in dynamics, causality, or information preservation across models like Seedance2.0 and Veo3.1. The benchmarks and toolkit are to be open-sourced.
Significance. If the benchmark's validity holds, the work is significant for shifting video generation evaluation from visual fidelity toward verifiable world reasoning, which is increasingly relevant as models are positioned as world simulators. Explicit strengths include the release of the full benchmark, evaluation toolkit, and preference data, which enables reproducibility, community extensions, and falsifiable testing of future claims. The structured dimensions and subcategories provide a concrete, extensible framework that could support reward modeling and model improvement beyond current visual metrics.
major comments (2)
- [Section 3] Benchmark construction: The central claim of a persistent gap between visual quality and world reasoning rests on the 436 test cases accurately isolating reasoning failures. The manuscript provides no quantitative coverage analysis, diversity metrics, or inter-curator agreement for the author-curated cases and QA annotations; without these, selection bias toward easily detectable failure modes cannot be ruled out, which directly undermines the robustness of the reported gap.
- [Section 4] Evaluation methodology: The Process-aware Reasoning Verification and Multi-dimensional Quality Assessment rely on human judgments to detect dynamics, causality, and information-preservation failures. No inter-annotator agreement statistics, correlations with automated proxies, or external validation of the two-part protocol are reported; this is load-bearing because unquantified subjectivity in the reasoning-phase diagnostics could artifactually produce the claimed separation from visual plausibility.
minor comments (2)
- [Abstract and Section 5] Clarify the exact model versions evaluated (e.g., Seedance2.0, Veo3.1) and ensure they are consistently referenced with citations or links in the experimental setup.
- The GitHub release promise is welcome; adding a permanent archive link (e.g., Zenodo DOI) would strengthen long-term reproducibility.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on benchmark construction and evaluation methodology. We address each major comment below and will revise the manuscript accordingly to strengthen the claims.
Point-by-point responses
Referee: [Section 3] Benchmark construction: The central claim of a persistent gap between visual quality and world reasoning rests on the 436 test cases accurately isolating reasoning failures. The manuscript provides no quantitative coverage analysis, diversity metrics, or inter-curator agreement for the author-curated cases and QA annotations; without these, selection bias toward easily detectable failure modes cannot be ruled out, which directly undermines the robustness of the reported gap.
Authors: We agree that quantitative metrics would strengthen the evidence against selection bias. In the revised manuscript, we will add: (1) a coverage analysis of how the 436 test cases are distributed across the four reasoning dimensions and 22 subcategories; (2) diversity metrics covering scenario variety and action types; and (3) inter-curator agreement statistics for the QA annotations, computed via Fleiss' kappa on a double-annotated subset. These additions will demonstrate representativeness while preserving the expert curation process described in Section 3.
Revision: yes
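A minimal sketch of the promised agreement statistic, assuming each double-annotated QA item receives one of a small fixed set of labels from each curator; this is a standard Fleiss' kappa computation, not code from the paper.

```python
import numpy as np

def fleiss_kappa(label_counts):
    """label_counts: (n_items, n_categories) array where entry [i, j] is how many
    annotators assigned category j to item i; every item must have the same number
    of annotators (2 for a double-annotated subset)."""
    counts = np.asarray(label_counts, dtype=float)
    n_items, _ = counts.shape
    n_raters = counts[0].sum()
    # per-item observed agreement
    p_i = ((counts ** 2).sum(axis=1) - n_raters) / (n_raters * (n_raters - 1))
    p_bar = p_i.mean()
    # chance agreement from marginal category proportions
    p_j = counts.sum(axis=0) / (n_items * n_raters)
    p_e = (p_j ** 2).sum()
    return (p_bar - p_e) / (1 - p_e)
```

For example, two curators each assigning the 436 items to one of three hypothetical categories would yield a 436 x 3 count matrix as input.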
Referee: [Section 4] Evaluation methodology: The Process-aware Reasoning Verification and Multi-dimensional Quality Assessment rely on human judgments to detect dynamics, causality, and information-preservation failures. No inter-annotator agreement statistics, correlations with automated proxies, or external validation of the two-part protocol are reported; this is load-bearing because unquantified subjectivity in the reasoning-phase diagnostics could artifactually produce the claimed separation from visual plausibility.
Authors: We concur that reliability metrics are essential. The revised manuscript will report inter-annotator agreement (e.g., Cohen's kappa) for both Process-aware Reasoning Verification and Multi-dimensional Quality Assessment. We will also include correlations with automated proxies such as video consistency models and expand the discussion of protocol validation steps. The structured QA annotations and phase diagnostics were designed to reduce subjectivity, but these additions will further substantiate the separation between visual plausibility and reasoning failures.
Revision: yes
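For the reliability checks promised here, standard off-the-shelf statistics suffice; the sketch below pairs Cohen's kappa for inter-annotator agreement with a Spearman rank correlation against an automated proxy score, assuming aligned per-item label lists and per-video score lists (the proxy metric itself is not specified in the paper).

```python
from scipy.stats import spearmanr
from sklearn.metrics import cohen_kappa_score

def protocol_reliability(labels_a, labels_b, human_scores, proxy_scores):
    """labels_a / labels_b: per-item verdicts from two annotators.
    human_scores / proxy_scores: per-video scores from the human protocol and from
    an automated consistency proxy; all inputs are assumed to be aligned lists."""
    kappa = cohen_kappa_score(labels_a, labels_b)           # inter-annotator agreement
    rho, p_value = spearmanr(human_scores, proxy_scores)    # human-proxy rank alignment
    return kappa, rho, p_value
```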
Circularity Check
No circularity: empirical benchmark creation without reductive derivations
Full rationale
The paper introduces WorldReasonBench (436 curated cases with QA annotations) and WorldRewardBench (~6K pairs) plus a two-part human evaluation protocol, then reports empirical gaps on existing video generators. No equations, first-principles derivations, fitted parameters, or predictions are claimed that could reduce to the inputs by construction. The contribution is the benchmark and its application; findings are self-contained empirical observations rather than any self-referential chain.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Structured QA annotations on physical, social, logical, and informational consistency can serve as reliable ground truth for evaluating world-state prediction.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean, Cost/FunctionalEquation.lean · theorems: reality_from_one_distinction, washburn_uniqueness_aczel · tag: unclear (relation between the paper passage and the cited Recognition theorem is ambiguous)
Paper passage: "WorldReasonBench contains 436 curated test cases with structured ground-truth QA annotations spanning four reasoning dimensions and 22 subcategories... Process-aware Reasoning Verification uses structured QA and reasoning-phase diagnostics to detect temporal and causal failures"
- IndisputableMonolith/Foundation/ArrowOfTime.lean, AlexanderDuality.lean · theorems: arrow_from_z, alexander_duality_circle_linking · tag: unclear (relation between the paper passage and the cited Recognition theorem is ambiguous)
Paper passage: "videos can look convincing while failing dynamics, causality, or information preservation"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.