pith. sign in

arxiv: 2606.00793 · v2 · pith:D55KQMVHnew · submitted 2026-05-30 · 💻 cs.CV

MBench: A Comprehensive Benchmark on Memory Capability for Video World Models

Pith reviewed 2026-06-28 19:03 UTC · model grok-4.3

classification 💻 cs.CV
keywords video world modelsmemory capabilityentity consistencyenvironment consistencycausal consistencylong-term retentionbenchmark evaluationvideo generation
0
0 comments X

The pith

MBench reveals that video world models have systemic limitations in long-term state retention.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Video world models generate visually plausible sequences yet often lose track of objects, scenes, and cause-effect relations across extended time. The paper introduces MBench to quantify memory by dividing it into three hierarchical dimensions of entity consistency, environment consistency, and causal consistency, broken into twelve measurable sub-dimensions. The benchmark draws on curated real long videos and scores outputs with both rule-based metrics and vision-language model judgments. When applied to current leading models, the evaluations show repeated failures to preserve stable internal states over long horizons. This gap between short-term visual quality and functional memory points to a missing requirement for effective world models.

Core claim

The paper presents MBench, which decomposes memory capability into three hierarchical dimensions—entity consistency, environment consistency, and causal consistency—refined into twelve quantifiable sub-dimensions. The benchmark is constructed from rigorously curated real-captured long videos and assessed through rule-based quantitative metrics together with VLM judgments to enable objective consistency measurement. Extensive evaluations on mainstream state-of-the-art video world models demonstrate critical systemic limitations in long-term state retention.

What carries the argument

MBench, the benchmark that organizes memory evaluation into three hierarchical consistency dimensions and twelve sub-dimensions, measured on real videos by rule-based metrics and VLM judgments.

If this is right

  • Video world models require new mechanisms explicitly designed for long-term state retention.
  • Progress in the field can now be tracked with a standardized set of memory metrics.
  • Research priorities should shift from short-term visual fidelity toward sustained consistency across interactions.
  • Models that improve on the three consistency dimensions may enable more reliable downstream uses in simulation and planning.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Models incorporating explicit recurrent or memory-augmented components could be directly compared on MBench to test whether architectural changes close the observed gaps.
  • The benchmark could be adapted to interactive or controllable generation settings to check whether memory holds when users steer the video.
  • High performance on MBench may correlate with better transfer to real-world tasks that demand persistent object and environment tracking.
  • Reliance on VLM judgments for some sub-dimensions could be cross-checked against human raters to measure agreement and potential biases.

Load-bearing premise

The three hierarchical dimensions and twelve sub-dimensions together with rule-based metrics and VLM judgments on curated real videos supply an objective and comprehensive characterization of memory capability.

What would settle it

A new video world model that achieves high scores across all MBench dimensions yet produces inconsistent object positions, scene layouts, or causal outcomes when tested on long uncurated interaction sequences outside the benchmark set.

read the original abstract

Recent advancements in video-based world models have demonstrated an unprecedented ability to synthesize high-fidelity visual sequences. However, a fundamental gap persists between visually plausible video generation and the functional requirements of a world model, particularly in maintaining a stable and reasonable internal state over extended temporal horizons. While existing benchmarks primarily emphasize visual quality, motion coherence, and text-video alignment, they largely overlook memory, the core capability of a world model to preserve consistency across long-term horizons and complex interactions. To address this gap, we present \textbf{MBench}, a comprehensive benchmark dedicated to quantifying and evaluating the memory capability of video world models. We systematically decompose the memory capability of video world models into three hierarchical and complementary core dimensions: entity consistency, environment consistency, and causal consistency, which are further refined into 12 quantifiable sub-dimensions for comprehensive characterization of long-term memory. Our benchmark is built upon rigorously curated real-captured long videos, and evaluated by rule-based quantitative matrices and VLM to enable objective and comprehensive consistency assessment. Extensive evaluations of mainstream state-of-the-art video world models reveal critical systemic limitations of existing methods in long-term state retention, providing a standardized benchmark and clear research direction to advance the field.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces MBench, a benchmark for quantifying memory capability in video world models. It decomposes this capability into three hierarchical dimensions—entity consistency, environment consistency, and causal consistency—further refined into 12 sub-dimensions. The benchmark is constructed from curated real-captured long videos and evaluated using a combination of rule-based quantitative metrics and VLM-based judgments to assess consistency. Extensive evaluations of mainstream SOTA video world models are reported to reveal critical systemic limitations in long-term state retention.

Significance. If the proposed dimensions and evaluation protocol are shown to isolate long-term memory rather than other generation factors, MBench would address a clear gap in existing video generation benchmarks, which focus primarily on visual fidelity and short-term coherence. The use of real videos and mixed rule-based/VLM scoring provides a reproducible starting point for standardized assessment.

major comments (2)
  1. [Benchmark Construction and Evaluation sections] The central claim that the evaluations 'reveal critical systemic limitations' depends on the 12 sub-dimensions and VLM judgments providing an objective proxy for memory; however, the manuscript provides no external validation (e.g., correlation with human judgments or downstream task performance) that these metrics isolate long-term state retention rather than generation artifacts.
  2. [Abstract and Results sections] The abstract and description state that the benchmark enables 'objective and comprehensive consistency assessment,' but no quantitative results, model-specific scores, or statistical analysis of inter-metric agreement are referenced to support the strength of the systemic-limitation conclusion.
minor comments (1)
  1. [Abstract] In the abstract, 'rule-based quantitative matrices' appears to be a typographical error for 'metrics'.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which help clarify the need for stronger validation and presentation of results. We address each major comment below with plans for revision where appropriate.

read point-by-point responses
  1. Referee: [Benchmark Construction and Evaluation sections] The central claim that the evaluations 'reveal critical systemic limitations' depends on the 12 sub-dimensions and VLM judgments providing an objective proxy for memory; however, the manuscript provides no external validation (e.g., correlation with human judgments or downstream task performance) that these metrics isolate long-term state retention rather than generation artifacts.

    Authors: We agree that external validation strengthens the claim that the metrics isolate memory capabilities. The 12 sub-dimensions are derived from established video understanding principles, and the mixed rule-based/VLM protocol aims for objectivity, but we acknowledge the absence of human correlation data in the current manuscript. In the revised version, we will add a human study section reporting agreement rates (e.g., Cohen's kappa) between automated scores and human annotations on a sampled subset of videos. This will directly address potential confounding with generation artifacts. Downstream task performance lies outside the scope of this benchmark-focused paper. revision: yes

  2. Referee: [Abstract and Results sections] The abstract and description state that the benchmark enables 'objective and comprehensive consistency assessment,' but no quantitative results, model-specific scores, or statistical analysis of inter-metric agreement are referenced to support the strength of the systemic-limitation conclusion.

    Authors: The Results section of the manuscript already reports model-specific scores across all 12 sub-dimensions for multiple SOTA models, along with qualitative visualizations. However, we agree that explicit inter-metric agreement statistics (e.g., correlations between dimensions) are not provided, and the abstract summarizes without numbers. We will add an inter-metric agreement analysis (correlation matrix) to the Results section and revise the abstract to reference key quantitative findings supporting the systemic limitations conclusion. revision: partial

Circularity Check

0 steps flagged

No significant circularity in benchmark definition or evaluation

full rationale

This is a benchmark paper that introduces MBench by explicitly defining three hierarchical dimensions (entity, environment, causal consistency) and 12 sub-dimensions, then applies rule-based metrics and VLM judgments to curated real videos. No equations, first-principles derivations, predictions, or fitted parameters exist that could reduce to inputs by construction. No self-citation chains or uniqueness theorems are invoked to justify the core claims. The evaluation of SOTA models' limitations follows directly from applying the defined metrics, which is standard and self-contained for a new benchmark without internal reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the premise that memory capability can be decomposed into the stated consistencies and measured reliably by the chosen metrics; no free parameters or invented physical entities are introduced.

axioms (2)
  • domain assumption VLM-based judgments combined with rule-based metrics can provide objective consistency assessment
    Invoked in the abstract when describing evaluation of the benchmark
  • domain assumption The curated real-captured long videos are representative for testing long-term memory
    Stated as the basis for the benchmark construction

pith-pipeline@v0.9.1-grok · 5783 in / 1227 out tokens · 17612 ms · 2026-06-28T19:03:10.211487+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Current World Models Lack a Persistent State Core

    cs.CV 2026-06 unverdicted novelty 6.0

    Current world models fail to evolve internal state when unobserved and instead resume scenes at the last observed state, as diagnosed by the new WRBench benchmark across 23 models and 9600 videos.

Reference graph

Works this paper leans on

101 extracted references · 51 canonical work pages · cited by 1 Pith paper · 31 internal anchors

  1. [1]

    ai, Hansi Teng, Hongyu Jia, Lei Sun, Lingzhi Li, Maolin Li, Mingqiu Tang, Shuai Han, Tianning Zhang, W

    Sand. ai, Hansi Teng, Hongyu Jia, Lei Sun, Lingzhi Li, Maolin Li, Mingqiu Tang, Shuai Han, Tianning Zhang, W. Q. Zhang, Weifeng Luo, Xiaoyang Kang, Yuchen Sun, Yue Cao, Yunpeng Huang, Yutong Lin, Yuxin Fang, Zewei Tao, Zheng Zhang, Zhongshu Wang, Zixun Liu, Dai Shi, Guoli Su, Hanwen Sun, Hong Pan, Jie Wang, Jiexin Sheng, Min Cui, Min Hu, Ming Yan, Shuchen...

  2. [2]

    World simulation with video foundation models for physical ai, 2025

    Arslan Ali, Junjie Bai, Maciej Bala, Yogesh Balaji, Aaron Blakeman, Tiffany Cai, Jiaxin Cao, Tianshi Cao, Elizabeth Cha, Yu-Wei Chao, et al. World simulation with video foundation models for physical ai, 2025

  3. [3]

    Diffusion for world modeling: Visual details matter in atari.Advancesin Neural Information Processing Systems, 37:58757–58791, 2024

    Eloi Alonso, Adam Jelley, Vincent Micheli, Anssi Kanervisto, Amos Storkey, Tim Pearce, and François Fleuret. Diffusion for world modeling: Visual details matter in atari.Advancesin Neural Information Processing Systems, 37:58757–58791, 2024

  4. [4]

    Physics-aware-videos

    BAAI-DatacCube. Physics-aware-videos. https://huggingface.co/datasets/BAAI-DataCube/ Physics-aware-videos, 2025

  5. [5]

    Qwen3-VL Technical Report

    Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, Wenbin Ge, Zhifang Guo, Qidong Huang, Jie Huang, Fei Huang, Binyuan Hui, Shutong Jiang, Zhaohai Li, Mingsheng Li, Mei Li, Kaixin Li, Zicheng Lin, Junyang Lin, Xuejing Liu, Jiawei Liu, Chenglong Liu, Yang Liu, Dayiheng Liu, Shixuan ...

  6. [6]

    VideoPhy: Evaluating Physical Commonsense for Video Generation

    Hritik Bansal, Zongyu Lin, Tianyi Xie, Zeshun Zong, Michal Yarom, Yonatan Bitton, Chenfanfu Jiang, Yizhou Sun, Kai-Wei Chang, and Aditya Grover. Videophy: Evaluating physical commonsense for video generation. arXiv preprint arXiv:2406.03520, 2024

  7. [7]

    Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets

    Andreas Blattmann, Tim Dockhorn, Sumith Kulal, Daniel Mendelevitch, Maciej Kilian, Dominik Lorenz, Yam Levi, Zion English, Vikram Voleti, Adam Letts, et al. Stable video diffusion: Scaling latent video diffusion models to large datasets.arXiv preprint arXiv:2311.15127, 2023

  8. [8]

    Genie: Generative interactive environments

    Jake Bruce, Michael D Dennis, Ashley Edwards, Jack Parker-Holder, Yuge Shi, Edward Hughes, Matthew Lai, Aditi Mavalankar, Richie Steigerwald, Chris Apps, et al. Genie: Generative interactive environments. In Forty-firstInternational Conference on Machine Learning, 2024

  9. [9]

    Seed2.0 model card: Towards intelligence frontier for real-world complexity

    ByteDance Seed Team. Seed2.0 model card: Towards intelligence frontier for real-world complexity. 2026

  10. [10]

    Diffusion forcing: Next-token prediction meets full-sequence diffusion.Advancesin Neural Information Processing Systems, 37:24081–24125, 2024

    Boyuan Chen, Diego Martí Monsó, Yilun Du, Max Simchowitz, Russ Tedrake, and Vincent Sitzmann. Diffusion forcing: Next-token prediction meets full-sequence diffusion.Advancesin Neural Information Processing Systems, 37:24081–24125, 2024

  11. [11]

    Skyreels-v2: Infinite-length film generative model, 2025

    Guibin Chen, Dixuan Lin, Jiangping Yang, Chunze Lin, Junchen Zhu, Mingyuan Fan, Hao Zhang, Sheng Chen, Zheng Chen, Chengcheng Ma, Weiming Xiong, Wei Wang, Nuo Pang, Kang Kang, Zhiheng Xu, Yuzhe Jin, Yupeng Liang, Yubing Song, Peng Zhao, Boyuan Xu, Di Qiu, Debang Li, Zhengcong Fei, Yang Li, and Yahui Zhou. Skyreels-v2: Infinite-length film generative model, 2025

  12. [12]

    VideoCrafter1: Open Diffusion Models for High-Quality Video Generation

    Haoxin Chen, Menghan Xia, Yingqing He, Yong Zhang, Xiaodong Cun, Shaoshu Yang, Jinbo Xing, Yaofang Liu, Qifeng Chen, Xintao Wang, et al. Videocrafter1: Open diffusion models for high-quality video generation.arXiv preprint arXiv:2310.19512, 2023

  13. [13]

    VRAG: Learning World Models for Interactive Video Generation

    Taiye Chen, Xun Hu, Zihan Ding, and Chi Jin. Learning world models for interactive video generation.arXiv preprint arXiv:2505.21996, 2025

  14. [14]

    Carreira, J

    Aidan Clark, Jeff Donahue, and Karen Simonyan. Adversarial video generation on complex datasets.arXiv preprint arXiv:1907.06571, 2019

  15. [15]

    One-minute video generation with test-time training

    Karan Dalal, Daniel Koceja, Jiarui Xu, Yue Zhao, Shihao Han, Ka Chun Cheung, Jan Kautz, Yejin Choi, Yu Sun, and Xiaolong Wang. One-minute video generation with test-time training. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 17702–17711, 2025

  16. [16]

    Arcface: Additive angular margin loss for deep face recognition

    Jiankang Deng, Jia Guo, Niannan Xue, and Stefanos Zafeiriou. Arcface: Additive angular margin loss for deep face recognition. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 4690–4699, 2019

  17. [17]

    Worldscore: A unified evaluation benchmark for world generation.arXiv preprint arXiv:2504.00983, 2025

    Haoyi Duan, Hong-Xing Yu, Sirui Chen, Li Fei-Fei, and Jiajun Wu. Worldscore: A unified evaluation benchmark for world generation.arXiv preprint arXiv:2504.00983, 2025. 16

  18. [18]

    Vista: A generalizable driving world model with high fidelity and versatile controllability.Advancesin Neural Information Processing Systems, 37:91560–91596, 2024

    Shenyuan Gao, Jiazhi Yang, Li Chen, Kashyap Chitta, Yihang Qiu, Andreas Geiger, Jun Zhang, and Hongyang Li. Vista: A generalizable driving world model with high fidelity and versatile controllability.Advancesin Neural Information Processing Systems, 37:91560–91596, 2024

  19. [19]

    Veo-3 technical report

    Google DeepMind. Veo-3 technical report. Technical report, Google DeepMind, May 2025

  20. [20]

    Long-Context Autoregressive Video Modeling with Next-Frame Prediction

    Yuchao Gu, Weijia Mao, and Mike Zheng Shou. Long-context autoregressive video modeling with next-frame prediction. arXiv preprint arXiv:2503.19325, 2025

  21. [21]

    T2vphysbench: A first-principles benchmark for physical consistency in text-to-video gen- eration.arXiv preprint arXiv:2505.00337, 2025

    Xuyang Guo, Jiayan Huo, Zhenmei Shi, Zhao Song, Jiahao Zhang, and Jiale Zhao. T2vphysbench: A first- principles benchmark for physical consistency in text-to-video generation.arXiv preprint arXiv:2505.00337, 2025

  22. [22]

    AnimateDiff: Animate Your Personalized Text-to-Image Diffusion Models without Specific Tuning

    Yuwei Guo, Ceyuan Yang, Anyi Rao, Zhengyang Liang, Yaohui Wang, Yu Qiao, Maneesh Agrawala, Dahua Lin, and Bo Dai. Animatediff: Animate your personalized text-to-image diffusion models without specific tuning. arXiv preprint arXiv:2307.04725, 2023

  23. [23]

    Recurrent world models facilitate policy evolution.Advances in neural information processing systems, 31, 2018

    David Ha and Jürgen Schmidhuber. Recurrent world models facilitate policy evolution.Advances in neural information processing systems, 31, 2018

  24. [24]

    Mastering Diverse Domains through World Models

    Danijar Hafner, Jurgis Pasukonis, Jimmy Ba, and Timothy Lillicrap. Mastering diverse domains through world models. arXiv preprint arXiv:2301.04104, 2023

  25. [25]

    Video-bench: Human-aligned video generation benchmark

    Hui Han, Siyuan Li, Jiaqi Chen, Yiwen Yuan, Yuling Wu, Yufan Deng, Chak Tou Leong, Hanwen Du, Junchen Fu, Youhua Li, et al. Video-bench: Human-aligned video generation benchmark. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 18858–18868, 2025

  26. [26]

    Matrix-game 2.0: An open-source real-time and streaming interactive world model

    Xianglong He, Chunli Peng, Zexiang Liu, Boyang Wang, Yifan Zhang, Qi Cui, Fei Kang, Biao Jiang, Mengyin An, Yangyang Ren, Baixin Xu, Hao-Xiang Guo, Kaixiong Gong, Cyrus Wu, Wei Li, Xuchen Song, Yang Liu, Eric Li, and Yahui Zhou. Matrix-game 2.0: An open-source, real-time, and streaming interactive world model. arXiv preprint arXiv:2508.13009, 2025

  27. [27]

    Streamingt2v: Consistent, dynamic, and extendable long video generation from text

    Roberto Henschel, Levon Khachatryan, Hayk Poghosyan, Daniil Hayrapetyan, Vahram Tadevosyan, Zhangyang Wang, Shant Navasardyan, and Humphrey Shi. Streamingt2v: Consistent, dynamic, and extendable long video generation from text. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 2568–2577, 2025

  28. [28]

    Gans trained by a two time-scale update rule converge to a local nash equilibrium.Advancesin neural information processing systems, 30, 2017

    Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium.Advancesin neural information processing systems, 30, 2017

  29. [29]

    Imagen Video: High Definition Video Generation with Diffusion Models

    Jonathan Ho, William Chan, Chitwan Saharia, Jay Whang, Ruiqi Gao, Alexey Gritsenko, Diederik P Kingma, Ben Poole, Mohammad Norouzi, David J Fleet, et al. Imagen video: High definition video generation with diffusion models. arXiv preprint arXiv:2210.02303, 2022

  30. [30]

    Video diffusion models

    Jonathan Ho, Tim Salimans, Alexey Gritsenko, William Chan, Mohammad Norouzi, and David J Fleet. Video diffusion models. Advancesin neural information processing systems, 35:8633–8646, 2022

  31. [31]

    Gaia-1: A generative world model for autonomous driving, 2023

    Anthony Hu, Lloyd Russell, Hudson Yeo, Zak Murez, George Fedoseev, Alex Kendall, Jamie Shotton, and Gianluca Corrado. Gaia-1: A generative world model for autonomous driving, 2023

  32. [32]

    Self Forcing: Bridging the Train-Test Gap in Autoregressive Video Diffusion

    Xun Huang, Zhengqi Li, Guande He, Mingyuan Zhou, and Eli Shechtman. Self forcing: Bridging the train-test gap in autoregressive video diffusion.arXiv preprint arXiv:2506.08009, 2025

  33. [33]

    VBench: Comprehensive benchmark suite for video generative models

    Ziqi Huang, Yinan He, Jiashuo Yu, Fan Zhang, Chenyang Si, Yuming Jiang, Yuanhan Zhang, Tianxing Wu, Qingyang Jin, Nattapol Chanpaisit, Yaohui Wang, Xinyuan Chen, Limin Wang, Dahua Lin, Yu Qiao, and Ziwei Liu. VBench: Comprehensive benchmark suite for video generative models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024

  34. [34]

    VBench++: Comprehensive and versatile benchmark suite for video generative models.IEEE Transactionson Pattern Analysis and Machine Intelligence, 2025

    Ziqi Huang, Fan Zhang, Xiaojie Xu, Yinan He, Jiashuo Yu, Ziyue Dong, Qianli Ma, Nattapol Chanpaisit, Chenyang Si, Yuming Jiang, Yaohui Wang, Xinyuan Chen, Ying-Cong Chen, Limin Wang, Dahua Lin, Yu Qiao, and Ziwei Liu. VBench++: Comprehensive and versatile benchmark suite for video generative models.IEEE Transactionson Pattern Analysis and Machine Intellig...

  35. [35]

    Hy-world 1.5: A systematic framework for interactive world modeling with real-time latency and geometric consistency.arXiv preprint, 2025

    Team HunyuanWorld. Hy-world 1.5: A systematic framework for interactive world modeling with real-time latency and geometric consistency.arXiv preprint, 2025. 17

  36. [36]

    Openclip.Zenodo, 2021

    Gabriel Ilharco, Mitchell Wortsman, Nicholas Carlini, Rohan Taori, Achal Dave, Vaishaal Shankar, Hongseok Namkoong, John Miller, Hannaneh Hajishirzi, Ali Farhadi, et al. Openclip.Zenodo, 2021

  37. [37]

    T2vbench: Benchmarking temporal dynamics for text-to-video generation

    Pengliang Ji, Chuyang Xiao, Huilin Tai, and Mingxiao Huo. T2vbench: Benchmarking temporal dynamics for text-to-video generation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5325–5335, 2024

  38. [38]

    Memflow: Flowing adaptive memory for consistent and efficient long video narratives.arXiv preprint arXiv:2512.14699, 2025

    Sihui Ji, Xi Chen, Shuai Yang, Xin Tao, Pengfei Wan, and Hengshuang Zhao. Memflow: Flowing adaptive memory for consistent and efficient long video narratives.arXiv preprint arXiv:2512.14699, 2025

  39. [39]

    Pyramidal flow matching for efficient video generative modeling,

    Yang Jin, Zhicheng Sun, Ningyuan Li, Kun Xu, Hao Jiang, Nan Zhuang, Quzhe Huang, Yang Song, Yadong Mu, and Zhouchen Lin. Pyramidal flow matching for efficient video generative modeling. arXiv preprint arXiv:2410.05954, 2024

  40. [40]

    Fifo-diffusion: Generating infinite videos from text without training.Advancesin Neural Information Processing Systems, 37:89834–89868, 2024

    Jihwan Kim, Junoh Kang, Jinyoung Choi, and Bohyung Han. Fifo-diffusion: Generating infinite videos from text without training.Advancesin Neural Information Processing Systems, 37:89834–89868, 2024

  41. [41]

    Tanks and temples: Benchmarking large-scale scene reconstruction

    Arno Knapitsch, Jaesik Park, Qian-Yi Zhou, and Vladlen Koltun. Tanks and temples: Benchmarking large-scale scene reconstruction. ACM Transactions on Graphics, 36(4), 2017

  42. [42]

    HunyuanVideo: A Systematic Framework For Large Video Generative Models

    Weijie Kong, Qi Tian, Zijian Zhang, Rox Min, Zuozhuo Dai, Jin Zhou, Jiangfeng Xiong, Xin Li, Bo Wu, Jianwei Zhang, et al. Hunyuanvideo: A systematic framework for large video generative models.arXiv preprint arXiv:2412.03603, 2024

  43. [43]

    Gonzalez, Ion Stoica, Song Han, and Yao Lu

    Dacheng Li, Yunhao Fang, Yukang Chen, Shuo Yang, Shiyi Cao, Justin Wong, Michael Luo, Xiaolong Wang, Hongxu Yin, Joseph E. Gonzalez, Ion Stoica, Song Han, and Yao Lu. Worldmodelbench: Judging video generation models as world models. 2025

  44. [44]

    Openhumanvid: A large-scale high-quality dataset for enhancing human-centric video generation

    Hui Li, Mingwang Xu, Yun Zhan, Shan Mu, Jiaye Li, Kaihui Cheng, Yuxuan Chen, Tan Chen, Mao Ye, Jingdong Wang, et al. Openhumanvid: A large-scale high-quality dataset for enhancing human-centric video generation. arXiv preprint arXiv:2412.00115, 2024

  45. [45]

    Genie Envisioner: A Unified World Foundation Platform for Robotic Manipulation

    Yue Liao, Pengfei Zhou, Siyuan Huang, Donglin Yang, Shengcong Chen, Yuxin Jiang, Yue Hu, Jingbin Cai, Si Liu, Jianlan Luo, Liliang Chen, Shuicheng Yan, Maoqing Yao, and Guanghui Ren. Genie envisioner: A unified world foundation platform for robotic manipulation.arXiv preprint arXiv:2508.05635, 2025

  46. [46]

    Depth Anything 3: Recovering the Visual Space from Any Views

    Haotong Lin, Sili Chen, Junhao Liew, Donny Y Chen, Zhenyu Li, Guang Shi, Jiashi Feng, and Bingyi Kang. Depth anything 3: Recovering the visual space from any views.arXiv preprint arXiv:2511.10647, 2025

  47. [47]

    Dl3dv-10k: A large-scale scene dataset for deep learning-based 3d vision

    Lu Ling, Yichen Sheng, Zhi Tu, Wentian Zhao, Cheng Xin, Kun Wan, Lantao Yu, Qianyu Guo, Zixun Yu, Yawen Lu, et al. Dl3dv-10k: A large-scale scene dataset for deep learning-based 3d vision. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22160–22169, 2024

  48. [48]

    Evalcrafter: Benchmarking and evaluating large video generation models

    Yaofang Liu, Xiaodong Cun, Xuebo Liu, Xintao Wang, Yong Zhang, Haoxin Chen, Yang Liu, Tieyong Zeng, Raymond Chan, and Ying Shan. Evalcrafter: Benchmarking and evaluating large video generation models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 22139–22149, 2024

  49. [49]

    Sora: A Review on Background, Technology, Limitations, and Opportunities of Large Vision Models

    Yixin Liu, Kai Zhang, Yuan Li, Zhiling Yan, Chujie Gao, Ruoxi Chen, Zhengqing Yuan, Yue Huang, Hanchi Sun, Jianfeng Gao, et al. Sora: A review on background, technology, limitations, and opportunities of large vision models. arXiv preprint arXiv:2402.17177, 2024

  50. [50]

    Fetv: A benchmark for fine-grained evaluation of open-domain text-to-video generation.Advancesin Neural Information Processing Systems, 36:62352–62387, 2023

    Yuanxin Liu, Lei Li, Shuhuai Ren, Rundong Gao, Shicheng Li, Sishuo Chen, Xu Sun, and Lu Hou. Fetv: A benchmark for fine-grained evaluation of open-domain text-to-video generation.Advancesin Neural Information Processing Systems, 36:62352–62387, 2023

  51. [51]

    Out of sight, out of mind? evaluating state evolution in video world models, 2026

    Ziqi Ma, Mengzhan Liufu, and Georgia Gkioxari. Out of sight, out of mind? evaluating state evolution in video world models, 2026

  52. [52]

    Yume-1.5: A text-controlled interactive world generation model

    Xiaofeng Mao, Zhen Li, Chuanhao Li, Xiaojie Xu, Kaining Ying, Tong He, Jiangmiao Pang, Yu Qiao, and Kaipeng Zhang. Yume-1.5: A text-controlled interactive world generation model.arXiv preprint arXiv:2512.22096, 2025

  53. [53]

    Yume: An interactive world generation model.arXiv preprint arXiv:2507.17744, 2025

    Xiaofeng Mao, Shaoheng Lin, Zhen Li, Chuanhao Li, Wenshuo Peng, Tong He, Jiangmiao Pang, Mingmin Chi, Yu Qiao, and Kaipeng Zhang. Yume: An interactive world generation model.arXiv preprint arXiv:2507.17744, 2025. 18

  54. [54]

    Towards World Simulator: Crafting Physical Commonsense-Based Benchmark for Video Generation

    Fanqing Meng, Jiaqi Liao, Xinyu Tan, Wenqi Shao, Quanfeng Lu, Kaipeng Zhang, Yu Cheng, Dianqi Li, Yu Qiao, and Ping Luo. Towards world simulator: Crafting physical commonsense-based benchmark for video generation. arXiv preprint arXiv:2410.05363, 2024

  55. [55]

    Transformers are sample-efficient world models

    Vincent Micheli, Eloi Alonso, and François Fleuret. Transformers are sample-efficient world models.arXiv preprint arXiv:2209.00588, 2022

  56. [56]

    Sora2.https://openai.com/zh-Hans-CN/index/sora-2/, 2025

    OpenAI. Sora2.https://openai.com/zh-Hans-CN/index/sora-2/, 2025

  57. [57]

    Maxime Oquab, Timothée Darcet, Theo Moutakanni, Huy V. Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, Russell Howes, Po-Yao Huang, Hu Xu, Vasu Sharma, Shang-Wen Li, Wojciech Galuba, Mike Rabbat, Mido Assran, Nicolas Ballas, Gabriel Synnaeve, Ishan Misra, Herve Jegou, Julien Mairal, Patrick Labatu...

  58. [58]

    Scalable diffusion models with transformers

    William Peebles and Saining Xie. Scalable diffusion models with transformers. InProceedings of the IEEE/CVF international conference on computer vision, pages 4195–4205, 2023

  59. [59]

    Freenoise: Tuning-free longer video diffusion via noise rescheduling.arXiv preprint arXiv:2310.15169, 2023

    Haonan Qiu, Menghan Xia, Yong Zhang, Yingqing He, Xintao Wang, Ying Shan, and Ziwei Liu. Freenoise: Tuning-free longer video diffusion via noise rescheduling.arXiv preprint arXiv:2310.15169, 2023

  60. [60]

    SAM 2: Segment Anything in Images and Videos

    Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu, Ronghang Hu, Chaitanya Ryali, Tengyu Ma, Haitham Khedr, Roman Rädle, Chloe Rolland, Laura Gustafson, Eric Mintun, Junting Pan, Kalyan Vasudev Alwala, Nicolas Carion, Chao-Yuan Wu, Ross Girshick, Piotr Dollár, and Christoph Feichtenhofer. Sam 2: Segment anything in images and videos.arXiv preprint arXiv:2408.00714, 2024

  61. [61]

    Improved techniques for training gans.Advancesin neural information processing systems, 29, 2016

    Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training gans.Advancesin neural information processing systems, 29, 2016

  62. [62]

    Team Seedance, De Chen, Liyang Chen, Xin Chen, Ying Chen, Zhuo Chen, Zhuowei Chen, Feng Cheng, Tianheng Cheng, Yufeng Cheng, Mojie Chi, Xuyan Chi, Jian Cong, Qinpeng Cui, Fei Ding, Qide Dong, Yujiao Du, Haojie Duanmu, Junliang Fan, Jiarui Fang, Jing Fang, Zetao Fang, Chengjian Feng, Yu Gao, Diandian Gu, Dong Guo, Hanzhong Guo, Qiushan Guo, Boyang Hao, Hon...

  63. [63]

    Make-A-Video: Text-to-Video Generation without Text-Video Data

    Uriel Singer, Adam Polyak, Thomas Hayes, Xi Yin, Jie An, Songyang Zhang, Qiyuan Hu, Harry Yang, Oron Ashual, Oran Gafni, et al. Make-a-video: Text-to-video generation without text-video data.arXiv preprint arXiv:2209.14792, 2022

  64. [64]

    Matrix-game 3.0: Real-time and streaming interactive world model with long-horizon memory

    Skywork AI Matrix-Game Team. Matrix-game 3.0: Real-time and streaming interactive world model with long-horizon memory. Technical report, 2026

  65. [65]

    History-Guided Video Diffusion

    Kiwhan Song, Boyuan Chen, Max Simchowitz, Yilun Du, Russ Tedrake, and Vincent Sitzmann. History-guided video diffusion.arXiv preprint arXiv:2502.06764, 2025

  66. [66]

    T2v-compbench: A comprehensive benchmark for compositional text-to-video generation

    Kaiyue Sun, Kaiyi Huang, Xian Liu, Yue Wu, Zihan Xu, Zhenguo Li, and Xihui Liu. T2v-compbench: A comprehensive benchmark for compositional text-to-video generation. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 8406–8416, 2025. 19

  67. [67]

    Longcat-video technical report, 2025

    Meituan LongCat Team, Xunliang Cai, Qilong Huang, Zhuoliang Kang, Hongyu Li, Shijun Liang, Liya Ma, Siyu Ren, Xiaoming Wei, Rixu Xie, and Tong Zhang. Longcat-video technical report, 2025

  68. [68]

    Advancing Open-source World Models

    Robbyant Team, Zelin Gao, Qiuyu Wang, Yanhong Zeng, Jiapeng Zhu, Ka Leong Cheng, Yixuan Li, Hanlin Wang, Yinghao Xu, Shuailei Ma, Yihang Chen, Jie Liu, Yansong Cheng, Yao Yao, Jiayi Zhu, Yihao Meng, Kecheng Zheng, Qingyan Bai, Jingye Chen, Zehong Shen, Yue Yu, Xing Zhu, Yujun Shen, and Hao Ouyang. Advancing open-source world models.arXiv preprint arXiv:26...

  69. [69]

    Mocogan: Decomposing motion and content for video generation

    Sergey Tulyakov, Ming-Yu Liu, Xiaodong Yang, and Jan Kautz. Mocogan: Decomposing motion and content for video generation. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 1526–1535, 2018

  70. [70]

    UniFOLM-WMA-0: A world-model-action (wma) framework under UniFOLM family, 2025

    Unitree. UniFOLM-WMA-0: A world-model-action (wma) framework under UniFOLM family, 2025

  71. [71]

    Towards Accurate Generative Models of Video: A New Metric & Challenges

    Thomas Unterthiner, Sjoerd Van Steenkiste, Karol Kurach, Raphael Marinier, Marcin Michalski, and Sylvain Gelly. Towards accurate generative models of video: A new metric & challenges.arXiv preprintarXiv:1812.01717, 2018

  72. [72]

    How close are world models to the physical world?arXiv preprint arXiv:2501.xxxxx, 2025

    Rishi Upadhyay, Howard Zhang, Zhirong Lu, Lakshman Sundaram, Ayush Agrawal, Yilin Wu, Yunhao Ba, Alex Wong, Celso M de Melo, and Achuta Kadambi. How close are world models to the physical world?arXiv preprint arXiv:2501.xxxxx, 2025

  73. [73]

    Phenaki: Variable Length Video Generation From Open Domain Textual Description

    Ruben Villegas, Mohammad Babaeizadeh, Pieter-Jan Kindermans, Hernan Moraldo, Han Zhang, Moham- mad Taghi Saffar, Santiago Castro, Julius Kunze, and Dumitru Erhan. Phenaki: Variable length video generation from open domain textual description.arXiv preprint arXiv:2210.02399, 2022

  74. [74]

    Generating videos with scene dynamics.Advances in neural information processing systems, 29, 2016

    Carl Vondrick, Hamed Pirsiavash, and Antonio Torralba. Generating videos with scene dynamics.Advances in neural information processing systems, 29, 2016

  75. [75]

    Wan: Open and Advanced Large-Scale Video Generative Models

    Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, Jianyuan Zeng, Jiayu Wang, Jingfeng Zhang, Jingren Zhou, Jinkai Wang, Jixuan Chen, Kai Zhu, Kang Zhao, Keyu Yan, Lianghua Huang, Mengyang Feng, Ningyi Zhang, Pandeng Li, Pingyu Wu, Ruihang Chu, Ruili Feng, Shiwei Zhang, Siyang Sun, Tao Fang, T...

  76. [76]

    Gen-l-video: Multi-text to long video generation via temporal co-denoising.arXiv preprint arXiv:2305.18264, 2023

    Fu-Yun Wang, Wenshuo Chen, Guanglu Song, Han-Jia Ye, Yu Liu, and Hongsheng Li. Gen-l-video: Multi-text to long video generation via temporal co-denoising.arXiv preprint arXiv:2305.18264, 2023

  77. [77]

    Spatialvid: A large-scale video dataset with spatial annotations

    Jiahao Wang, Yufeng Yuan, Rujie Zheng, Youtian Lin, Jian Gao, Lin-Zhuo Chen, Yajie Bao, Yi Zhang, Chang Zeng, Yanxi Zhou, et al. Spatialvid: A large-scale video dataset with spatial annotations.arXiv preprint arXiv:2509.09676, 2025

  78. [78]

    Drivedreamer: Towards real-world-driven world models for autonomous driving,

    Xiaofeng Wang, Zheng Zhu, Guan Huang, Xinze Chen, Jiagang Zhu, and Jiwen Lu. Drivedreamer: Towards real-world-driven world models for autonomous driving.arXiv preprint arXiv:2309.09777, 2023

  79. [79]

    Matrix-Game 3.0: Real-Time and Streaming Interactive World Model with Long-Horizon Memory

    Zile Wang, Zexiang Liu, Jiaxing Li, Kaichen Huang, Baixin Xu, Fei Kang, Mengyin An, Peiyu Wang, Biao Jiang, Yichen Wei, et al. Matrix-game 3.0: Real-time and streaming interactive world model with long-horizon memory. arXiv preprint arXiv:2604.08995, 2026

  80. [80]

    Godiva: Generating open-domain videos from natural descriptions.arXiv preprint arXiv:2104.14806, 2021

    Chenfei Wu, Lun Huang, Qianxi Zhang, Binyang Li, Lei Ji, Fan Yang, Guillermo Sapiro, and Nan Duan. Godiva: Generating open-domain videos from natural descriptions.arXiv preprint arXiv:2104.14806, 2021

Showing first 80 references.