
arxiv: 2606.10620 · v1 · pith:S3IEV2KPnew · submitted 2026-06-09 · 💻 cs.CV · cs.AI

Can Image Models Imagine Time? ImageTime: A Novel Benchmark for Probing Visual World Modeling Through Spatiotemporal Consistency

Pith reviewed 2026-06-27 13:32 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords image generation · benchmark · spatiotemporal consistency · visual world modeling · temporal reasoning · keyframe generation · causal consistency

The pith

Image generation models struggle to keep visual states consistent across ordered time steps.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents ImageTime, a benchmark that asks image generation models to produce a single image containing four ordered key states of an action: initial state, action onset, transition state, and final state. Models receive an action instruction and optionally a reference image, then must satisfy stage-wise predicates, cross-frame temporal constraints, and avoid forbidden causal violations. A structured VLM-as-judge protocol using GPT-5.5 produces capability scores, diagnostic subscores, and failure labels. Multi-family testing shows where models succeed or drift when asked to maintain identities, spatial relations, and causal order over time.

Core claim

ImageTime introduces a four-keyframe protocol as a probe of visual world modeling, requiring image models to generate one image that depicts an initial state, action onset, transition state, and final state while obeying temporal constraints and avoiding causal violations, with scores assigned by a structured VLM judge.

What carries the argument

Four-keyframe generation task with stage-wise state predicates, cross-frame temporal constraints, and forbidden causal violations, evaluated under a VLM-as-judge protocol.
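The decomposition above can be pictured as a small data structure. The following is a minimal sketch only: the paper's actual schema is not shown here, so `Scenario`, `STAGES`, the field names, and the pouring example are all hypothetical, chosen to mirror the four stages, the stage-wise predicates, the cross-frame constraints, and the forbidden violations named in the text.

```python
from dataclasses import dataclass, field

# The four ordered key states named in the paper; the short names are assumptions.
STAGES = ("initial", "onset", "transition", "final")


@dataclass
class Scenario:
    """Hypothetical container for one benchmark scenario."""
    instruction: str
    # stage name -> predicates expected to hold in that keyframe
    stage_predicates: dict[str, list[str]]
    # (earlier_stage, later_stage, description) relations across keyframes
    temporal_constraints: list[tuple[str, str, str]] = field(default_factory=list)
    # descriptions of causal violations that must not appear
    forbidden_violations: list[str] = field(default_factory=list)

    def validate(self) -> None:
        """Check that all four stages are present and constraints point forward."""
        missing = [s for s in STAGES if s not in self.stage_predicates]
        if missing:
            raise ValueError(f"missing stages: {missing}")
        for earlier, later, _ in self.temporal_constraints:
            if STAGES.index(earlier) >= STAGES.index(later):
                raise ValueError(f"constraint must point forward: {earlier} -> {later}")


# Illustrative scenario; not taken from the benchmark.
pour = Scenario(
    instruction="Pour water from a jug into an empty glass",
    stage_predicates={
        "initial": ["glass is empty", "jug is upright"],
        "onset": ["jug is tilted toward the glass"],
        "transition": ["water stream is visible", "glass is partly filled"],
        "final": ["glass is full", "jug is upright"],
    },
    temporal_constraints=[("initial", "final", "same glass and same jug")],
    forbidden_violations=["glass is full before pouring starts"],
)
pour.validate()
```

Keeping predicates, constraints, and violations as separate fields is what would let a judge report them as separate subscores rather than one overall rating.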

If this is right

  • Models that score highly on ImageTime are stronger candidates for storyboarding, step-by-step illustration, and reference-guided editing workflows.
  • Diagnostic subscores isolate specific failure types such as identity drift or causal order violations across the four states.
  • Progressive task hierarchy allows measurement of incremental improvements in temporal coherence without requiring full video generation.
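To illustrate how diagnostic subscores and failure labels might be rolled up across many generated images, here is a small sketch. The record layout (`subscores`, `failure_labels`), the label names, and the use of a plain mean are assumptions made for illustration; the paper's actual aggregation rule is not given in this summary.

```python
from collections import Counter
from statistics import mean


def summarize(records):
    """Average each subscore and count each failure label over judge records."""
    names = sorted({name for r in records for name in r["subscores"]})
    mean_subscores = {
        n: mean(r["subscores"][n] for r in records if n in r["subscores"])
        for n in names
    }
    failure_counts = Counter(
        label for r in records for label in r["failure_labels"]
    )
    return mean_subscores, failure_counts


# Two made-up judge records for one model.
records = [
    {"subscores": {"identity": 1.0, "causal_order": 0.5},
     "failure_labels": ["causal_order_violation"]},
    {"subscores": {"identity": 0.5, "causal_order": 0.5},
     "failure_labels": ["identity_drift", "causal_order_violation"]},
]
means, failures = summarize(records)
print(means)     # {'causal_order': 0.5, 'identity': 0.75}
print(failures)  # causal_order_violation: 2, identity_drift: 1
```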

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The benchmark could be adapted to test whether models improve when given explicit causal chain instructions rather than single action prompts.
  • Results may inform whether reference images help more with identity preservation than with transition logic.

Load-bearing premise

The GPT-5.5 VLM-as-judge protocol produces reliable, unbiased scores for spatiotemporal consistency and causal violations.

What would settle it

A systematic comparison showing frequent disagreement between human judges and GPT-5.5 scores on whether a generated image violates causal order or identity preservation would falsify the evaluation method.
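One common way to quantify such human-versus-judge agreement is Cohen's kappa, which corrects raw agreement for the agreement expected by chance. The sketch below is a plain-Python version for two raters with categorical labels; the label values and the toy data are made up for illustration and do not come from the paper.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Fraction of items on which the two raters agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each rater's marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if expected == 1.0:
        # Kappa is undefined (0/0) when both raters use one identical label;
        # returning 1.0 is a convention chosen for this sketch.
        return 1.0
    return (observed - expected) / (1 - expected)


# Toy labels: does the image violate causal order?
human = ["ok", "ok", "violation", "violation"]
judge = ["ok", "ok", "violation", "ok"]
print(cohens_kappa(human, judge))  # 0.5
```

In the toy data the raters agree on 3 of 4 items (0.75), chance agreement is 0.5, and kappa is (0.75 - 0.5) / (1 - 0.5) = 0.5. A kappa near zero on causal-order or identity labels would be the kind of disagreement described above.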

Original abstract

Image generation models now produce high-quality static images, yet their ability to represent how a visual world changes over time remains poorly understood. Practical workflows such as storyboarding, step-by-step illustration, reference-guided editing, and video previsualization require models to preserve identities, objects, spatial relations, and causal order across multiple visual states. Existing evaluations largely measure single-image correctness, compositional alignment, or video quality, leaving open whether an image model can coherently imagine a temporally ordered process. We introduce ImageTime, a diagnostic benchmark that uses spatiotemporal consistency as a behavioral probe of visual world modeling in image generation. Given an action instruction, and optionally a reference image specifying the initial state, a model must generate one image containing four ordered key states: initial state, action onset, transition state, and final state. This four-keyframe protocol is more temporally demanding than single-image generation while avoiding the confounds of dense video dynamics. ImageTime organizes tasks with a progressive capability hierarchy and decomposes each scenario into stage-wise state predicates, cross-frame temporal constraints, and forbidden causal violations. GPT-5.5 scores all generated images under a structured VLM-as-judge protocol, producing interpretable capability scores, diagnostic subscores, and failure labels. Through multi-family benchmarking, ImageTime reveals where current image generation systems succeed, fail, and drift when asked to maintain coherent visual world states over time.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The paper introduces ImageTime, a diagnostic benchmark for image generation models that requires generating a single image containing four ordered key states (initial state, action onset, transition state, final state) for a given action instruction (optionally with a reference image). Tasks are organized by a progressive capability hierarchy and decomposed into stage-wise predicates, temporal constraints, and forbidden violations; all outputs are scored for spatiotemporal consistency and causal violations via a structured GPT-5.5 VLM-as-judge protocol, with the goal of revealing success, failure, and drift patterns across model families in visual world modeling.

Significance. If the VLM-as-judge protocol proves reliable, ImageTime would address a genuine gap between single-image metrics and video evaluation by providing a lightweight yet temporally structured probe of identity preservation, spatial relations, and causal order. The four-keyframe format and explicit decomposition into predicates/constraints are methodologically sound ideas that could yield interpretable capability profiles.

major comments (1)
  1. [Abstract / VLM-as-judge protocol] Abstract (and the VLM-as-judge protocol description): the central claim that the benchmark 'reveals where current image generation systems succeed, fail, and drift' rests entirely on GPT-5.5 producing reliable scores and failure labels, yet no human calibration, inter-annotator agreement, or ablation against alternative judges is reported. This is load-bearing; without such validation the multi-family profiles risk being artifacts of the judge rather than measurements of the image models.

Simulated Authors' Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback. The major comment on VLM-as-judge validation is addressed below; we agree it is load-bearing and will strengthen the paper accordingly.

  1. Referee: [Abstract / VLM-as-judge protocol] Abstract (and the VLM-as-judge protocol description): the central claim that the benchmark 'reveals where current image generation systems succeed, fail, and drift' rests entirely on GPT-5.5 producing reliable scores and failure labels, yet no human calibration, inter-annotator agreement, or ablation against alternative judges is reported. This is load-bearing; without such validation the multi-family profiles risk being artifacts of the judge rather than measurements of the image models.

    Authors: We agree that the absence of reported validation for the GPT-5.5 judge is a genuine limitation, as the benchmark's diagnostic claims depend on judge reliability. The manuscript describes the structured prompting protocol but does not include human calibration, agreement metrics, or judge ablations. In the revised version we will add a dedicated validation subsection: (1) human annotators will score a stratified sample of 200 images using the same predicate/constraint criteria, (2) we will report inter-annotator agreement (Cohen's kappa) among humans and between humans and GPT-5.5, and (3) we will ablate against an alternative judge (Claude-3.5-Sonnet) on the same sample. The abstract and methods will be updated to reference these results. This directly addresses the risk of judge artifacts.

    Revision: yes
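The stratified sample of 200 images proposed in the rebuttal could be drawn along the following lines. This is a sketch under stated assumptions: the rebuttal does not say what the strata are or how images are allocated, so stratifying by (model family, task level), the equal per-stratum allocation, and the counts in the example are all hypothetical.

```python
import random
from collections import defaultdict


def stratified_sample(items, key, total, seed=0):
    """Draw roughly `total` items, split equally across the strata given by `key`."""
    if not items:
        raise ValueError("no items to sample from")
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    strata = defaultdict(list)
    for item in items:
        strata[key(item)].append(item)
    per_stratum = total // len(strata)
    sample = []
    for name in sorted(strata):
        group = strata[name]
        # Small strata contribute everything they have.
        sample.extend(rng.sample(group, min(per_stratum, len(group))))
    return sample


# Hypothetical pool: 4 model families x 5 task levels x 30 images each.
images = [
    {"id": f"{fam}-{lvl}-{i}", "family": fam, "level": lvl}
    for fam in ("A", "B", "C", "D")
    for lvl in range(1, 6)
    for i in range(30)
]
picked = stratified_sample(
    images, key=lambda im: (im["family"], im["level"]), total=200
)
print(len(picked))  # 200 (20 strata x 10 images)
```

Equal allocation keeps rare strata visible in the human study. Proportional allocation would be the alternative if the goal were to estimate overall agreement on the full image pool.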

Circularity Check

0 steps flagged

No significant circularity detected


The paper introduces a new benchmark (ImageTime) and an evaluation protocol using a VLM-as-judge without any mathematical derivation chain, fitted parameters, or equations that reduce to inputs by construction. No self-definitional steps, fitted-input predictions, load-bearing self-citations, uniqueness theorems, or ansatzes are present in the abstract or described protocol. The central claim rests on empirical multi-family benchmarking rather than self-referential reductions, making the work self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

This is an empirical benchmark paper with no mathematical derivations; the central claim rests on the design choices of the four-state protocol and the reliability of the GPT-5.5 judge, which are introduced without external validation in the abstract.

axioms (1)
  • domain assumption: The VLM-as-judge (GPT-5.5) can accurately and consistently score spatiotemporal consistency, state predicates, and causal violations.
    The abstract states that GPT-5.5 scores all generated images under a structured VLM-as-judge protocol, producing capability scores, diagnostic subscores, and failure labels.


