Can Image Models Imagine Time? ImageTime: A Novel Benchmark for Probing Visual World Modeling Through Spatiotemporal Consistency

Lichen Huang; Xinrui Wu

arxiv: 2606.10620 · v1 · pith:S3IEV2KPnew · submitted 2026-06-09 · 💻 cs.CV · cs.AI

Pith reviewed 2026-06-27 13:32 UTC · model grok-4.3

keywords image generation · benchmark · spatiotemporal consistency · visual world modeling · temporal reasoning · keyframe generation · causal consistency

The pith

Image generation models struggle to keep visual states consistent across ordered time steps.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents ImageTime, a benchmark that asks image generation models to produce a single image containing four ordered key states of an action: initial state, action onset, transition state, and final state. Models receive an action instruction and optionally a reference image, and the output must satisfy stage-wise predicates and cross-frame temporal constraints while avoiding forbidden causal violations. A structured VLM-as-judge protocol using GPT-5.5 produces capability scores, diagnostic subscores, and failure labels. Testing across multiple model families shows where models succeed or drift when asked to maintain identities, spatial relations, and causal order over time.

Core claim

ImageTime introduces a four-keyframe protocol as a probe of visual world modeling, requiring image models to generate one image that depicts an initial state, action onset, transition state, and final state while obeying temporal constraints and avoiding causal violations, with scores assigned by a structured VLM judge.

What carries the argument

Four-keyframe generation task with stage-wise state predicates, cross-frame temporal constraints, and forbidden causal violations, evaluated under a VLM-as-judge protocol.
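
To make the task structure concrete, here is a minimal sketch of how one such task and its judged outcome might be represented. The field names, the example predicates, and the scoring rule (zero on any forbidden violation, otherwise the fraction of satisfied checks) are assumptions for illustration only; the paper's actual schema and aggregation are not given on this page.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

# Stage names follow the four key states named in the text.
STAGES = ("initial_state", "action_onset", "transition_state", "final_state")


@dataclass
class KeyframeTask:
    """Hypothetical task record; names are illustrative, not the paper's schema."""
    instruction: str
    stage_predicates: Dict[str, List[str]]  # stage -> statements that must hold there
    temporal_constraints: List[str] = field(default_factory=list)  # cross-frame checks
    forbidden_violations: List[str] = field(default_factory=list)  # must not be observed
    reference_image: Optional[str] = None  # optional initial-state image path


def score_task(task: KeyframeTask, verdicts: Dict[str, bool]) -> float:
    """Assumed rule: any observed forbidden violation gives 0.0,
    otherwise return the fraction of predicates and constraints judged true."""
    if any(verdicts.get(v, False) for v in task.forbidden_violations):
        return 0.0
    required = [p for s in STAGES for p in task.stage_predicates.get(s, [])]
    required += task.temporal_constraints
    if not required:
        return 0.0
    return sum(verdicts.get(r, False) for r in required) / len(required)


# Made-up example task and made-up judge verdicts.
task = KeyframeTask(
    instruction="pour water from a jug into an empty glass",
    stage_predicates={
        "initial_state": ["glass is empty"],
        "action_onset": ["jug is tilted over glass"],
        "transition_state": ["glass is partly full"],
        "final_state": ["glass is full"],
    },
    temporal_constraints=["same glass in all four panels"],
    forbidden_violations=["glass is full before pouring starts"],
)
verdicts = {
    "glass is empty": True,
    "jug is tilted over glass": True,
    "glass is partly full": False,
    "glass is full": True,
    "same glass in all four panels": True,
    "glass is full before pouring starts": False,
}
print(score_task(task, verdicts))  # 4 of 5 checks satisfied, no violation
```

In this sketch the verdicts dictionary stands in for whatever the VLM judge returns; keeping the three check types in separate fields is what would let a scorer report them as separate subscores.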

If this is right

Models that score well on ImageTime would be stronger candidates for storyboarding, step-by-step illustration, and reference-guided editing workflows.
Diagnostic subscores isolate specific failure types such as identity drift or causal order violations across the four states.
Progressive task hierarchy allows measurement of incremental improvements in temporal coherence without requiring full video generation.
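
As an illustration of how failure labels could be turned into per-model diagnostic profiles, the sketch below computes the share of each model's outputs that carry a given label. The label names ("identity_drift", "causal_order") and the record format are hypothetical; the page does not list the paper's actual label set.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple


def failure_rates(
    records: List[Tuple[str, Iterable[str]]]
) -> Dict[str, Dict[str, float]]:
    """For each model, return the fraction of its judged images carrying each label.

    records: (model_name, failure_labels) pairs, one per judged image.
    """
    totals: Counter = Counter()          # images judged per model
    counts: Dict[str, Counter] = {}      # label counts per model
    for model, labels in records:
        totals[model] += 1
        # set() ensures a label is counted at most once per image
        counts.setdefault(model, Counter()).update(set(labels))
    return {
        model: {label: n / totals[model] for label, n in label_counts.items()}
        for model, label_counts in counts.items()
    }


# Made-up records for two hypothetical models.
records = [
    ("model_a", ["identity_drift"]),
    ("model_a", []),
    ("model_a", ["identity_drift", "causal_order"]),
    ("model_b", ["causal_order"]),
]
rates = failure_rates(records)
```

Here `rates["model_a"]["identity_drift"]` is 2/3 because two of model_a's three images carry that label. A profile like this only reflects the models if the judge's labels are themselves trustworthy.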

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The benchmark could be adapted to test whether models improve when given explicit causal chain instructions rather than single action prompts.
Results may inform whether reference images help more with identity preservation than with transition logic.

Load-bearing premise

The GPT-5.5 VLM-as-judge protocol produces reliable, unbiased scores for spatiotemporal consistency and causal violations.

What would settle it

A systematic comparison showing frequent disagreement between human judges and GPT-5.5 scores on whether a generated image violates causal order or identity preservation would falsify the evaluation method.

Original abstract

Image generation models now produce high-quality static images, yet their ability to represent how a visual world changes over time remains poorly understood. Practical workflows such as storyboarding, step-by-step illustration, reference-guided editing, and video previsualization require models to preserve identities, objects, spatial relations, and causal order across multiple visual states. Existing evaluations largely measure single-image correctness, compositional alignment, or video quality, leaving open whether an image model can coherently imagine a temporally ordered process. We introduce ImageTime, a diagnostic benchmark that uses spatiotemporal consistency as a behavioral probe of visual world modeling in image generation. Given an action instruction, and optionally a reference image specifying the initial state, a model must generate one image containing four ordered key states: initial state, action onset, transition state, and final state. This four-keyframe protocol is more temporally demanding than single-image generation while avoiding the confounds of dense video dynamics. ImageTime organizes tasks with a progressive capability hierarchy and decomposes each scenario into stage-wise state predicates, cross-frame temporal constraints, and forbidden causal violations. GPT-5.5 scores all generated images under a structured VLM-as-judge protocol, producing interpretable capability scores, diagnostic subscores, and failure labels. Through multi-family benchmarking, ImageTime reveals where current image generation systems succeed, fail, and drift when asked to maintain coherent visual world states over time.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

ImageTime puts forward a four-keyframe benchmark for testing temporal consistency in image generators, but the scoring rests entirely on an unvalidated GPT-5.5 judge.

The main takeaway is a new benchmark that asks image models to generate one image containing four ordered key states for a given action, then scores the result for consistency across those states. The protocol breaks each task into stage-wise predicates, cross-frame constraints, and forbidden violations, which is a clearer diagnostic structure than most single-image or video evals I have seen.

The four-keyframe design itself is the clearest piece of new work. It forces models to handle the initial state, action onset, transition state, and final state in one output without the full motion complexity of video. Organizing tasks by capability hierarchy and running the same protocol across multiple model families gives a practical way to map where different systems hold up or drift on identity, spatial relations, and causal order.

The soft spot is the evaluation method. All scores and failure labels come from a structured GPT-5.5 prompt with no reported human calibration, inter-annotator numbers, or ablation against other judges. If the VLM systematically misses or overflags certain violations, the success/failure profiles become judge artifacts rather than model measurements. The abstract presents the scores as reliable and interpretable, but that claim needs evidence that is not visible here.

This is aimed at people building or evaluating generative image and video systems who need finer-grained checks on sequential coherence. A reader working on storyboarding tools or reference-guided editing would get direct use from the task format and subscores.

I would send it to peer review. The benchmark protocol is distinct enough to be worth referee time, provided the authors add validation for the judge step. Without that, the results stay hard to trust.

Referee Report

1 major / 0 minor

Summary. The paper introduces ImageTime, a diagnostic benchmark for image generation models that requires generating a single image containing four ordered key states (initial state, action onset, transition state, final state) for a given action instruction (optionally with a reference image). Tasks are organized by a progressive capability hierarchy and decomposed into stage-wise predicates, temporal constraints, and forbidden violations; all outputs are scored for spatiotemporal consistency and causal violations via a structured GPT-5.5 VLM-as-judge protocol, with the goal of revealing success, failure, and drift patterns across model families in visual world modeling.

Significance. If the VLM-as-judge protocol proves reliable, ImageTime would address a genuine gap between single-image metrics and video evaluation by providing a lightweight yet temporally structured probe of identity preservation, spatial relations, and causal order. The four-keyframe format and explicit decomposition into predicates/constraints are methodologically sound ideas that could yield interpretable capability profiles.

major comments (1)

[Abstract / VLM-as-judge protocol] Abstract (and the VLM-as-judge protocol description): the central claim that the benchmark 'reveals where current image generation systems succeed, fail, and drift' rests entirely on GPT-5.5 producing reliable scores and failure labels, yet no human calibration, inter-annotator agreement, or ablation against alternative judges is reported. This is load-bearing; without such validation the multi-family profiles risk being artifacts of the judge rather than measurements of the image models.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback. The major comment on VLM-as-judge validation is addressed below; we agree it is load-bearing and will strengthen the paper accordingly.

Referee: [Abstract / VLM-as-judge protocol] Abstract (and the VLM-as-judge protocol description): the central claim that the benchmark 'reveals where current image generation systems succeed, fail, and drift' rests entirely on GPT-5.5 producing reliable scores and failure labels, yet no human calibration, inter-annotator agreement, or ablation against alternative judges is reported. This is load-bearing; without such validation the multi-family profiles risk being artifacts of the judge rather than measurements of the image models.

Authors: We agree that the absence of reported validation for the GPT-5.5 judge is a genuine limitation, as the benchmark's diagnostic claims depend on judge reliability. The manuscript describes the structured prompting protocol but does not include human calibration, agreement metrics, or judge ablations. In the revised version we will add a dedicated validation subsection: (1) human annotators will score a stratified sample of 200 images using the same predicate/constraint criteria, (2) we will report inter-annotator agreement (Cohen's kappa) among humans and between humans and GPT-5.5, and (3) we will ablate against an alternative judge (Claude-3.5-Sonnet) on the same sample. The abstract and methods will be updated to reference these results. This directly addresses the risk of judge artifacts.

revision: yes
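
For readers unfamiliar with the agreement statistic named in the rebuttal, the following is a small self-contained sketch of Cohen's kappa for two raters labeling the same items. The label values and the eight example judgments are made up for illustration; nothing here comes from the paper's data.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(rater_a: Sequence[Hashable], rater_b: Sequence[Hashable]) -> float:
    """Cohen's kappa: (p_o - p_e) / (1 - p_e), where p_o is observed agreement
    and p_e is the agreement expected by chance from each rater's label frequencies."""
    if len(rater_a) != len(rater_b) or len(rater_a) == 0:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(rater_a)
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(freq_a[k] * freq_b[k] for k in freq_a.keys() | freq_b.keys()) / (n * n)
    if p_e == 1.0:
        # Both raters used a single identical label for every item.
        return 1.0
    return (p_o - p_e) / (1.0 - p_e)


# Made-up labels: a human annotator vs. a VLM judge on eight images.
human = ["ok", "ok", "violation", "violation", "ok", "violation", "ok", "ok"]
judge = ["ok", "ok", "violation", "ok", "ok", "violation", "ok", "violation"]
kappa = cohens_kappa(human, judge)
print(round(kappa, 3))  # 0.467
```

In this toy case the raters agree on 6 of 8 items (p_o = 0.75), but chance agreement is already 34/64, so kappa is only 7/15. That gap between raw agreement and kappa is why the statistic is a reasonable choice for validating a judge on labels with skewed frequencies.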

Circularity Check

0 steps flagged

No significant circularity detected

The paper introduces a new benchmark (ImageTime) and an evaluation protocol using a VLM-as-judge without any mathematical derivation chain, fitted parameters, or equations that reduce to inputs by construction. No self-definitional steps, fitted-input predictions, load-bearing self-citations, uniqueness theorems, or ansatzes are present in the abstract or described protocol. The central claim rests on empirical multi-family benchmarking rather than self-referential reductions, making the work self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

This is an empirical benchmark paper with no mathematical derivations; the central claim rests on the design choices of the four-state protocol and the reliability of the GPT-5.5 judge, which are introduced without external validation in the abstract.

axioms (1)

domain assumption: VLM-as-judge (GPT-5.5) can accurately and consistently score spatiotemporal consistency, state predicates, and causal violations.
Abstract states that GPT-5.5 scores all generated images under a structured VLM-as-judge protocol producing capability scores and failure labels.

pith-pipeline@v0.9.1-grok · 5786 in / 1056 out tokens · 16568 ms · 2026-06-27T13:32:57.048728+00:00 · methodology

Reference graph

Works this paper leans on

72 extracted references · 9 linked inside Pith

[1]

Video generation models as world simulators.OpenAI Blog, 1(8):1, 2024

Tim Brooks, Bill Peebles, Connor Holmes, Will DePue, Yufei Guo, Leo Jing, David Schnurr, Joe Taylor, Troy Luhman, Eric Luhman, et al. Video generation models as world simulators.OpenAI Blog, 1(8):1, 2024

2024
[2]

Worldscore: A unified evaluation benchmark for world generation

Haoyi Duan, Hong-Xing Yu, Sirui Chen, Li Fei-Fei, and Jiajun Wu. Worldscore: A unified evaluation benchmark for world generation. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 27713–27724, 2025

2025
[3]

Worldmodelbench: Judging video generation models as world models.Advances in Neural Information Processing Systems, 38, 2026

Dacheng Li, Yunhao Fang, Yukang Chen, Shuo Yang, Shiyi Cao, Justin Wong, Michael Luo, Xiaolong Wang, Hongxu Yin, Joseph Gonzalez, et al. Worldmodelbench: Judging video generation models as world models.Advances in Neural Information Processing Systems, 38, 2026

2026
[4]

Simulating the visual world with artificial intelligence: A roadmap.arXiv preprint arXiv:2511.08585, 2025

Jingtong Yue, Ziqi Huang, Zhaoxi Chen, Xintao Wang, Pengfei Wan, and Ziwei Liu. Simulating the visual world with artificial intelligence: A roadmap.arXiv preprint arXiv:2511.08585, 2025

arXiv 2025
[5]

Videophy: Evaluating physical commonsense for video generation

Hritik Bansal, Zongyu Lin, Tianyi Xie, Zeshun Zong, Michal Yarom, Yonatan Bitton, Chenfanfu Jiang, Yizhou Sun, Kai-Wei Chang, and Aditya Grover. Videophy: Evaluating physical commonsense for video generation. InInternational Conference on Learning Representations, volume 2025, pages 102075–102121, 2025

2025
[6]

How far is video generation from world model: A physical law perspective.arXiv preprint arXiv:2411.02385, 2024

Bingyi Kang, Yang Yue, Rui Lu, Zhĳie Lin, Yang Zhao, Kaixin Wang, Gao Huang, and Jiashi Feng. How far is video generation from world model: A physical law perspective.arXiv preprint arXiv:2411.02385, 2024

Pith/arXiv arXiv 2024
[7]

Tc-bench: Benchmarking temporal compositionality in text-to-video and image-to-video generation.arXiv preprint arXiv:2406.08656, 2024

Weixi Feng, Jiachen Li, Michael Saxon, Tsu-jui Fu, Wenhu Chen, and William Yang Wang. Tc-bench: Benchmarking temporal compositionality in text-to-video and image-to-video generation.arXiv preprint arXiv:2406.08656, 2024

arXiv 2024
[8]

High-resolution image synthesis with latent diffusion models

RobinRombach,AndreasBlattmann,DominikLorenz,PatrickEsser,andBjörnOmmer. High-resolution image synthesis with latent diffusion models. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10684–10695, 2022

2022
[9]

Photorealistic text-to- image diffusion models with deep language understanding.Advances in neural information processing systems, 35:36479–36494, 2022

Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily L Denton, Kamyar Ghasemipour, Raphael Gontĳo Lopes, Burcu Karagol Ayan, Tim Salimans, et al. Photorealistic text-to- image diffusion models with deep language understanding.Advances in neural information processing systems, 35:36479–36494, 2022

2022
[10]

Sdxl: Improving latent diffusion models for high-resolution image synthesis

Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and Robin Rombach. Sdxl: Improving latent diffusion models for high-resolution image synthesis. In International Conference on Learning Representations, volume 2024, pages 1862–1874, 2024

2024
[11]

Scalingrectifiedflowtransformersforhigh-resolution image synthesis

Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, DominikLorenz,AxelSauer,FredericBoesel,etal. Scalingrectifiedflowtransformersforhigh-resolution image synthesis. InForty-first international conference on machine learning, 2024

2024
[12]

Improvingimagegenerationwithbettercaptions.ComputerScience.https://cdn

James Betker, Gabriel Goh, Li Jing, Tim Brooks, Jianfeng Wang, Linjie Li, Long Ouyang, Juntang Zhuang, JoyceLee,YufeiGuo,etal. Improvingimagegenerationwithbettercaptions.ComputerScience.https://cdn. openai. com/papers/dall-e-3. pdf, 2(3):8, 2023

2023
[13]

Story2board: a training-free approach for expressive storyboard generation.arXiv preprint arXiv:2508.09983, 2025

David Dinkevich, Matan Levy, Omri Avrahami, Dvir Samuel, and Dani Lischinski. Story2board: a training-free approach for expressive storyboard generation.arXiv preprint arXiv:2508.09983, 2025

arXiv 2025
[14]

Envision: Benchmarking unified understanding & generation for causal world process insights.arXiv preprint arXiv:2512.01816, 2025

Juanxi Tian, Siyuan Li, Conghui He, Lĳun Wu, and Cheng Tan. Envision: Benchmarking unified understanding & generation for causal world process insights.arXiv preprint arXiv:2512.01816, 2025. 2026.06 Preprint 17

arXiv 2025
[15]

Videodirectorgpt: Consistent multi-scene video generation via llm-guided planning.arXiv preprint arXiv:2309.15091, 2023

Han Lin, Abhay Zala, Jaemin Cho, and Mohit Bansal. Videodirectorgpt: Consistent multi-scene video generation via llm-guided planning.arXiv preprint arXiv:2309.15091, 2023

arXiv 2023
[16]

Autostudio: Crafting consistent subjects in multi-turn interactive image generation.arXiv preprint arXiv:2406.01388, 2024

Junhao Cheng, Xi Lu, Hanhui Li, Khun Loun Zai, Baiqiao Yin, Yuhao Cheng, Yiqiang Yan, and Xiaodan Liang. Autostudio: Crafting consistent subjects in multi-turn interactive image generation.arXiv preprint arXiv:2406.01388, 2024

arXiv 2024
[17]

Multiref: Controllable image generation with multiple visual references

Ruoxi Chen, Dongping Chen, Siyuan Wu, Sinan Wang, Shiyun Lang, Peter Sushko, Gaoyang Jiang, Yao Wan, and Ranjay Krishna. Multiref: Controllable image generation with multiple visual references. In Proceedings of the 33rd ACM International Conference on Multimedia, pages 13325–13331, 2025

2025
[18]

Geneval: An object-focused framework for evaluating text-to-image alignment.Advances in Neural Information Processing Systems, 36:52132–52152, 2023

Dhruba Ghosh, Hannaneh Hajishirzi, and Ludwig Schmidt. Geneval: An object-focused framework for evaluating text-to-image alignment.Advances in Neural Information Processing Systems, 36:52132–52152, 2023

2023
[19]

T2i-compbench: A comprehensive benchmark for open-world compositional text-to-image generation.Advances in Neural Information Processing Systems, 36:78723–78747, 2023

Kaiyi Huang, Kaiyue Sun, Enze Xie, Zhenguo Li, and Xihui Liu. T2i-compbench: A comprehensive benchmark for open-world compositional text-to-image generation.Advances in Neural Information Processing Systems, 36:78723–78747, 2023

2023
[20]

Tifa: Accurate and interpretable text-to-image faithfulness evaluation with question answering

Yushi Hu, Benlin Liu, Jungo Kasai, Yizhong Wang, Mari Ostendorf, Ranjay Krishna, and Noah A Smith. Tifa: Accurate and interpretable text-to-image faithfulness evaluation with question answering. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 20406–20417, 2023

2023
[21]

Davidsonianscenegraph:Improvingreliabilityinfine-grainedevaluation for text-to-image generation

Jaemin Cho, Yushi Hu, Jason Baldridge, Roopal Garg, Peter Anderson, Ranjay Krishna, Mohit Bansal, JordiPont-Tuset,andSuWang. Davidsonianscenegraph:Improvingreliabilityinfine-grainedevaluation for text-to-image generation. InInternational conference on learning representations, volume 2024, pages 15625–15645, 2024

2024
[22]

Phybench: A physical commonsense benchmark for evaluating text-to-image models.arXiv preprint arXiv:2406.11802, 2024

Fanqing Meng, Wenqi Shao, Lixin Luo, Yahong Wang, Yiran Chen, Quanfeng Lu, Yue Yang, Tianshuo Yang, Kaipeng Zhang, Yu Qiao, et al. Phybench: A physical commonsense benchmark for evaluating text-to-image models.arXiv preprint arXiv:2406.11802, 2024

arXiv 2024
[23]

Revisiting text-to-image evaluation with gecko: on metrics, prompts, and human rating

Olivia Wiles, Chuhan Zhang, Isabela Albuquerque, Ivana Kajić, Su Wang, Emanuele Bugliarello, Yasumasa Onoe, Pinelopi Papalampidi, Ira Ktena, Christopher Knutsen, et al. Revisiting text-to-image evaluation with gecko: on metrics, prompts, and human rating. InInternational Conference on Learning Representations, volume 2025, pages 272–287, 2025

2025
[24]

Vbench: Comprehensive benchmark suite for video generative models

Ziqi Huang, Yinan He, Jiashuo Yu, Fan Zhang, Chenyang Si, Yuming Jiang, Yuanhan Zhang, Tianxing Wu, Qingyang Jin, Nattapol Chanpaisit, et al. Vbench: Comprehensive benchmark suite for video generative models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 21807–21818, 2024

2024
[25]

Evalcrafter: Benchmarking and evaluating large video generation models

Yaofang Liu, Xiaodong Cun, Xuebo Liu, Xintao Wang, Yong Zhang, Haoxin Chen, Yang Liu, Tieyong Zeng, Raymond Chan, and Ying Shan. Evalcrafter: Benchmarking and evaluating large video generation models. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 22139–22149, 2024

2024
[26]

Image as a world: Generating interactive world from single image via panoramic video generation.Advances in Neural Information Processing Systems, 38: 172611–172634, 2026

Dongnan Gui, Xun Guo, Wengang Zhou, and Yan Lu. Image as a world: Generating interactive world from single image via panoramic video generation.Advances in Neural Information Processing Systems, 38: 172611–172634, 2026

2026
[27]

A recipe for generating 3d worlds from a single image

Katja Schwarz, Denis Rozumny, Samuel Rota Bulò, Lorenzo Porzi, and Peter Kontschieder. A recipe for generating 3d worlds from a single image. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 3520–3530, 2025

2025
[28]

Gpt-4v (ision) as a generalist evaluator for vision-language tasks.arXiv preprint arXiv:2311.01361, 2023

XinluZhang,YujieLu,WeizhiWang,AnYan,JunYan,LiankeQin,HengWang,XifengYan,WilliamYang Wang, and Linda Ruth Petzold. Gpt-4v (ision) as a generalist evaluator for vision-language tasks.arXiv preprint arXiv:2311.01361, 2023

arXiv 2023
[29]

Lmm4lmm: Benchmarking and evaluating large-multimodal image generation with lmms

Jiarui Wang, Huiyu Duan, Yu Zhao, Juntong Wang, Guangtao Zhai, and Xiongkuo Min. Lmm4lmm: Benchmarking and evaluating large-multimodal image generation with lmms. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 17312–17323, 2025

2025
[30]

Gpt-4v (ision) is a human-aligned evaluator for text-to-3d generation

Tong Wu, Guandao Yang, Zhibing Li, Kai Zhang, Ziwei Liu, Leonidas Guibas, Dahua Lin, and Gordon Wetzstein. Gpt-4v (ision) is a human-aligned evaluator for text-to-3d generation. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 22227–22238, 2024. 2026.06 Preprint 18

2024
[31]

Imagenworld: Stress-testing image generation models with explainable human evaluation on open-ended real-world tasks.arXiv preprint arXiv:2603.27862, 2026

Samin Mahdizadeh Sani, Max Ku, Nima Jamali, Matina Mahdizadeh Sani, Paria Khoshtab, Wei-Chieh Sun, Parnian Fazel, Zhi Rui Tam, Thomas Chong, Edisy Kin Wai Chan, et al. Imagenworld: Stress-testing image generation models with explainable human evaluation on open-ended real-world tasks.arXiv preprint arXiv:2603.27862, 2026

arXiv 2026
[32]

Scaling autoregressive models for content-rich text-to-image generation.arXiv preprint arXiv:2206.10789, 2(3):5, 2022

Jiahui Yu, Yuanzhong Xu, Jing Yu Koh, Thang Luong, Gunjan Baid, Zirui Wang, Vĳay Vasudevan, Alexander Ku, Yinfei Yang, Burcu Karagol Ayan, et al. Scaling autoregressive models for content-rich text-to-image generation.arXiv preprint arXiv:2206.10789, 2(3):5, 2022

Pith/arXiv arXiv 2022
[33]

Evaluating and improving compositional text-to-visual generation

BaiqiLi,ZhiqiuLin,DeepakPathak,JiayaoLi,YixinFei,KewenWu,XideXia,PengchuanZhang,Graham Neubig, and Deva Ramanan. Evaluating and improving compositional text-to-visual generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5290–5301, 2024

2024
[34]

Holistic evaluation of text-to-image models

Tony Lee, Michihiro Yasunaga, Chenlin Meng, Yifan Mai, Joon Sung Park, Agrim Gupta, Yunzhi Zhang, Deepak Narayanan, Hannah Teufel, Marco Bellagente, et al. Holistic evaluation of text-to-image models. Advances in Neural Information Processing Systems, 36:69981–70011, 2023

2023
[35]

Oneig-bench:Omni-dimensionalnuancedevaluationforimagegeneration.Advances in Neural Information Processing Systems, 38, 2026

Jingjing Chang, Yixiao Fang, Peng Xing, Shuhan Wu, Wei Cheng, Rui Wang, Xianfang Zeng, Gang Yu, andHai-BaoChen. Oneig-bench:Omni-dimensionalnuancedevaluationforimagegeneration.Advances in Neural Information Processing Systems, 38, 2026

2026
[36]

Viescore: Towards explainable metrics for conditional image synthesis evaluation

Max Ku, Dongfu Jiang, Cong Wei, Xiang Yue, and Wenhu Chen. Viescore: Towards explainable metrics for conditional image synthesis evaluation. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 12268–12290, 2024

2024
[37]

Evaluating text-to-visual generation with image-to-text generation

Zhiqiu Lin, Deepak Pathak, Baiqi Li, Jiayao Li, Xide Xia, Graham Neubig, Pengchuan Zhang, and Deva Ramanan. Evaluating text-to-visual generation with image-to-text generation. InEuropean Conference on Computer Vision, pages 366–384. Springer, 2024

2024
[38]

Imagereward: Learning and evaluating human preferences for text-to-image generation.Advances in Neural Information Processing Systems, 36:15903–15935, 2023

Jiazheng Xu, Xiao Liu, Yuchen Wu, Yuxuan Tong, Qinkai Li, Ming Ding, Jie Tang, and Yuxiao Dong. Imagereward: Learning and evaluating human preferences for text-to-image generation.Advances in Neural Information Processing Systems, 36:15903–15935, 2023

2023
[39]

Pick-a-pic: Anopendatasetofuserpreferencesfortext-to-imagegeneration.Advancesinneuralinformationprocessing systems, 36:36652–36663, 2023

Yuval Kirstain, Adam Polyak, Uriel Singer, Shahbuland Matiana, Joe Penna, and Omer Levy. Pick-a-pic: Anopendatasetofuserpreferencesfortext-to-imagegeneration.Advancesinneuralinformationprocessing systems, 36:36652–36663, 2023

2023
[40]

Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation

Nataniel Ruiz, Yuanzhen Li, Varun Jampani, Yael Pritch, Michael Rubinstein, and Kfir Aberman. Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 22500–22510, 2023

2023
[41]

Dreamtuner: Single image is enough for subject-driven generation.arXiv preprint arXiv:2312.13691, 2023

Miao Hua, Jiawei Liu, Fei Ding, Wei Liu, Jie Wu, and Qian He. Dreamtuner: Single image is enough for subject-driven generation.arXiv preprint arXiv:2312.13691, 2023

arXiv 2023
[42]

Blip-diffusion: Pre-trained subject representation for controllable text-to-image generation and editing.Advances in Neural Information Processing Systems, 36:30146–30166, 2023

Dongxu Li, Junnan Li, and Steven Hoi. Blip-diffusion: Pre-trained subject representation for controllable text-to-image generation and editing.Advances in Neural Information Processing Systems, 36:30146–30166, 2023

2023
[43]

Ip-adapter:Textcompatibleimagepromptadapter for text-to-image diffusion models.arXiv preprint arXiv:2308.06721, 2023

HuYe,JunZhang,SiboLiu,XiaoHan,andWeiYang. Ip-adapter:Textcompatibleimagepromptadapter for text-to-image diffusion models.arXiv preprint arXiv:2308.06721, 2023

Pith/arXiv arXiv 2023
[44]

Adding conditional control to text-to-image diffusion models

Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. Adding conditional control to text-to-image diffusion models. InProceedings of the IEEE/CVF international conference on computer vision, pages 3836–3847, 2023

2023
[45]

Dreambench++: A human-aligned benchmark for personalized image generation

Yuang Peng, Yuxin Cui, Haomiao Tang, Zekun Qi, Runpei Dong, Jing Bai, Zheng Ge, Xiangyu Zhang, Shu-Tao Xia, et al. Dreambench++: A human-aligned benchmark for personalized image generation. In International Conference on Learning Representations, volume 2025, pages 46010–46032, 2025

2025
[46]

Dsh-bench: A difficulty-and scenario-aware benchmark with hierarchical subject taxonomy for subject-driven text-to-image generation.arXiv preprint arXiv:2603.08090, 2026

Zhenyu Hu, Qing Wang, Te Cao, Luo Liao, Longfei Lu, Liqun Liu, Shuang Li, Hang Chen, Mengge Xue, Yuan Chen, et al. Dsh-bench: A difficulty-and scenario-aware benchmark with hierarchical subject taxonomy for subject-driven text-to-image generation.arXiv preprint arXiv:2603.08090, 2026

Pith/arXiv arXiv 2026
[47]

FLUX.2: Frontier visual intelligence

Black Forest Labs. FLUX.2: Frontier visual intelligence. https://bfl.ai/blog/flux-2, 2025

2025
[48]

GPT Image 2 model

OpenAI. GPT Image 2 model. https://developers.openai.com/api/docs/models/gpt-image-2, 2026

2026
[49]

System card: ChatGPT Images 2.0 and thinking mode

OpenAI. System card: ChatGPT Images 2.0 and thinking mode. https://deploymentsafety.openai.com/ chatgpt-images-2-0/chatgpt-images-2-0.pdf, 2026. 2026.06 Preprint 19

2026
[50]

Gemini 3.1 Flash Image model card

Google DeepMind. Gemini 3.1 Flash Image model card. https://deepmind.google/models/model-car ds/gemini-3-1-flash-image/, 2026

2026
[51]

Deeper thinking, more accurate generation: Introducing Seedream 5.0 Lite

ByteDance Seed Team. Deeper thinking, more accurate generation: Introducing Seedream 5.0 Lite. https://seed.bytedance.com/en/blog/deeper-thinking-more-accurate-generation-introducing-seedr eam-5-0-lite, 2026

2026
[52]

Qwen-Image-2512

Qwen Team. Qwen-Image-2512. https://huggingface.co/Qwen/Qwen-Image-2512, 2026

2026
[53]

Qwen-image technical report, 2025

Chenfei Wu, Jiahao Li, Jingren Zhou, Junyang Lin, Kaiyuan Gao, Kun Yan, Sheng ming Yin, Shuai Bai, Xiao Xu, Yilei Chen, Yuxiang Chen, Zecheng Tang, Zekai Zhang, Zhengyi Wang, An Yang, Bowen Yu, Chen Cheng, Dayiheng Liu, Deqing Li, Hang Zhang, Hao Meng, Hu Wei, Jingyuan Ni, Kai Chen, Kuan Cao, Liang Peng, Lin Qu, Minggang Wu, Peng Wang, Shuting Yu, Tingkun...

Pith/arXiv arXiv 2025
[54]

HunyuanImage 2.1: An efficient diffusion model for high-resolution (2k) text-to-image generation

Tencent Hunyuan Team. HunyuanImage 2.1: An efficient diffusion model for high-resolution (2k) text-to-image generation. https://github.com/Tencent-Hunyuan/HunyuanImage-2.1, 2025

2025
[55]

Z-image: An efficient image generation foundation model with single-stream diffusion transformer.arXiv preprint arXiv:2511.22699, 2025

Huanqia Cai, Sihan Cao, Ruoyi Du, Peng Gao, Steven Hoi, Zhaohui Hou, Shĳie Huang, Dengyang Jiang, Xin Jin, Liangchen Li, et al. Z-image: An efficient image generation foundation model with single-stream diffusion transformer.arXiv preprint arXiv:2511.22699, 2025

Pith/arXiv arXiv 2025
[56]

Training- free consistent text-to-image generation.ACM Transactions on Graphics (TOG), 43(4):1–18, 2024

Yoad Tewel, Omri Kaduri, Rinon Gal, Yoni Kasten, Lior Wolf, Gal Chechik, and Yuval Atzmon. Training- free consistent text-to-image generation.ACM Transactions on Graphics (TOG), 43(4):1–18, 2024

2024
[57]

Storydiffusion: Consistent self-attention for long-range image and video generation.Advances in Neural Information Processing Systems, 37:110315–110340, 2024

Yupeng Zhou, Daquan Zhou, Ming-Ming Cheng, Jiashi Feng, and Qibin Hou. Storydiffusion: Consistent self-attention for long-range image and video generation.Advances in Neural Information Processing Systems, 37:110315–110340, 2024

2024
[58]

Storymaker: Towards holistic consistent characters in text-to-image generation. arXiv preprint arXiv:2409.12576, 2024

Zhengguang Zhou, Jing Li, Huaxia Li, Nemo Chen, and Xu Tang. Storymaker: Towards holistic consistent characters in text-to-image generation. arXiv preprint arXiv:2409.12576, 2024

arXiv 2024
[59]

Infinite-story: A training-free consistent text-to-image generation

Jihun Park, Kyoungmin Lee, Jongmin Gim, Hyeonseo Jo, Minseok Oh, Wonhyeok Choi, Kyumin Hwang, Jaeyeul Kim, Minwoo Choi, and Sunghoon Im. Infinite-story: A training-free consistent text-to-image generation. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 8278–8286, 2026

2026
[60]

Vinabench: Benchmark for faithful and consistent visual narratives

Silin Gao, Sheryl Mathew, Li Mi, Sepideh Mamooler, Mengjie Zhao, Hiromi Wakaki, Yuki Mitsufuji, Syrielle Montariol, and Antoine Bosselut. Vinabench: Benchmark for faithful and consistent visual narratives. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 2870–2879, 2025

2025
[61]

R2i-bench: Benchmarking reasoning-driven text-to-image generation

Kaijie Chen, Zihao Lin, Zhiyang Xu, Ying Shen, Yuguang Yao, Joy Rimchala, Jiaxin Zhang, and Lifu Huang. R2i-bench: Benchmarking reasoning-driven text-to-image generation. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 12606–12641, 2025

2025
[62]

Gir-bench: Versatile benchmark for generating images with reasoning

Hongxiang Li, Yaowei Li, Bin Lin, Yuwei Niu, Yuhang Yang, Xiaoshuang Huang, Jiayin Cai, Xiaolong Jiang, Yao Hu, and Long Chen. Gir-bench: Versatile benchmark for generating images with reasoning. arXiv preprint arXiv:2510.11026, 2025

arXiv 2025
[63]

Fetv: A benchmark for fine-grained evaluation of open-domain text-to-video generation. Advances in Neural Information Processing Systems, 36:62352–62387, 2023

Yuanxin Liu, Lei Li, Shuhuai Ren, Rundong Gao, Shicheng Li, Sishuo Chen, Xu Sun, and Lu Hou. Fetv: A benchmark for fine-grained evaluation of open-domain text-to-video generation. Advances in Neural Information Processing Systems, 36:62352–62387, 2023

2023
[64]

T2v-compbench: A comprehensive benchmark for compositional text-to-video generation

Kaiyue Sun, Kaiyi Huang, Xian Liu, Yue Wu, Zihan Xu, Zhenguo Li, and Xihui Liu. T2v-compbench: A comprehensive benchmark for compositional text-to-video generation. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 8406–8416, 2025

2025
[65]

Vbench-2.0: Advancing video generation benchmark suite for intrinsic faithfulness. arXiv preprint arXiv:2503.21755, 2025

Dian Zheng, Ziqi Huang, Hongbo Liu, Kai Zou, Yinan He, Fan Zhang, Lulu Gu, Yuanhan Zhang, Jingwen He, Wei-Shi Zheng, et al. Vbench-2.0: Advancing video generation benchmark suite for intrinsic faithfulness. arXiv preprint arXiv:2503.21755, 2025

Pith/arXiv arXiv 2025
[66]

T2vworldbench: A benchmark for evaluating world knowledge in text-to-video generation

Yubin Chen, Xuyang Guo, Zhenmei Shi, Zhao Song, and Jiahao Zhang. T2vworldbench: A benchmark for evaluating world knowledge in text-to-video generation. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 6474–6485, 2026

2026
[67]

Videoverse: How far is your t2v generator from a world model? arXiv preprint arXiv:2510.08398, 2025

Zeqing Wang, Xinyu Wei, Bairui Li, Zhen Guo, Jinrui Zhang, Hongyang Wei, Keze Wang, and Lei Zhang. Videoverse: How far is your t2v generator from a world model? arXiv preprint arXiv:2510.08398, 2025

Pith/arXiv arXiv 2025
[68]

Worldbench: Disambiguating physics for diagnostic evaluation of world models. arXiv preprint arXiv:2601.21282, 2026

Rishi Upadhyay, Howard Zhang, Jim Solomon, Ayush Agrawal, Pranay Boreddy, Shruti Satya Narayana, Yunhao Ba, Alex Wong, Celso M de Melo, and Achuta Kadambi. Worldbench: Disambiguating physics for diagnostic evaluation of world models. arXiv preprint arXiv:2601.21282, 2026

arXiv 2026
[69]

Out of sight, out of mind? evaluating state evolution in video world models. arXiv preprint arXiv:2603.13215, 2026

Ziqi Ma, Mengzhan Liufu, and Georgia Gkioxari. Out of sight, out of mind? evaluating state evolution in video world models. arXiv preprint arXiv:2603.13215, 2026

arXiv 2026
[70]

Clevrer: Collision events for video representation and reasoning. arXiv preprint arXiv:1910.01442, 2019

Kexin Yi, Chuang Gan, Yunzhu Li, Pushmeet Kohli, Jiajun Wu, Antonio Torralba, and Joshua B Tenenbaum. Clevrer: Collision events for video representation and reasoning. arXiv preprint arXiv:1910.01442, 2019

Pith/arXiv arXiv 2019
[71]

Intphys: A framework and benchmark for visual intuitive physics reasoning

Ronan Riochet, Mario Ynocente Castro, Mathieu Bernard, Adam Lerer, Rob Fergus, Véronique Izard, and Emmanuel Dupoux. Intphys: A framework and benchmark for visual intuitive physics reasoning. arXiv preprint arXiv:1803.07616, 2018

arXiv 2018
[72]

Physion: Evaluating physical prediction from vision in humans and machines. arXiv preprint arXiv:2106.08261, 2021

Daniel M Bear, Elias Wang, Damian Mrowca, Felix J Binder, Hsiao-Yu Fish Tung, RT Pramod, Cameron Holdaway, Sirui Tao, Kevin Smith, Fan-Yun Sun, et al. Physion: Evaluating physical prediction from vision in humans and machines. arXiv preprint arXiv:2106.08261, 2021

arXiv 2021
