Can Image Models Imagine Time? ImageTime: A Novel Benchmark for Probing Visual World Modeling Through Spatiotemporal Consistency
Pith reviewed 2026-06-27 13:32 UTC · model grok-4.3
The pith
Image generation models struggle to keep visual states consistent across ordered time steps.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
ImageTime introduces a four-keyframe protocol as a probe of visual world modeling, requiring image models to generate one image that depicts an initial state, action onset, transition state, and final state while obeying temporal constraints and avoiding causal violations, with scores assigned by a structured VLM judge.
What carries the argument
Four-keyframe generation task with stage-wise state predicates, cross-frame temporal constraints, and forbidden causal violations, evaluated under a VLM-as-judge protocol.
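One way to picture this decomposition is as a small task record. The sketch below is illustrative only: the class name `KeyframeTask`, its field names, the stage identifiers, and the pouring scenario are all assumptions made here, not the paper's actual schema or data.

```python
from dataclasses import dataclass, field

# The four ordered key states named in the abstract (identifiers are assumed).
STAGES = ("initial_state", "action_onset", "transition_state", "final_state")


@dataclass
class KeyframeTask:
    """Hypothetical container for one four-keyframe scenario."""
    instruction: str
    # Stage-wise state predicates: what must be true inside each key state.
    stage_predicates: dict[str, list[str]]
    # Cross-frame temporal constraints: what must hold between key states.
    temporal_constraints: list[str] = field(default_factory=list)
    # Forbidden causal violations: what must never be observed.
    forbidden_violations: list[str] = field(default_factory=list)

    def problems(self) -> list[str]:
        """Return structural problems; an empty list means well-formed."""
        issues = []
        for stage in STAGES:
            if not self.stage_predicates.get(stage):
                issues.append(f"no predicates for stage '{stage}'")
        for stage in self.stage_predicates:
            if stage not in STAGES:
                issues.append(f"unknown stage '{stage}'")
        return issues


# Invented example scenario, for illustration only.
task = KeyframeTask(
    instruction="Pour water from a jug into an empty glass.",
    stage_predicates={
        "initial_state": ["glass is empty", "jug is upright"],
        "action_onset": ["jug is tilted toward the glass"],
        "transition_state": ["glass is partly filled"],
        "final_state": ["glass is full", "jug is upright"],
    },
    temporal_constraints=["water level in the glass never decreases"],
    forbidden_violations=["glass is full before the jug is tilted"],
)
print(task.problems())
```

Keeping the three kinds of criteria in separate fields mirrors the separation the protocol draws between per-state checks, between-state checks, and outright violations, so that each can be scored and reported independently.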
If this is right
- Models that score well on ImageTime could directly support storyboarding, step-by-step illustration, and reference-guided editing workflows.
- Diagnostic subscores isolate specific failure types such as identity drift or causal order violations across the four states.
- Progressive task hierarchy allows measurement of incremental improvements in temporal coherence without requiring full video generation.
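To make the idea of diagnostic subscores concrete, the sketch below turns a set of per-criterion judge verdicts into per-stage pass rates, a temporal pass rate, and a list of failure labels. The verdict layout, the function name `summarize`, the label strings, and the unweighted pass-rate aggregation are all assumptions for illustration; the source does not state how the paper computes its scores.

```python
def summarize(verdict):
    """Aggregate one image's judge verdicts (hypothetical schema).

    verdict["stage_predicates"]: {stage: [bool, ...]}, True = predicate satisfied
    verdict["temporal_constraints"]: [bool, ...], True = constraint satisfied
    verdict["violations"]: {label: bool}, True = violation observed
    """
    def rate(flags):
        # Fraction of satisfied checks; None when there is nothing to check.
        return sum(flags) / len(flags) if flags else None

    subscores = {
        f"stage/{name}": rate(flags)
        for name, flags in verdict["stage_predicates"].items()
    }
    subscores["temporal"] = rate(verdict["temporal_constraints"])
    failure_labels = sorted(
        label for label, seen in verdict["violations"].items() if seen
    )
    return subscores, failure_labels


# Invented verdicts for a single generated image.
example = {
    "stage_predicates": {
        "initial_state": [True, True],
        "final_state": [True, False],
    },
    "temporal_constraints": [True, False, True, True],
    "violations": {"identity_drift": True, "effect_before_cause": False},
}
subscores, failure_labels = summarize(example)
print(subscores)
print(failure_labels)
```

Reporting the stage rates separately from the failure labels is what lets a low score be traced to a specific state or a specific violation type rather than to a single opaque number.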
Where Pith is reading between the lines
- The benchmark could be adapted to test whether models improve when given explicit causal chain instructions rather than single action prompts.
- Results may inform whether reference images help more with identity preservation than with transition logic.
Load-bearing premise
The GPT-5.5 VLM-as-judge protocol produces reliable, unbiased scores for spatiotemporal consistency and causal violations.
What would settle it
A systematic comparison showing frequent disagreement between human judges and GPT-5.5 scores on whether a generated image violates causal order or identity preservation would falsify the evaluation method.
Original abstract
Image generation models now produce high-quality static images, yet their ability to represent how a visual world changes over time remains poorly understood. Practical workflows such as storyboarding, step-by-step illustration, reference-guided editing, and video previsualization require models to preserve identities, objects, spatial relations, and causal order across multiple visual states. Existing evaluations largely measure single-image correctness, compositional alignment, or video quality, leaving open whether an image model can coherently imagine a temporally ordered process. We introduce ImageTime, a diagnostic benchmark that uses spatiotemporal consistency as a behavioral probe of visual world modeling in image generation. Given an action instruction, and optionally a reference image specifying the initial state, a model must generate one image containing four ordered key states: initial state, action onset, transition state, and final state. This four-keyframe protocol is more temporally demanding than single-image generation while avoiding the confounds of dense video dynamics. ImageTime organizes tasks with a progressive capability hierarchy and decomposes each scenario into stage-wise state predicates, cross-frame temporal constraints, and forbidden causal violations. GPT-5.5 scores all generated images under a structured VLM-as-judge protocol, producing interpretable capability scores, diagnostic subscores, and failure labels. Through multi-family benchmarking, ImageTime reveals where current image generation systems succeed, fail, and drift when asked to maintain coherent visual world states over time.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces ImageTime, a diagnostic benchmark for image generation models that requires generating a single image containing four ordered key states (initial state, action onset, transition state, final state) for a given action instruction (optionally with a reference image). Tasks are organized by a progressive capability hierarchy and decomposed into stage-wise predicates, temporal constraints, and forbidden violations; all outputs are scored for spatiotemporal consistency and causal violations via a structured GPT-5.5 VLM-as-judge protocol, with the goal of revealing success, failure, and drift patterns across model families in visual world modeling.
Significance. If the VLM-as-judge protocol proves reliable, ImageTime would address a genuine gap between single-image metrics and video evaluation by providing a lightweight yet temporally structured probe of identity preservation, spatial relations, and causal order. The four-keyframe format and explicit decomposition into predicates/constraints are methodologically sound ideas that could yield interpretable capability profiles.
major comments (1)
- [Abstract / VLM-as-judge protocol] Abstract (and the VLM-as-judge protocol description): the central claim that the benchmark 'reveals where current image generation systems succeed, fail, and drift' rests entirely on GPT-5.5 producing reliable scores and failure labels, yet no human calibration, inter-annotator agreement, or ablation against alternative judges is reported. This is load-bearing; without such validation the multi-family profiles risk being artifacts of the judge rather than measurements of the image models.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. The major comment on VLM-as-judge validation is addressed below; we agree it is load-bearing and will strengthen the paper accordingly.
Point-by-point responses
- Referee: [Abstract / VLM-as-judge protocol] Abstract (and the VLM-as-judge protocol description): the central claim that the benchmark 'reveals where current image generation systems succeed, fail, and drift' rests entirely on GPT-5.5 producing reliable scores and failure labels, yet no human calibration, inter-annotator agreement, or ablation against alternative judges is reported. This is load-bearing; without such validation the multi-family profiles risk being artifacts of the judge rather than measurements of the image models.
  Authors: We agree that the absence of reported validation for the GPT-5.5 judge is a genuine limitation, as the benchmark's diagnostic claims depend on judge reliability. The manuscript describes the structured prompting protocol but does not include human calibration, agreement metrics, or judge ablations. In the revised version we will add a dedicated validation subsection: (1) human annotators will score a stratified sample of 200 images using the same predicate/constraint criteria, (2) we will report inter-annotator agreement (Cohen's kappa) among humans and between humans and GPT-5.5, and (3) we will ablate against an alternative judge (Claude-3.5-Sonnet) on the same sample. The abstract and methods will be updated to reference these results. This directly addresses the risk of judge artifacts.
  Revision: yes
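For readers unfamiliar with the agreement statistic named in the rebuttal, Cohen's kappa compares the observed agreement between two raters with the agreement expected by chance from their label frequencies: kappa = (p_o - p_e) / (1 - p_e). The sketch below computes it for two label lists using only the standard library. The function name and the pass/fail labels are assumptions for illustration, and the data are invented, not results from the paper.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two equal-length sequences of categorical labels."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(rater_a)
    # Observed agreement: fraction of items where both raters gave the same label.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of the raters' marginal label frequencies.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    if expected == 1.0:
        # Both raters used one identical label throughout; kappa is undefined.
        return float("nan")
    return (observed - expected) / (1 - expected)


# Invented toy labels: a human annotator versus an automated judge.
human = ["pass", "pass", "fail", "fail"]
judge = ["pass", "fail", "fail", "fail"]
kappa = cohens_kappa(human, judge)
print(kappa)
```

In this toy case the raters agree on 3 of 4 items (p_o = 0.75) while chance agreement is p_e = 0.5, so kappa is 0.5. Raw agreement alone would overstate reliability whenever one label dominates, which is why a chance-corrected statistic is the more informative choice for judge validation.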
Circularity Check
No significant circularity detected
Full rationale
The paper introduces a new benchmark (ImageTime) and an evaluation protocol using a VLM-as-judge without any mathematical derivation chain, fitted parameters, or equations that reduce to inputs by construction. No self-definitional steps, fitted-input predictions, load-bearing self-citations, uniqueness theorems, or ansatzes are present in the abstract or described protocol. The central claim rests on empirical multi-family benchmarking rather than self-referential reductions, making the work self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: VLM-as-judge (GPT-5.5) can accurately and consistently score spatiotemporal consistency, state predicates, and causal violations.