Visual Generation in the New Era: An Evolution from Atomic Mapping to Agentic World Modeling
Pith reviewed 2026-05-07 06:33 UTC · model grok-4.3
The pith
Visual generation must advance from basic image synthesis to agentic world modeling that respects structure, dynamics, and causal relations.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper establishes that visual generation evolves along five distinct capability levels—Atomic Generation, Conditional Generation, In-Context Generation, Agentic Generation, and World-Modeling Generation—moving from passive renderers to interactive, agentic, world-aware generators. Progress requires integrating techniques like flow matching, unified understanding-generation architectures, improved representations, reward modeling, and synthetic data, while current evaluations systematically overestimate advancement by overlooking structural, temporal, and causal errors revealed through stress tests and expert case studies.
What carries the argument
A five-level taxonomy that classifies visual generation systems by increasing degrees of intelligence, from basic atomic mapping of pixels to full agentic world modeling that maintains persistent states, causal relations, and interactive dynamics.
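As a rough illustration of how the taxonomy orders systems, the five levels can be written as an ordered enumeration. This is only a sketch: the level names come from the paper, but the integer values, the `GenerationLevel` class, and the `levels_between` helper are assumptions introduced here to show the ordering, not anything the paper defines.

```python
from enum import IntEnum


class GenerationLevel(IntEnum):
    """The five capability levels named in the paper, lowest to highest.

    The integer values are an assumption: they encode order only,
    not any measured distance between levels.
    """
    ATOMIC = 1
    CONDITIONAL = 2
    IN_CONTEXT = 3
    AGENTIC = 4
    WORLD_MODELING = 5


def levels_between(current: GenerationLevel, target: GenerationLevel) -> list[GenerationLevel]:
    """Hypothetical helper: levels still to be reached when moving from current to target."""
    return [level for level in GenerationLevel if current < level <= target]


if __name__ == "__main__":
    # A system judged to be at the Conditional level, aiming for Agentic.
    for level in levels_between(GenerationLevel.CONDITIONAL, GenerationLevel.AGENTIC):
        print(level.name)
```

Using an ordered type makes comparisons such as `GenerationLevel.ATOMIC < GenerationLevel.WORLD_MODELING` explicit, which matches the page's reading of the taxonomy as a progression rather than a set of unrelated categories.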
If this is right
- Models will require unified architectures that jointly handle visual understanding and generation rather than separate pipelines.
- Training will shift emphasis toward data curation, synthetic data distillation, and reward modeling that penalize causal inconsistencies.
- Evaluation protocols must incorporate stress tests for spatial reasoning and long-term temporal coherence beyond single-frame quality.
- Sampling acceleration techniques will become essential to support real-time agentic interaction and iterative world updates.
- Post-training methods will focus on aligning outputs with domain knowledge and physical plausibility constraints.
Where Pith is reading between the lines
- Adopting the taxonomy could redirect funding and benchmarks away from single-image photorealism toward simulators that support planning and intervention.
- The same progression might apply to other modalities such as audio or 3D scene generation, suggesting a general pattern for multimodal world models.
- If world-modeling generation succeeds, downstream applications like robotics simulation and video game engines could merge with generative AI systems.
- The framework implies testable predictions about which technical drivers most accelerate movement between levels.
Load-bearing premise
That the five-level taxonomy correctly captures the necessary stages of evolution in the field, and that standard evaluations fail to detect structural, temporal, and causal shortcomings.
What would settle it
A benchmark suite that scores current top models on long-horizon causal consistency and persistent state tracking in generated scenes, where performance drops sharply compared with existing perceptual metrics.
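One way such a comparison might be scored is sketched below. It assumes each model has a perceptual-quality score and a causal-consistency score, both already normalised to [0, 1] with higher meaning better. The function names, the normalisation, and the 0.2 threshold are all hypothetical choices made for illustration; neither the paper nor this page specifies them.

```python
from statistics import mean


def overestimation_gap(perceptual: list[float], causal: list[float]) -> float:
    """Mean per-model difference (perceptual - causal).

    Both lists hold one score per model, normalised to [0, 1].
    A large positive value means perceptual metrics look better
    than causal-consistency results for the same models.
    """
    if len(perceptual) != len(causal) or not perceptual:
        raise ValueError("need two non-empty score lists of equal length")
    return mean(p - c for p, c in zip(perceptual, causal))


def drops_sharply(perceptual: list[float], causal: list[float], threshold: float = 0.2) -> bool:
    """True if the mean gap exceeds a threshold (0.2 is an arbitrary placeholder)."""
    return overestimation_gap(perceptual, causal) > threshold


if __name__ == "__main__":
    # Placeholder numbers for two imaginary models; not real benchmark results.
    perceptual_scores = [0.9, 0.8]
    causal_scores = [0.4, 0.5]
    print(overestimation_gap(perceptual_scores, causal_scores))
    print(drops_sharply(perceptual_scores, causal_scores))
```

A mean difference is the simplest possible summary. A real benchmark would more likely report per-model gaps and rank correlations, since an average can hide models for which the two kinds of score agree.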
Original abstract
Recent visual generation models have made major progress in photorealism, typography, instruction following, and interactive editing, yet they still struggle with spatial reasoning, persistent state, long-horizon consistency, and causal understanding. We argue that the field should move beyond appearance synthesis toward intelligent visual generation: plausible visuals grounded in structure, dynamics, domain knowledge, and causal relations. To frame this shift, we introduce a five-level taxonomy: Atomic Generation, Conditional Generation, In-Context Generation, Agentic Generation, and World-Modeling Generation, progressing from passive renderers to interactive, agentic, world-aware generators. We analyze key technical drivers, including flow matching, unified understanding-and-generation models, improved visual representations, post-training, reward modeling, data curation, synthetic data distillation, and sampling acceleration. We further show that current evaluations often overestimate progress by emphasizing perceptual quality while missing structural, temporal, and causal failures. By combining benchmark review, in-the-wild stress tests, and expert-constrained case studies, this roadmap offers a capability-centered lens for understanding, evaluating, and advancing the next generation of intelligent visual generation systems.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that recent visual generation models excel in photorealism, typography, instruction following, and interactive editing but continue to struggle with spatial reasoning, persistent state, long-horizon consistency, and causal understanding. It argues for a shift from appearance synthesis to intelligent visual generation grounded in structure, dynamics, domain knowledge, and causal relations. To frame this evolution, the manuscript introduces a five-level taxonomy progressing from Atomic Generation through Conditional, In-Context, and Agentic Generation to World-Modeling Generation. It analyzes technical drivers including flow matching, unified understanding-and-generation models, improved representations, post-training, reward modeling, data curation, synthetic data distillation, and sampling acceleration. The work further contends that current evaluations overestimate progress by prioritizing perceptual quality over structural, temporal, and causal failures, supported by a combination of benchmark review, in-the-wild stress tests, and expert-constrained case studies.
Significance. If the taxonomy and evaluation critique hold, this roadmap could meaningfully shape research priorities in computer vision by offering a capability-centered organizing framework that moves the field beyond passive rendering toward interactive, knowledge-aware systems. The synthesis of drivers such as flow matching and unified models provides practical guidance for practitioners, while the call for better evaluation practices could stimulate development of more diagnostic benchmarks. As a conceptual synthesis rather than an empirical contribution, its primary value lies in clarifying the progression toward agentic and world-modeling capabilities and highlighting gaps that current perceptual metrics miss.
major comments (2)
- [§3] §3 (Taxonomy definition): The boundary between Agentic Generation and World-Modeling Generation is not sharply delineated; both levels invoke interaction, domain knowledge, and causal relations, risking overlap that could undermine the taxonomy's utility as a precise organizing lens for classifying models or tracking progress.
- [§4] §4 (Evaluation section): The central claim that evaluations systematically overestimate progress by missing structural, temporal, and causal failures is supported only by a high-level description of benchmark review, stress tests, and case studies; without concrete failure statistics, quantitative comparisons to perceptual metrics, or tabulated examples of specific models' shortcomings, the critique remains qualitative and less load-bearing for the argument.
minor comments (3)
- [Abstract] Abstract: The phrase 'expert-constrained case studies' is introduced without clarifying the nature of the constraints or how expert input was incorporated, which could affect reproducibility and reader interpretation of the evaluation methodology.
- [Introduction] Introduction and §2: Several technical drivers (e.g., flow matching, reward modeling) are listed without brief one-sentence definitions or pointers to foundational references on first mention, reducing accessibility for readers less familiar with the sub-area.
- [References] References: The analysis of unified understanding-and-generation models would benefit from explicit citations to the specific works discussed, ensuring the literature synthesis is fully traceable.
Simulated Author's Rebuttal
We thank the referee for the careful reading and constructive feedback. We appreciate the recognition that the taxonomy and evaluation critique could help shape research priorities in visual generation. We address each major comment below and will revise the manuscript accordingly to improve clarity and evidentiary support.
Point-by-point responses
- Referee: [§3] §3 (Taxonomy definition): The boundary between Agentic Generation and World-Modeling Generation is not sharply delineated; both levels invoke interaction, domain knowledge, and causal relations, risking overlap that could undermine the taxonomy's utility as a precise organizing lens for classifying models or tracking progress.
  Authors: We thank the referee for identifying this potential source of overlap. We agree that sharper boundaries will strengthen the taxonomy's utility. In the revised manuscript, we will expand §3 with an explicit comparison table across five dimensions (interaction scope, knowledge integration, state persistence, causal reasoning depth, and evaluation focus) to differentiate the levels. Agentic Generation will be characterized as goal-directed interaction with external tools or environments for task completion, while World-Modeling Generation will be defined by the construction and maintenance of an internal, causally consistent simulation supporting prediction and long-horizon coherence. We will also add one or two additional model exemplars per level to illustrate the distinctions without altering the overall five-level progression.
  Revision: yes
- Referee: [§4] §4 (Evaluation section): The central claim that evaluations systematically overestimate progress by missing structural, temporal, and causal failures is supported only by a high-level description of benchmark review, stress tests, and case studies; without concrete failure statistics, quantitative comparisons to perceptual metrics, or tabulated examples of specific models' shortcomings, the critique remains qualitative and less load-bearing for the argument.
  Authors: We acknowledge that the evaluation critique would benefit from more concrete presentation. The manuscript already reviews specific benchmark limitations (e.g., reliance on FID and CLIP scores that overlook spatial and causal errors) and describes failure modes from in-the-wild tests and expert case studies. To make this evidence more load-bearing, we will insert a new summary table in the revised §4 that tabulates representative models, their reported perceptual scores, and the structural/temporal/causal shortcomings observed in the stress tests. This table will draw directly from the existing benchmark review and case studies rather than introducing new experiments, thereby preserving the paper's scope as a conceptual synthesis while improving traceability and impact of the argument.
  Revision: yes
Circularity Check
No circularity: taxonomy is an external organizing framework with no derivations or self-referential reductions
Full rationale
The paper is a survey and roadmap proposing a five-level taxonomy (Atomic Generation through World-Modeling Generation) to frame a shift toward intelligent visual generation. No equations, fitted parameters, predictions, or uniqueness theorems appear in the abstract or described content. The taxonomy functions as an interpretive lens rather than a result derived from internal definitions, self-citations, or data fits. Technical driver analysis and evaluation critiques rely on literature synthesis and stated benchmark reviews without reducing to the paper's own inputs by construction. This is a standard non-circular conceptual contribution.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: Recent visual generation models struggle with spatial reasoning, persistent state, long-horizon consistency, and causal understanding.
- domain assumption: Current evaluations often overestimate progress by emphasizing perceptual quality while missing structural, temporal, and causal failures.
invented entities (2)
- Atomic Generation: no independent evidence
- World-Modeling Generation: no independent evidence
Forward citations
Cited by 2 Pith papers
- Power Reinforcement Post-Training of Text-to-Image Models with Super-Linear Advantage Shaping
  Super-Linear Advantage Shaping (SLAS) introduces a non-linear geometric policy update for RL post-training of text-to-image models that reshapes the local policy space via advantage-dependent Fisher-Rao weighting to r...
- WorldReasonBench: Human-Aligned Stress Testing of Video Generators as Future World-State Predictors
  The paper presents WorldReasonBench, a benchmark that tests video generators on maintaining physical, social, logical, and informational consistency when predicting future states from initial conditions and actions.
Reference graph
Works this paper leans on
-
[1]
World Simulation with Video Foundation Models for Physical AI
Arslan Ali, Junjie Bai, Maciej Bala, Yogesh Balaji, Aaron Blakeman, Tiffany Cai, Jiaxin Cao, Tianshi Cao, Elizabeth Cha, Yu-Wei Chao, et al. World simulation with video foundation models for physical AI.arXiv preprint arXiv:2511.00062,
work page internal anchor Pith review arXiv
-
[2]
Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein gan, 2017.https://arxiv.org/abs/1701.07875. Artificial Analysis. Artificial analysis image arena.https://artificialanalysis.ai/text-to-image,
work page Pith review arXiv 2017
-
[3]
Leonardo Barcellona, Andrii Zadaianchuk, Davide Allegro, Samuele Papa, Stefano Ghidoni, and Efstratios Gavves. Dream to manipulate: Compositional world models empowering robot imitation learning with imagination.arXiv preprint arXiv:2412.14957,
-
[4]
Giulia Bertazzini, Daniele Baracchi, Dasara Shullani, Isao Echizen, and Alessandro Piva. Dragon: A large-scale dataset of realistic images generated by diffusion models.arXiv preprint arXiv:2505.11257, 2025a. Giulia Bertazzini, Daniele Baracchi, Dasara Shullani, Isao Echizen, and Alessandro Piva. Dragon: A large-scale dataset of realistic images generated...
-
[5]
Training Diffusion Models with Reinforcement Learning
Kevin Black, Michael Janner, Yilun Du, Ilya Kostrikov, and Sergey Levine. Training diffusion models with reinforcement learning.arXiv preprint arXiv:2305.13301,
work page internal anchor Pith review arXiv
-
[6]
Large Scale GAN Training for High Fidelity Natural Image Synthesis
Andrew Brock, Jeff Donahue, and Karen Simonyan. Large scale gan training for high fidelity natural image synthesis, 2019.https://arxiv.org/abs/1809.11096. Tim Brooks, Aleksander Holynski, and Alexei A. Efros. Instructpix2pix: Learning to follow image editing instructions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition...
work page internal anchor Pith review arXiv 2019
-
[7]
Z-Image: An Efficient Image Generation Foundation Model with Single-Stream Diffusion Transformer
Huanqia Cai, Sihan Cao, Ruoyi Du, Peng Gao, Steven Hoi, Zhaohui Hou, Shijie Huang, Dengyang Jiang, Xin Jin, Liangchen Li, et al. Z-image: An efficient image generation foundation model with single-stream diffusion transformer. arXiv preprint arXiv:2511.22699,
work page internal anchor Pith review arXiv
-
[8]
Hunyuanimage 3.0 technical report.arXiv preprint arXiv:2509.23951, 2025
Siyu Cao, Hangting Chen, Peng Chen, Yiji Cheng, Yutao Cui, Xinchi Deng, Ying Dong, Kipper Gong, Tianpeng Gu, Xiusen Gu, et al. Hunyuanimage 3.0 technical report.arXiv preprint arXiv:2509.23951,
-
[10]
Chameleon: Mixed-Modal Early-Fusion Foundation Models
https://arxiv.org/abs/2405.09818. 109 Di Chang, Mingdeng Cao, Yichun Shi, Bo Liu, Shengqu Cai, Shijie Zhou, Weilin Huang, Gordon Wetzstein, Mohammad Soleymani, and Peng Wang. Bytemorph: Benchmarking instruction-guided image editing with non-rigid motions. arXiv preprint arXiv:2506.03107, 2025a. Di Chang, Mingdeng Cao, Yichun Shi, Bo Liu, Shengqu Cai, Shij...
work page internal anchor Pith review arXiv
-
[11]
Large Video Planner Enables Generalizable Robot Control
Boyuan Chen, Tianyuan Zhang, Haoran Geng, Kiwhan Song, Caiyi Zhang, Peihao Li, William T Freeman, Jitendra Malik, Pieter Abbeel, Russ Tedrake, et al. Large video planner enables generalizable robot control.arXiv preprint arXiv:2512.15840, 2025a. Haoyu Chen, Xiaojie Xu, Wenbo Li, Jingjing Ren, Tian Ye, Songhua Liu, Ying-Cong Chen, Lei Zhu, and Xinchao Wang...
work page internal anchor Pith review arXiv 2025
-
[12]
Pengtao Chen, Mingzhu Shen, Peng Ye, Jianjian Cao, Chongjun Tu, Christos-Savvas Bouganis, Yiren Zhao, and Tao Chen. δ-dit: A training-free acceleration method tailored for diffusion transformers.arXiv preprint arXiv:2406.01125, 2024b. Ruoxi Chen, Dongping Chen, Siyuan Wu, Sinan Wang, Shiyun Lang, Peter Sushko, Gaoyang Jiang, Yao Wan, and Ranjay Krishna. M...
-
[13]
Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling
Xi Chen, Yan Duan, Rein Houthooft, John Schulman, Ilya Sutskever, and Pieter Abbeel. Infogan: Interpretable representation learning by information maximizing generative adversarial nets, 2016.https://arxiv.org/abs/1606. 03657. Xiaokang Chen, Zhiyu Wu, Xingchao Liu, Zizheng Pan, Wen Liu, Zhenda Xie, Xingkai Yu, and Chong Ruan. Janus-pro: Unified multimodal...
work page internal anchor Pith review arXiv 2016
-
[14]
PaddleOCR 3.0 Technical Report
Cheng Cui, Ting Sun, Manhui Lin, Tingquan Gao, Yubo Zhang, Jiaxuan Liu, Xueqing Wang, Zelun Zhang, Changda Zhou, Hongen Liu, Yue Zhang, Wenyu Lv, Kui Huang, Yichao Zhang, Jing Zhang, Jun Zhang, Yi Liu, Dianhai Yu, and Yanjun Ma. Paddleocr 3.0 technical report, 2025.https://arxiv.org/abs/2507.05595. Dima Damen, Hazel Doughty, Giovanni Maria Farinella, Sanj...
work page internal anchor Pith review arXiv 2025
-
[15]
Emerging Properties in Unified Multimodal Pretraining
https://arxiv.org/abs/2505.14683. Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In2009 IEEE conference on computer vision and pattern recognition, pages 248–255. Ieee,
work page internal anchor Pith review arXiv
-
[16]
Alara Dirik, Tuanfeng Wang, Duygu Ceylan, Stefanos Zafeiriou, and Anna Frühstück. Prism: A unified framework for photorealistic reconstruction and intrinsic scene modeling, 2025.https://arxiv.org/abs/2504.14219. Yilun Du, Sherry Yang, Bo Dai, Hanjun Dai, Ofir Nachum, Josh Tenenbaum, Dale Schuurmans, and Pieter Abbeel. Learning universal policies via text-...
-
[17]
and Barry Zhang
Erik S. and Barry Zhang. Building effective agents. Anthropic Engineering Blog, December 2024.https://www. anthropic.com/engineering/building-effective-agents. Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, Dustin Podell, Tim Dockhorn, Zion English, Kyle Lac...
2024
-
[18]
Scaling Rectified Flow Transformers for High-Resolution Image Synthesis
https://arxiv.org/abs/2403.03206. Jiacheng Fan, Zhiyue Zhao, Yiqian Zhang, Chao Chen, Peide Wang, Hengdi Zhang, and Zhengxue Cheng. Robopaint: From human demonstration to any robot and any view.arXiv preprint arXiv:2602.05325,
work page internal anchor Pith review arXiv
-
[19]
Tinyfusion: Diffusion transformers learned shallow
111 Gongfan Fang, Kunjun Li, Xinyin Ma, and Xinchao Wang. Tinyfusion: Diffusion transformers learned shallow. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 18144–18154, 2025a. HaoFang, ZechaoZhan, WeixinFeng, ZiweiHuang, XubinLi, andTiezhengGe. Tbstar-edit: Fromimageeditingpattern shifting to consistency enhancement.ArXiv...
-
[20]
One Step Diffusion via Shortcut Models
Kevin Frans, Danijar Hafner, Sergey Levine, and Pieter Abbeel. One step diffusion via shortcut models.arXiv preprint arXiv:2410.12557,
work page internal anchor Pith review arXiv
-
[21]
arXiv preprint arXiv:2506.01943 , year=
Xiao Fu, Xintao Wang, Xian Liu, Jianhong Bai, Runsen Xu, Pengfei Wan, Di Zhang, and Dahua Lin. Learning video generation for robotic manipulation with collaborative trajectory control.arXiv preprint arXiv:2506.01943,
-
[22]
An Image is Worth One Word: Personalizing Text-to-Image Generation using Textual Inversion
Rinon Gal, Yuval Alaluf, Yuval Atzmon, Or Patashnik, Amit H Bermano, Gal Chechik, and Daniel Cohen-Or. An image is worth one word: Personalizing text-to-image generation using textual inversion.arXiv preprint arXiv:2208.01618,
work page internal anchor Pith review arXiv
-
[23]
doi: 10.1109/TPAMI.2025.3610614. Yu Gao, Lixue Gong, Qiushan Guo, Xiaoxia Hou, Zhichao Lai, Fanshi Li, Liang Li, Xiaochen Lian, Chao Liao, Liyang Liu, et al. Seedream 3.0 technical report.arXiv preprint arXiv:2504.11346,
-
[24]
Seed-data-edit technical report: A hybrid dataset for instructional image editing
Yuying Ge, Sijie Zhao, Chen Li, Yixiao Ge, and Ying Shan. Seed-data-edit technical report: A hybrid dataset for instructional image editing.arXiv preprint arXiv:2405.04007,
-
[25]
Mean Flows for One-step Generative Modeling
Zhengyang Geng, Mingyang Deng, Xingjian Bai, J Zico Kolter, and Kaiming He. Mean flows for one-step generative modeling.arXiv preprint arXiv:2505.13447, 2025a. Zigang Geng, Yibing Wang, Yeyao Ma, Chen Li, Yongming Rao, Shuyang Gu, Zhao Zhong, Qinglin Lu, Han Hu, Xiaosong Zhang, Linus, Di Wang, and Jie Jiang. X-omni: Reinforcement learning makes discrete a...
work page internal anchor Pith review arXiv 2025
-
[26]
Generative Adversarial Networks
Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial networks, 2014.https://arxiv.org/abs/1406.2661. Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron Courville. Improved training of wasserstein gans, 2017.https://arxiv.org/abs/1...
work page internal anchor Pith review arXiv 2014
-
[27]
Jiaming Han, Hao Chen, Yang Zhao, Hanyu Wang, Qi Zhao, Ziyan Yang, Hao He, Xiangyu Yue, and Lu Jiang. Vision as a dialect: Unifying visual understanding and generation via text-aligned representations.arXiv preprint arXiv:2506.18898, 2025.https://arxiv.org/abs/2506.18898. Junjie He, Yifeng Geng, and Liefeng Bo. Uniportrait: A unified framework for identit...
-
[28]
arXiv preprint arXiv:2603.28088 , year=
Zefeng He, Siyuan Huang, Xiaoye Qu, Yafu Li, Tong Zhu, Yu Cheng, and Yang Yang. Gems: Agent-native multimodal generation with memory and skills.arXiv preprint arXiv:2603.28088,
-
[29]
Classifier-Free Diffusion Guidance
Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance.arXiv preprint arXiv:2207.12598, 2022.https: //arxiv.org/abs/2207.12598. Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models.Advances in neural information processing systems, 33:6840–6851,
work page internal anchor Pith review arXiv 2022
-
[30]
GAIA-1: A Generative World Model for Autonomous Driving
Anthony Hu, Lloyd Russell, Hudson Yeo, Zak Murez, George Fedoseev, Alex Kendall, Jamie Shotton, and Gianluca Corrado. GAIA-1: A generative world model for autonomous driving.arXiv preprint arXiv:2309.17080,
work page internal anchor Pith review arXiv
-
[31]
ELLA: Equip Diffusion Models with LLM for Enhanced Semantic Alignment
Xiwei Hu, Rui Wang, Yixiao Fang, Bin Fu, Pei Cheng, and Gang Yu. Ella: Equip diffusion models with llm for enhanced semantic alignment.arXiv preprint arXiv:2403.05135, 2024a. Yucheng Hu, Yanjiang Guo, Pengchao Wang, Xiaoyu Chen, Yen-Jen Wang, Jianke Zhang, Koushil Sreenath, Chaochao Lu, and Jianyu Chen. Video prediction policy: A generalist robot policy w...
work page internal anchor Pith review arXiv 2025
-
[32]
Yiwen Huang, Aaron Gokaslan, Volodymyr Kuleshov, and James Tompkin. The GAN is dead; long live the GAN! A modern GAN baseline.arXiv preprint arXiv:2501.05441, 2025a.https://arxiv.org/abs/2501.05441. Zehuan Huang, Yuan-Chen Guo, Haoran Wang, Ran Yi, Lizhuang Ma, Yan-Pei Cao, and Lu Sheng. Mv-adapter: Multi-view consistent image generation made easy. InProc...
-
[33]
Sam Ade Jacobs, Masahiro Tanaka, Chengming Zhang, Minjia Zhang, Shuaiwen Leon Song, Samyam Rajbhandari, and Yuxiong He. Deepspeed ulysses: System optimizations for enabling training of extreme long sequence transformer models.arXiv preprint arXiv:2309.14509,
work page internal anchor Pith review arXiv
-
[34]
Joel Jang, Seonghyeon Ye, Zongyu Lin, Jiannan Xiang, Johan Bjorck, Yu Fang, Fengyuan Hu, Spencer Huang, Kaushil Kundalia, Yen-Chen Lin, et al. Dreamgen: Unlocking generalization in robot learning through video world models. arXiv preprint arXiv:2505.12705,
-
[35]
COLE: A hierarchical generation framework for graphic design.arXiv preprint arXiv:2311.16974,
Peidong Jia, Chenxuan Li, Zeyu Liu, Yichao Shen, Xingru Chen, Yuhui Yuan, Yinglin Zheng, Dong Chen, Ji Li, Xiaodong Xie, et al. COLE: A hierarchical generation framework for graphic design.arXiv preprint arXiv:2311.16974,
-
[36]
Qifei Jia, Yu Liu, Yajie Chai, Xintong Yao, Qiming Lu, Yasen Zhang, Runyu Shi, Ying Huang, and Guoquan Zhang. Lego-edit: A general image editing framework with model-level bricks and mllm builder.ArXiv, abs/2509.12883, 2025.https://api.semanticscholar.org/CorpusID:281325583. Bowen Jiang, Yuan Yuan, Xinyi Bai, Zhuoqun Hao, Alyson Yin, Yaojie Hu, Wenyu Liao...
-
[37]
pdf, accessed 2026-04-22
Available athttps://joyai-image.s3.cn-north-1.jdcloud-oss.com/JoyAI-Image. pdf, accessed 2026-04-22. Minguk Kang, Jun-Yan Zhu, Richard Zhang, Jaesik Park, Eli Shechtman, Sylvain Paris, and Taesung Park. Scaling up gans for text-to-image synthesis. InCVPR,
2026
-
[38]
A style-based generator architecture for generative adversarial networks
Tero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture for generative adversarial networks, 2019.https://arxiv.org/abs/1812.04948. 114 Tero Karras, Miika Aittala, Timo Aila, and Samuli Laine. Elucidating the design space of diffusion-based generative models. InAdvances in Neural Information Processing Systems, volume 35,
-
[39]
Pushkal Katara, Zhou Xian, and Katerina Fragkiadaki. Gen2sim: Scaling up robot learning in simulation with generative models.arXiv preprint arXiv:2310.18308,
-
[40]
DROID: A Large-Scale In-The-Wild Robot Manipulation Dataset
Alexander Khazatsky, Karl Pertsch, Suraj Nair, Ashwin Balakrishna, Sudeep Dasari, Siddharth Karamcheti, Soroush Nasiriany, Mohan Kumar Srirama, Lawrence Yunliang Chen, Kirsty Ellis, et al. Droid: A large-scale in-the-wild robot manipulation dataset.arXiv preprint arXiv:2403.12945,
work page internal anchor Pith review arXiv
- [41]
-
[42]
FLUX.1 Kontext: Flow Matching for In-Context Image Generation and Editing in Latent Space
Black Forest Labs, Stephen Batifol, Andreas Blattmann, Frederic Boesel, Saksham Consul, Cyril Diagne, Tim Dockhorn, Jack English, Zion English, Patrick Esser, Sumith Kulal, Kyle Lacey, Yam Levi, Cheng Li, Dominik Lorenz, Jonas Müller, Dustin Podell, Robin Rombach, Harry Saini, Axel Sauer, and Luke Smith. Flux.1 kontext: Flow matching for in-context image ...
work page internal anchor Pith review arXiv 2025
-
[43]
arXiv preprint arXiv:2508.09976 (2025)
Marion Lepert, Jiaying Fang, and Jeannette Bohg. Masquerade: Learning from in-the-wild human videos using data-editing.arXiv preprint arXiv:2508.09976, 2025a. Marion Lepert, Jiaying Fang, and Jeannette Bohg. Phantom: Training robots without robots using only human videos. arXiv preprint arXiv:2503.00779, 2025b. Guangrun Li, Yaoxu Lyu, Zhuoyang Liu, Chengk...
-
[44]
Jarvisart: Liberating human artistic creativity via an intelligent photo retouching agent
Yunlong Lin, Zixu Lin, Kunjie Lin, Jinbin Bai, Panwang Pan, Chenxin Li, Haoyu Chen, Zhongdao Wang, Xinghao Ding, Wenbo Li, et al. Jarvisart: Liberating human artistic creativity via an intelligent photo retouching agent. arXiv preprint arXiv:2506.17612,
-
[45]
Yaron Lipman, Ricky T. Q. Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. Flow matching for generative modeling, 2023.https://arxiv.org/abs/2210.02747. Dongyang Liu, Shitian Zhao, Le Zhuo, Weifeng Lin, Yi Xin, Xinyue Li, Qi Qin, Yu Qiao, Hongsheng Li, and Peng Gao. Lumina-mgpt: Illuminate flexible photorealistic text-to-image generation with multimod...
work page internal anchor Pith review arXiv 2023
-
[46]
Simplifying, Stabilizing and Scaling Continuous-Time Consistency Models
Blog post. Cheng Lu and Yang Song. Simplifying, stabilizing and scaling continuous-time consistency models.arXiv preprint arXiv:2410.11081,
work page internal anchor Pith review arXiv
-
[47]
Cheng Lu, Yuhao Zhou, Fan Bao, Jianfei Chen, Chongxuan Li, and Jun Zhu
Cheng Lu, Yuhao Zhou, Fan Bao, Jianfei Chen, Chongxuan Li, and Jun Zhu. Dpm-solver: A fast ode solver for diffusion probabilistic model sampling in around 10 steps. InAdvances in Neural Information Processing Systems, volume 35, 2022a. Cheng Lu, Yuhao Zhou, Fan Bao, Jianfei Chen, Chongxuan Li, and Jun Zhu. Dpm-solver++: Fast solver for guided sampling of ...
-
[48]
https://api.semanticscholar.org/CorpusID: 279070661. Minxing Luo, Zixun Xia, Liaojun Chen, Zhenhang Li, Weichao Zeng, Jianye Wang, Wentao Cheng, Yaxing Wang, Yu ZHOU, and Jian Yang. Beyond flat text: Dual self-inherited guidance for visual text generation.ArXiv, abs/2501.05892, 2025a.https://api.semanticscholar.org/CorpusID:275458598. Xin Luo, Jiahao Wang...
-
[49]
117 Jian Ma, Yonglin Deng, Chen Chen, Nanyang Du, Haonan Lu, and Zhenyu Yang. Glyphdraw2: Automatic generation of complex glyph posters with diffusion models and large language models. InProceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 5955–5963, 2025a. Jiancang Ma, Xujie Zhu, Zihao Pan, Qirong Peng, Xu Guo, Chen Chen, and H...
-
[50]
Qi Ma, Yue Li, Bin Ren, Nicu Sebe, Ender Konukoglu, Theo Gevers, Luc Van Gool, and Danda Pani Paudel. Shapesplat: A large-scale dataset of gaussian splats and their self-supervised pretraining.arXiv preprint arXiv:2408.10906, 2024b. Qianli Ma, Yaowei Zheng, Zhelun Shi, Zhongkai Zhao, Bin Jia, Ziyue Huang, Zhiqi Lin, Youjie Li, Jiacheng Yang, Yanghua Peng,...
-
[51]
arXiv preprint arXiv:2508.15772 (2025)
Qingyang Mao, Qi Cai, Yehao Li, Yingwei Pan, Mingyue Cheng, Ting Yao, Qi Liu, and Tao Mei. Visual autoregressive modeling for instruction-guided image editing.ArXiv, abs/2508.15772, 2025.https://api.semanticscholar.org/ CorpusID:280700028. Xudong Mao, Qing Li, Haoran Xie, Raymond Y. K. Lau, Zhen Wang, and Stephen Paul Smolley. Least squares generative adv...
-
[52]
arXiv preprint arXiv:2601.07823 , year=
Zhiting Mei, Tenny Yin, Ola Shorinwa, Apurva Badithela, Zhonghe Zheng, Joseph Bruno, Madison Bland, Lihan Zha, Asher Hancock, Jaime Fernández Fisac, Philip Dames, and Anirudha Majumdar. Video generation models in robotics: Applications, research challenges, future directions.arXiv preprint arXiv:2601.07823,
-
[53]
Fanqing Meng, Wenqi Shao, Lixin Luo, Yahong Wang, Yiran Chen, Quanfeng Lu, Yue Yang, Tianshuo Yang, Kaipeng Zhang, Yu Qiao, et al. Phybench: A physical commonsense benchmark for evaluating text-to-image models.arXiv preprint arXiv:2406.11802,
-
[54]
Spectral Normalization for Generative Adversarial Networks
Takeru Miyato, Toshiki Kataoka, Masanori Koyama, and Yuichi Yoshida. Spectral normalization for generative adversarial networks, 2018.https://arxiv.org/abs/1802.05957. 118 MMagic Contributors. MMagic: OpenMMLab multimodal advanced, generative, and intelligent creation toolbox. https://github.com/open-mmlab/mmagic,
work page Pith review arXiv 2018
-
[55]
Chong Mou, Yanze Wu, Wenxu Wu, Zinan Guo, Pengze Zhang, Yufeng Cheng, Yiming Luo, Fei Ding, Shiwen Zhang, Xinghui Li, et al
Open-source repository. Chong Mou, Yanze Wu, Wenxu Wu, Zinan Guo, Pengze Zhang, Yufeng Cheng, Yiming Luo, Fei Ding, Shiwen Zhang, Xinghui Li, et al. Dreamo: A unified framework for image customization. InProceedings of the SIGGRAPH Asia 2025 Conference Papers, pages 1–12,
2025
-
[56]
Transition matching distillation for fast video generation.arXiv preprint arXiv:2601.09881, 2026
Weili Nie, Julius Berner, Nanye Ma, Chao Liu, Saining Xie, and Arash Vahdat. Transition matching distillation for fast video generation.arXiv preprint arXiv:2601.09881,
-
[57]
Yuwei Niu, Munan Ning, Mengren Zheng, Weiyang Jin, Bin Lin, Peng Jin, Jiaqi Liao, Chaoran Feng, Kunpeng Ning, Bin Zhu, et al. Wise: A world knowledge-informed semantic evaluation for text-to-image generation. arXiv preprint arXiv:2503.07265,
-
[58]
Sajjad Pakdamansavoji, Mozhgan Pourkeshavarz, Adam Sigal, Zhiyuan Li, Rui Heng Yang, and Amir Rasouli. Improving robotic manipulation robustness via NICE scene surgery. arXiv preprint arXiv:2511.22777, 2025a.
Sajjad Pakdamansavoji, Mozhgan Pourkeshavarz, Adam Sigal, Zhiyuan Li, Rui Heng Yang, and Amir Rasouli. Improving robotic manipulation robustness via ...
-
[59]
Scalable Diffusion Models with Transformers
Available at https://deepmind.google/blog/genie-2-a-large-scale-foundation-world-model/.
William Peebles and Saining Xie. Scalable diffusion models with transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2023. https://arxiv.org/abs/2212.09748.
Xu Peng, Junwei Zhu, Boyuan Jiang, Ying Tai, Donghao Luo, Jiangning Zh...
-
[60]
Bowen Ping, Chengyou Jia, Minnan Luo, Hangwei Qian, and Ivor Tsang. Flow-factory: A unified framework for reinforcement learning in flow-matching models. arXiv preprint arXiv:2602.12529,
-
[61]
Mihir Prabhudesai, Anirudh Goyal, Deepak Pathak, and Katerina Fragkiadaki. Aligning text-to-image diffusion models with reward backpropagation. arXiv preprint arXiv:2310.03739,
-
[62]
Yifan Pu, Yiming Zhao, Zhicong Tang, Ruihong Yin, Haoxing Ye, Yuhui Yuan, Dong Chen, Jianmin Bao, Sirui Zhang, Yanbin Wang, Lin Liang, Lijuan Wang, Ji Li, Xiu Li, Zhouhui Lian, Gao Huang, and Baining Guo. Art: Anonymous region transformer for variable multi-layer transparent image generation. arXiv preprint arXiv:2502.18364, 2025a.
Yifan Pu, Yiming Zha...
-
[63]
Liao Qu, Huichao Zhang, Yiheng Liu, Xu Wang, Yi Jiang, Yiming Gao, Hu Ye, Daniel K. Du, Zehuan Yuan, and Xinglong Wu. TokenFlow: Unified image tokenizer for multimodal understanding and generation. arXiv preprint arXiv:2412.03069, 2024. https://arxiv.org/abs/2412.03069.
Julian Quevedo, Quinn McIntyre, Spruce Campbell, Xinlei Chen, and Robert Wachen. Oasis: ...
-
[64]
Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks
Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks, 2016. https://arxiv.org/abs/1511.06434.
Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, and Ilya Sutskever. Zero-shot text-to-image generation. In International conference...
-
[65]
High-Resolution Image Synthesis with Latent Diffusion Models
Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models, 2022. https://arxiv.org/abs/2112.10752.
Nataniel Ruiz, Yuanzhen Li, Varun Jampani, Yael Pritch, Michael Rubinstein, and Kfir Aberman. Dreambooth: Fine tuning text-to-image diffusion models for subject-driven genera...
-
[66]
Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding
Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily Denton, Seyed Kamyar Seyed Ghasemipour, Burcu Karagol Ayan, S. Sara Mahdavi, Rapha Gontijo Lopes, Tim Salimans, Jonathan Ho, David J Fleet, and Mohammad Norouzi. Photorealistic text-to-image diffusion models with deep language understanding, 2022. https://arxiv.org/abs/2205.11487.
Tim...
-
[67]
https://zenodo.org/records/17344183.
Samin Mahdizadeh Sani, Max Ku, Nima Jamali, Matina Mahdizadeh Sani, Paria Khoshtab, Wei-Chieh Sun, Parnian Fazel, Zhi Rui Tam, Thomas Chong, Edisy Kin Wai Chan, et al. Imagenworld: Stress-testing image generation models with explainable human evaluation on open-ended real-world tasks. In The Fourteenth International Con...
-
[68]
Proximal Policy Optimization Algorithms
John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347,
-
[69]
Johannes Schusterbauer, Ming Gui, Frank Fundel, and Björn Ommer. Diff2flow: Training flow matching models via diffusion model alignment, 2025. https://arxiv.org/abs/2506.02221.
Team Seedream, Yunpeng Chen, Yu Gao, Lixue Gong, Meng Guo, Qiushan Guo, Zhiyao Guo, Xiaoxia Hou, Weilin Huang, Yixuan Huang, et al. Seedream 4.0: Toward next-generation multimod...
-
[70]
Post-training quantization on diffusion models
Yuzhang Shang, Zhihang Yuan, Bin Xie, Bingzhe Wu, and Yan Yan. Post-training quantization on diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 1972–1981,
-
[71]
DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models
Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300,
-
[72]
Fei Shen, Xiaoyu Du, Yutong Gao, Jian Yu, Yushe Cao, Xing Lei, and Jinhui Tang. Imagharmony: Controllable image editing with consistent object quantity and layout. ArXiv, abs/2506.01949, 2025a. https://api.semanticscholar.org/CorpusID:279119734.
Yang Shen, Xiu-Shen Wei, Yifan Sun, Yuxin Song, Tao Yuan, Jian Jin, Heyang Xu, Yazhou Yao, and Errui Ding. Expla...
-
[73]
Yichao Shen, Fangyun Wei, Zhiying Du, Yaobo Liang, Yan Lu, Jiaolong Yang, Nanning Zheng, and Baining Guo. Videovla: Video generators can be generalizable robot manipulators. arXiv preprint arXiv:2512.06963, 2025b.
Jing Shi, Wei Xiong, Zhe Lin, and Hyun Joon Jung. Instantbooth: Personalized text-to-image generation without test-time finetuning. In Proceeding...
-
[74]
Minglei Shi, Haolin Wang, Wenzhao Zheng, Ziyang Yuan, Xiaoshi Wu, Xintao Wang, Pengfei Wan, Jie Zhou, and Jiwen Lu. Latent diffusion model without variational autoencoder. arXiv preprint arXiv:2510.15301, 2025a. https://arxiv.org/abs/2510.15301.
Wenda Shi, Yiren Song, Zihan Rao, Dengming Zhang, Jiaming Liu, and Xingxing Zou. Wordcon: Word-level typography ...
-
[75]
Shivam Singh, Yiming Chen, Agneet Chatterjee, Amit Raj, James Hays, Yezhou Yang, and Chitta Baral. Chimera: Compositional image generation using part-based concepting, 2025. https://arxiv.org/abs/2510.18083.
Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. In International Conference on Learning Representations, 2021a.
Yan...
-
[76]
Yiren Song, Cheng Liu, Weijia Mao, and Mike Zheng Shou. Mitty: Diffusion-based human-to-robot video generation. arXiv preprint arXiv:2512.17253, 2025a.
Yuxin Song, Wenkai Dong, Shizun Wang, Qi Zhang, Song Xue, Tao Yuan, Hu Yang, Haocheng Feng, Hang Zhou, Xinyan Xiao, et al. Query-kontext: An unified multimodal model for image generation and editing. arXiv ...
-
[78]
Autoregressive Model Beats Diffusion: Llama for Scalable Image Generation
https://arxiv.org/abs/2406.06525.
Peter Sushko, Ayana Bharadwaj, Zhi Yang Lim, Vasily Ilin, Ben Caffee, Dongping Chen, Mohammadreza Salehi, Cheng-Yu Hsieh, and Ranjay Krishna. Realedit: Reddit edits as a large-scale empirical dataset for image transformations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), p...
-
[79]
Mohammad Reza Taesiri, Brandon Collins, Logan Bolton, Viet Dac Lai, Franck Dernoncourt, Trung Bui, and Anh Totti Nguyen. Understanding generative ai capabilities in everyday image editing tasks. arXiv preprint arXiv:2505.16181,
-
[80]
Jiale Tao, Yanbing Zhang, Qixun Wang, Yiji Cheng, Haofan Wang, Xu Bai, Zhengguang Zhou, Ruihuang Li, Linqing Wang, Chunyu Wang, Qin Lin, and Qinglin Lu. Instantcharacter: Personalize any characters with a scalable diffusion transformer framework. ArXiv, abs/2504.12395, 2025a. https://api.semanticscholar.org/CorpusID:277856764.
Tang Tao, Likui Zhang, Youpen...
-
[81]
Dani Valevski, Yaniv Leviathan, Moab Arar, and Shlomi Fruchter. Diffusion models are real-time game engines. arXiv preprint arXiv:2408.14837, 2024.
-
[82]
DataVisT5: A pre-trained language model for jointly understanding text and data visualization
Zhuoyue Wan, Yuanfeng Song, Shuaimin Li, Chen Jason Zhang, and Raymond Chi-Wing Wong. DataVisT5: A pre-trained language model for jointly understanding text and data visualization. In 41st IEEE International Conference on Data Engineering, ICDE 2025, pages 1704–1717. IEEE, 2025.