pith. machine review for the scientific record.

arxiv: 2604.28185 · v1 · submitted 2026-04-30 · 💻 cs.CV

Recognition: unknown

Visual Generation in the New Era: An Evolution from Atomic Mapping to Agentic World Modeling

Pith reviewed 2026-05-07 06:33 UTC · model grok-4.3

classification 💻 cs.CV
keywords visual generation · world modeling · agentic AI · generative models · taxonomy · evaluation benchmarks · diffusion models · causal reasoning

The pith

Visual generation must advance from basic image synthesis to agentic world modeling that respects structure, dynamics, and causal relations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Recent visual generation models produce photorealistic outputs but lack reliable spatial reasoning, long-term consistency, and causal understanding. The paper argues the field needs to shift toward intelligent generation where outputs are grounded in physical structure, temporal dynamics, domain knowledge, and cause-effect relations rather than appearance alone. To guide this shift, it introduces a five-level taxonomy that progresses from simple atomic mapping through conditional, in-context, and agentic stages up to full world-modeling generation. The authors review technical drivers such as flow matching and unified models while critiquing how standard benchmarks emphasize perceptual quality and miss deeper failures. This roadmap aims to reorient research toward systems that can act as interactive world simulators.

Core claim

The paper establishes that visual generation evolves along five distinct capability levels—Atomic Generation, Conditional Generation, In-Context Generation, Agentic Generation, and World-Modeling Generation—moving from passive renderers to interactive, agentic, world-aware generators. Progress requires integrating techniques like flow matching, unified understanding-generation architectures, improved representations, reward modeling, and synthetic data, while current evaluations systematically overestimate advancement by overlooking structural, temporal, and causal errors revealed through stress tests and expert case studies.

What carries the argument

A five-level taxonomy that classifies visual generation systems by increasing degrees of intelligence, from basic atomic mapping of pixels to full agentic world modeling that maintains persistent states, causal relations, and interactive dynamics.
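
As a minimal sketch, the ordering of the five levels can be written as an ordered enumeration. The level names come from the paper; the integer values 1 to 5, the class name `GenerationLevel`, and the helper `levels_to_reach` are illustrative assumptions, not anything the authors define.

```python
from enum import IntEnum


class GenerationLevel(IntEnum):
    """The five capability levels named in the paper, lowest to highest.

    The numeric values are an assumed encoding of the ordering only.
    """
    ATOMIC = 1
    CONDITIONAL = 2
    IN_CONTEXT = 3
    AGENTIC = 4
    WORLD_MODELING = 5


def levels_to_reach(current: GenerationLevel,
                    target: GenerationLevel) -> list[GenerationLevel]:
    """Return the levels strictly above `current`, up to and including `target`."""
    return [lvl for lvl in GenerationLevel if current < lvl <= target]
```

Using `IntEnum` makes comparisons such as `GenerationLevel.ATOMIC < GenerationLevel.AGENTIC` valid, which matches the paper's framing of the levels as a progression rather than unordered categories.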

If this is right

  • Models will require unified architectures that jointly handle visual understanding and generation rather than separate pipelines.
  • Training will shift emphasis toward data curation, synthetic data distillation, and reward modeling that penalize causal inconsistencies.
  • Evaluation protocols must incorporate stress tests for spatial reasoning and long-term temporal coherence beyond single-frame quality.
  • Sampling acceleration techniques will become essential to support real-time agentic interaction and iterative world updates.
  • Post-training methods will focus on aligning outputs with domain knowledge and physical plausibility constraints.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Adopting the taxonomy could redirect funding and benchmarks away from single-image photorealism toward simulators that support planning and intervention.
  • The same progression might apply to other modalities such as audio or 3D scene generation, suggesting a general pattern for multimodal world models.
  • If world-modeling generation succeeds, downstream applications like robotics simulation and video game engines could merge with generative AI systems.
  • The framework implies testable predictions about which technical drivers most accelerate movement between levels.

Load-bearing premise

That the five-level taxonomy correctly captures the necessary stages of evolution in the field, and that standard evaluations fail to detect structural, temporal, and causal shortcomings.

What would settle it

A benchmark suite that scores current top models on long-horizon causal consistency and persistent state tracking in generated scenes, where performance drops sharply compared with existing perceptual metrics.
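
One way such a comparison could be scored is sketched below. This assumes both scores are normalized to [0, 1]. The record type `ModelScores`, the 0.3 threshold for a "sharp" drop, and the two placeholder entries are all hypothetical; none of the numbers are results from the paper or from any real model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelScores:
    name: str
    perceptual: float  # hypothetical perceptual-quality score in [0, 1]
    causal: float      # hypothetical long-horizon causal-consistency score in [0, 1]


def relative_drop(s: ModelScores) -> float:
    """Fraction of the perceptual score that is lost on the causal benchmark."""
    if s.perceptual <= 0:
        raise ValueError("perceptual score must be positive")
    return (s.perceptual - s.causal) / s.perceptual


def sharp_drops(scores: list[ModelScores], threshold: float = 0.3) -> list[str]:
    """Names of models whose relative drop meets the (assumed) threshold."""
    return [s.name for s in scores if relative_drop(s) >= threshold]


# Placeholder values for illustration only; not measurements.
placeholder = [
    ModelScores("model_a", perceptual=0.90, causal=0.45),
    ModelScores("model_b", perceptual=0.80, causal=0.72),
]
print(sharp_drops(placeholder))
```

If most top models landed in the `sharp_drops` list on a real benchmark, that would support the paper's claim that perceptual metrics overstate progress; if few did, it would weaken it.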

original abstract

Recent visual generation models have made major progress in photorealism, typography, instruction following, and interactive editing, yet they still struggle with spatial reasoning, persistent state, long-horizon consistency, and causal understanding. We argue that the field should move beyond appearance synthesis toward intelligent visual generation: plausible visuals grounded in structure, dynamics, domain knowledge, and causal relations. To frame this shift, we introduce a five-level taxonomy: Atomic Generation, Conditional Generation, In-Context Generation, Agentic Generation, and World-Modeling Generation, progressing from passive renderers to interactive, agentic, world-aware generators. We analyze key technical drivers, including flow matching, unified understanding-and-generation models, improved visual representations, post-training, reward modeling, data curation, synthetic data distillation, and sampling acceleration. We further show that current evaluations often overestimate progress by emphasizing perceptual quality while missing structural, temporal, and causal failures. By combining benchmark review, in-the-wild stress tests, and expert-constrained case studies, this roadmap offers a capability-centered lens for understanding, evaluating, and advancing the next generation of intelligent visual generation systems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 3 minor

Summary. The paper claims that recent visual generation models excel in photorealism, typography, instruction following, and interactive editing but continue to struggle with spatial reasoning, persistent state, long-horizon consistency, and causal understanding. It argues for a shift from appearance synthesis to intelligent visual generation grounded in structure, dynamics, domain knowledge, and causal relations. To frame this evolution, the manuscript introduces a five-level taxonomy progressing from Atomic Generation through Conditional, In-Context, and Agentic Generation to World-Modeling Generation. It analyzes technical drivers including flow matching, unified understanding-and-generation models, improved representations, post-training, reward modeling, data curation, synthetic data distillation, and sampling acceleration. The work further contends that current evaluations overestimate progress by prioritizing perceptual quality over structural, temporal, and causal failures, supported by a combination of benchmark review, in-the-wild stress tests, and expert-constrained case studies.

Significance. If the taxonomy and evaluation critique hold, this roadmap could meaningfully shape research priorities in computer vision by offering a capability-centered organizing framework that moves the field beyond passive rendering toward interactive, knowledge-aware systems. The synthesis of drivers such as flow matching and unified models provides practical guidance for practitioners, while the call for better evaluation practices could stimulate development of more diagnostic benchmarks. As a conceptual synthesis rather than an empirical contribution, its primary value lies in clarifying the progression toward agentic and world-modeling capabilities and highlighting gaps that current perceptual metrics miss.

major comments (2)
  1. [§3] §3 (Taxonomy definition): The boundary between Agentic Generation and World-Modeling Generation is not sharply delineated; both levels invoke interaction, domain knowledge, and causal relations, risking overlap that could undermine the taxonomy's utility as a precise organizing lens for classifying models or tracking progress.
  2. [§4] §4 (Evaluation section): The central claim that evaluations systematically overestimate progress by missing structural, temporal, and causal failures is supported only by a high-level description of benchmark review, stress tests, and case studies; without concrete failure statistics, quantitative comparisons to perceptual metrics, or tabulated examples of specific models' shortcomings, the critique remains qualitative and less load-bearing for the argument.
minor comments (3)
  1. [Abstract] Abstract: The phrase 'expert-constrained case studies' is introduced without clarifying the nature of the constraints or how expert input was incorporated, which could affect reproducibility and reader interpretation of the evaluation methodology.
  2. [Introduction] Introduction and §2: Several technical drivers (e.g., flow matching, reward modeling) are listed without brief one-sentence definitions or pointers to foundational references on first mention, reducing accessibility for readers less familiar with the sub-area.
  3. [References] References: The analysis of unified understanding-and-generation models would benefit from explicit citations to the specific works discussed, ensuring the literature synthesis is fully traceable.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful reading and constructive feedback. We appreciate the recognition that the taxonomy and evaluation critique could help shape research priorities in visual generation. We address each major comment below and will revise the manuscript accordingly to improve clarity and evidentiary support.

point-by-point responses
  1. Referee: [§3] §3 (Taxonomy definition): The boundary between Agentic Generation and World-Modeling Generation is not sharply delineated; both levels invoke interaction, domain knowledge, and causal relations, risking overlap that could undermine the taxonomy's utility as a precise organizing lens for classifying models or tracking progress.

    Authors: We thank the referee for identifying this potential source of overlap. We agree that sharper boundaries will strengthen the taxonomy's utility. In the revised manuscript, we will expand §3 with an explicit comparison table across five dimensions (interaction scope, knowledge integration, state persistence, causal reasoning depth, and evaluation focus) to differentiate the levels. Agentic Generation will be characterized as goal-directed interaction with external tools or environments for task completion, while World-Modeling Generation will be defined by the construction and maintenance of an internal, causally consistent simulation supporting prediction and long-horizon coherence. We will also add one or two additional model exemplars per level to illustrate the distinctions without altering the overall five-level progression.
    revision: yes

  2. Referee: [§4] §4 (Evaluation section): The central claim that evaluations systematically overestimate progress by missing structural, temporal, and causal failures is supported only by a high-level description of benchmark review, stress tests, and case studies; without concrete failure statistics, quantitative comparisons to perceptual metrics, or tabulated examples of specific models' shortcomings, the critique remains qualitative and less load-bearing for the argument.

    Authors: We acknowledge that the evaluation critique would benefit from more concrete presentation. The manuscript already reviews specific benchmark limitations (e.g., reliance on FID and CLIP scores that overlook spatial and causal errors) and describes failure modes from in-the-wild tests and expert case studies. To make this evidence more load-bearing, we will insert a new summary table in the revised §4 that tabulates representative models, their reported perceptual scores, and the structural/temporal/causal shortcomings observed in the stress tests. This table will draw directly from the existing benchmark review and case studies rather than introducing new experiments, thereby preserving the paper's scope as a conceptual synthesis while improving traceability and impact of the argument.
    revision: yes
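
The five comparison dimensions named in the first response could be held in a small record type, sketched below. The field names follow the rebuttal's wording; the type `LevelProfile`, the helper `unfilled`, and the choice of which dimension each quoted characterization belongs to are assumptions made for illustration. Dimensions the rebuttal does not spell out are left as `None` rather than guessed.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass(frozen=True)
class LevelProfile:
    """One row of the promised comparison table (hypothetical layout)."""
    level: str
    interaction_scope: Optional[str] = None
    knowledge_integration: Optional[str] = None
    state_persistence: Optional[str] = None
    causal_reasoning_depth: Optional[str] = None
    evaluation_focus: Optional[str] = None


# Only what the rebuttal states is filled in; the dimension assignment is assumed.
agentic = LevelProfile(
    level="Agentic Generation",
    interaction_scope="goal-directed interaction with external tools or environments",
)
world_modeling = LevelProfile(
    level="World-Modeling Generation",
    state_persistence="internal, causally consistent simulation, constructed and maintained",
)


def unfilled(profile: LevelProfile) -> list[str]:
    """Dimensions that the source text leaves unspecified for this level."""
    return [f.name for f in fields(profile) if getattr(profile, f.name) is None]
```

Calling `unfilled(agentic)` lists the four dimensions the revised §3 would still need to define for that level, which is one way to check whether the two top levels are in fact separated on every axis.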

Circularity Check

0 steps flagged

No circularity: taxonomy is an external organizing framework with no derivations or self-referential reductions

full rationale

The paper is a survey and roadmap proposing a five-level taxonomy (Atomic Generation through World-Modeling Generation) to frame a shift toward intelligent visual generation. No equations, fitted parameters, predictions, or uniqueness theorems appear in the abstract or described content. The taxonomy functions as an interpretive lens rather than a result derived from internal definitions, self-citations, or data fits. Technical driver analysis and evaluation critiques rely on literature synthesis and stated benchmark reviews without reducing to the paper's own inputs by construction. This is a standard non-circular conceptual contribution.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 2 invented entities

The central argument rests on domain assumptions about current model shortcomings and introduces new conceptual categories in the taxonomy without independent empirical grounding or falsifiable predictions in the abstract.

axioms (2)
  • [domain assumption] Recent visual generation models struggle with spatial reasoning, persistent state, long-horizon consistency, and causal understanding.
    This limitation is stated directly in the abstract as the motivation for the proposed shift and taxonomy.
  • [domain assumption] Current evaluations often overestimate progress by emphasizing perceptual quality while missing structural, temporal, and causal failures.
    Asserted in the abstract as a key observation from benchmark review and case studies, but without the actual review details provided here.
invented entities (2)
  • Atomic Generation [no independent evidence]
    purpose: Lowest level in the taxonomy representing passive, appearance-focused renderers.
    Newly defined category in the five-level framework introduced by the authors.
  • World-Modeling Generation [no independent evidence]
    purpose: Highest level representing interactive, agentic generators with full world awareness and causal understanding.
    Newly defined target state in the taxonomy as the goal for intelligent visual generation.

pith-pipeline@v0.9.0 · 5599 in / 1575 out tokens · 90380 ms · 2026-05-07T06:33:59.071699+00:00 · methodology

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Power Reinforcement Post-Training of Text-to-Image Models with Super-Linear Advantage Shaping

    cs.CV · 2026-05 · unverdicted · novelty 6.0

    Super-Linear Advantage Shaping (SLAS) introduces a non-linear geometric policy update for RL post-training of text-to-image models that reshapes the local policy space via advantage-dependent Fisher-Rao weighting to r...

  2. WorldReasonBench: Human-Aligned Stress Testing of Video Generators as Future World-State Predictors

    cs.CV · 2026-05 · unverdicted · novelty 6.0

    The paper presents WorldReasonBench, a benchmark that tests video generators on maintaining physical, social, logical, and informational consistency when predicting future states from initial conditions and actions.

Reference graph

Works this paper leans on

104 extracted references · 99 canonical work pages · cited by 2 Pith papers · 36 internal anchors

  1. [1]

    World Simulation with Video Foundation Models for Physical AI

    Arslan Ali, Junjie Bai, Maciej Bala, Yogesh Balaji, Aaron Blakeman, Tiffany Cai, Jiaxin Cao, Tianshi Cao, Elizabeth Cha, Yu-Wei Chao, et al. World simulation with video foundation models for physical AI.arXiv preprint arXiv:2511.00062,

  2. [2]

    Wasserstein GAN

    Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein gan, 2017.https://arxiv.org/abs/1701.07875. Artificial Analysis. Artificial analysis image arena.https://artificialanalysis.ai/text-to-image,

  3. [3]

    Dream to manipulate: Compositional world models empowering robot imitation learning with imagination, 2025

    Leonardo Barcellona, Andrii Zadaianchuk, Davide Allegro, Samuele Papa, Stefano Ghidoni, and Efstratios Gavves. Dream to manipulate: Compositional world models empowering robot imitation learning with imagination.arXiv preprint arXiv:2412.14957,

  4. [4]

    Dragon: A large-scale dataset of realistic images generated by diffusion models.arXiv preprint arXiv:2505.11257, 2025a

    Giulia Bertazzini, Daniele Baracchi, Dasara Shullani, Isao Echizen, and Alessandro Piva. Dragon: A large-scale dataset of realistic images generated by diffusion models.arXiv preprint arXiv:2505.11257, 2025a. Giulia Bertazzini, Daniele Baracchi, Dasara Shullani, Isao Echizen, and Alessandro Piva. Dragon: A large-scale dataset of realistic images generated...

  5. [5]

    Training Diffusion Models with Reinforcement Learning

    Kevin Black, Michael Janner, Yilun Du, Ilya Kostrikov, and Sergey Levine. Training diffusion models with reinforcement learning.arXiv preprint arXiv:2305.13301,

  6. [6]

    Large Scale GAN Training for High Fidelity Natural Image Synthesis

    Andrew Brock, Jeff Donahue, and Karen Simonyan. Large scale gan training for high fidelity natural image synthesis, 2019.https://arxiv.org/abs/1809.11096. Tim Brooks, Aleksander Holynski, and Alexei A. Efros. Instructpix2pix: Learning to follow image editing instructions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition...

  7. [7]

    Z-Image: An Efficient Image Generation Foundation Model with Single-Stream Diffusion Transformer

    Huanqia Cai, Sihan Cao, Ruoyi Du, Peng Gao, Steven Hoi, Zhaohui Hou, Shijie Huang, Dengyang Jiang, Xin Jin, Liangchen Li, et al. Z-image: An efficient image generation foundation model with single-stream diffusion transformer. arXiv preprint arXiv:2511.22699,

  8. [8]

    Hunyuanimage 3.0 technical report.arXiv preprint arXiv:2509.23951, 2025

    Siyu Cao, Hangting Chen, Peng Chen, Yiji Cheng, Yutao Cui, Xinchi Deng, Ying Dong, Kipper Gong, Tianpeng Gu, Xiusen Gu, et al. Hunyuanimage 3.0 technical report.arXiv preprint arXiv:2509.23951,

  9. [10]

    Chameleon: Mixed-Modal Early-Fusion Foundation Models

    https://arxiv.org/abs/2405.09818. 109 Di Chang, Mingdeng Cao, Yichun Shi, Bo Liu, Shengqu Cai, Shijie Zhou, Weilin Huang, Gordon Wetzstein, Mohammad Soleymani, and Peng Wang. Bytemorph: Benchmarking instruction-guided image editing with non-rigid motions. arXiv preprint arXiv:2506.03107, 2025a. Di Chang, Mingdeng Cao, Yichun Shi, Bo Liu, Shengqu Cai, Shij...

  10. [11]

    Large Video Planner Enables Generalizable Robot Control

    Boyuan Chen, Tianyuan Zhang, Haoran Geng, Kiwhan Song, Caiyi Zhang, Peihao Li, William T Freeman, Jitendra Malik, Pieter Abbeel, Russ Tedrake, et al. Large video planner enables generalizable robot control.arXiv preprint arXiv:2512.15840, 2025a. Haoyu Chen, Xiaojie Xu, Wenbo Li, Jingjing Ren, Tian Ye, Songhua Liu, Ying-Cong Chen, Lei Zhu, and Xinchao Wang...

  11. [12]

    δ-dit: A training-free acceleration method tailored for diffusion transformers.arXiv preprint arXiv:2406.01125, 2024b

    Pengtao Chen, Mingzhu Shen, Peng Ye, Jianjian Cao, Chongjun Tu, Christos-Savvas Bouganis, Yiren Zhao, and Tao Chen. δ-dit: A training-free acceleration method tailored for diffusion transformers.arXiv preprint arXiv:2406.01125, 2024b. Ruoxi Chen, Dongping Chen, Siyuan Wu, Sinan Wang, Shiyun Lang, Peter Sushko, Gaoyang Jiang, Yao Wan, and Ranjay Krishna. M...

  12. [13]

    Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling

    Xi Chen, Yan Duan, Rein Houthooft, John Schulman, Ilya Sutskever, and Pieter Abbeel. Infogan: Interpretable representation learning by information maximizing generative adversarial nets, 2016.https://arxiv.org/abs/1606. 03657. Xiaokang Chen, Zhiyu Wu, Xingchao Liu, Zizheng Pan, Wen Liu, Zhenda Xie, Xingkai Yu, and Chong Ruan. Janus-pro: Unified multimodal...

  13. [14]

    PaddleOCR 3.0 Technical Report

    Cheng Cui, Ting Sun, Manhui Lin, Tingquan Gao, Yubo Zhang, Jiaxuan Liu, Xueqing Wang, Zelun Zhang, Changda Zhou, Hongen Liu, Yue Zhang, Wenyu Lv, Kui Huang, Yichao Zhang, Jing Zhang, Jun Zhang, Yi Liu, Dianhai Yu, and Yanjun Ma. Paddleocr 3.0 technical report, 2025.https://arxiv.org/abs/2507.05595. Dima Damen, Hazel Doughty, Giovanni Maria Farinella, Sanj...

  14. [15]

    Emerging Properties in Unified Multimodal Pretraining

    https://arxiv.org/abs/2505.14683. Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In2009 IEEE conference on computer vision and pattern recognition, pages 248–255. Ieee,

  15. [16]

    Prism: A unified framework for photorealistic reconstruction and intrinsic scene modeling, 2025.https://arxiv.org/abs/2504.14219

    Alara Dirik, Tuanfeng Wang, Duygu Ceylan, Stefanos Zafeiriou, and Anna Frühstück. Prism: A unified framework for photorealistic reconstruction and intrinsic scene modeling, 2025.https://arxiv.org/abs/2504.14219. Yilun Du, Sherry Yang, Bo Dai, Hanjun Dai, Ofir Nachum, Josh Tenenbaum, Dale Schuurmans, and Pieter Abbeel. Learning universal policies via text-...

  16. [17]

    and Barry Zhang

    Erik S. and Barry Zhang. Building effective agents. Anthropic Engineering Blog, December 2024.https://www. anthropic.com/engineering/building-effective-agents. Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, Dustin Podell, Tim Dockhorn, Zion English, Kyle Lac...

  17. [18]

    Scaling Rectified Flow Transformers for High-Resolution Image Synthesis

    https://arxiv.org/abs/2403.03206. Jiacheng Fan, Zhiyue Zhao, Yiqian Zhang, Chao Chen, Peide Wang, Hengdi Zhang, and Zhengxue Cheng. Robopaint: From human demonstration to any robot and any view.arXiv preprint arXiv:2602.05325,

  18. [19]

    Tinyfusion: Diffusion transformers learned shallow

    111 Gongfan Fang, Kunjun Li, Xinyin Ma, and Xinchao Wang. Tinyfusion: Diffusion transformers learned shallow. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 18144–18154, 2025a. HaoFang, ZechaoZhan, WeixinFeng, ZiweiHuang, XubinLi, andTiezhengGe. Tbstar-edit: Fromimageeditingpattern shifting to consistency enhancement.ArXiv...

  19. [20]

    One Step Diffusion via Shortcut Models

    Kevin Frans, Danijar Hafner, Sergey Levine, and Pieter Abbeel. One step diffusion via shortcut models.arXiv preprint arXiv:2410.12557,

  20. [21]

    arXiv preprint arXiv:2506.01943 , year=

    Xiao Fu, Xintao Wang, Xian Liu, Jianhong Bai, Runsen Xu, Pengfei Wan, Di Zhang, and Dahua Lin. Learning video generation for robotic manipulation with collaborative trajectory control.arXiv preprint arXiv:2506.01943,

  21. [22]

    An Image is Worth One Word: Personalizing Text-to-Image Generation using Textual Inversion

    Rinon Gal, Yuval Alaluf, Yuval Atzmon, Or Patashnik, Amit H Bermano, Gal Chechik, and Daniel Cohen-Or. An image is worth one word: Personalizing text-to-image generation using textual inversion.arXiv preprint arXiv:2208.01618,

  22. [23]

    Yu Gao, Lixue Gong, Qiushan Guo, Xiaoxia Hou, Zhichao Lai, Fanshi Li, Liang Li, Xiaochen Lian, Chao Liao, Liyang Liu, et al

    doi: 10.1109/TPAMI.2025.3610614. Yu Gao, Lixue Gong, Qiushan Guo, Xiaoxia Hou, Zhichao Lai, Fanshi Li, Liang Li, Xiaochen Lian, Chao Liao, Liyang Liu, et al. Seedream 3.0 technical report.arXiv preprint arXiv:2504.11346,

  23. [24]

    Seed-data-edit technical report: A hybrid dataset for instructional image editing

    Yuying Ge, Sijie Zhao, Chen Li, Yixiao Ge, and Ying Shan. Seed-data-edit technical report: A hybrid dataset for instructional image editing.arXiv preprint arXiv:2405.04007,

  24. [25]

    Mean Flows for One-step Generative Modeling

    Zhengyang Geng, Mingyang Deng, Xingjian Bai, J Zico Kolter, and Kaiming He. Mean flows for one-step generative modeling.arXiv preprint arXiv:2505.13447, 2025a. Zigang Geng, Yibing Wang, Yeyao Ma, Chen Li, Yongming Rao, Shuyang Gu, Zhao Zhong, Qinglin Lu, Han Hu, Xiaosong Zhang, Linus, Di Wang, and Jie Jiang. X-omni: Reinforcement learning makes discrete a...

  25. [26]

    Generative Adversarial Networks

    Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial networks, 2014.https://arxiv.org/abs/1406.2661. Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron Courville. Improved training of wasserstein gans, 2017.https://arxiv.org/abs/1...

  26. [27]

    Vision as a dialect: Unifying visual understanding and generation via text-aligned representations.arXiv preprint arXiv:2506.18898, 2025.https://arxiv.org/abs/2506.18898

    Jiaming Han, Hao Chen, Yang Zhao, Hanyu Wang, Qi Zhao, Ziyan Yang, Hao He, Xiangyu Yue, and Lu Jiang. Vision as a dialect: Unifying visual understanding and generation via text-aligned representations.arXiv preprint arXiv:2506.18898, 2025.https://arxiv.org/abs/2506.18898. Junjie He, Yifeng Geng, and Liefeng Bo. Uniportrait: A unified framework for identit...

  27. [28]

    arXiv preprint arXiv:2603.28088 , year=

    Zefeng He, Siyuan Huang, Xiaoye Qu, Yafu Li, Tong Zhu, Yu Cheng, and Yang Yang. Gems: Agent-native multimodal generation with memory and skills.arXiv preprint arXiv:2603.28088,

  28. [29]

    Classifier-Free Diffusion Guidance

    Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance.arXiv preprint arXiv:2207.12598, 2022.https: //arxiv.org/abs/2207.12598. Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models.Advances in neural information processing systems, 33:6840–6851,

  29. [30]

    GAIA-1: A Generative World Model for Autonomous Driving

    Anthony Hu, Lloyd Russell, Hudson Yeo, Zak Murez, George Fedoseev, Alex Kendall, Jamie Shotton, and Gianluca Corrado. GAIA-1: A generative world model for autonomous driving.arXiv preprint arXiv:2309.17080,

  30. [31]

    ELLA: Equip Diffusion Models with LLM for Enhanced Semantic Alignment

    Xiwei Hu, Rui Wang, Yixiao Fang, Bin Fu, Pei Cheng, and Gang Yu. Ella: Equip diffusion models with llm for enhanced semantic alignment.arXiv preprint arXiv:2403.05135, 2024a. Yucheng Hu, Yanjiang Guo, Pengchao Wang, Xiaoyu Chen, Yen-Jen Wang, Jianke Zhang, Koushil Sreenath, Chaochao Lu, and Jianyu Chen. Video prediction policy: A generalist robot policy w...

  31. [32]

    The GAN is dead; long live the GAN! A modern GAN baseline.arXiv preprint arXiv:2501.05441, 2025a.https://arxiv.org/abs/2501.05441

    Yiwen Huang, Aaron Gokaslan, Volodymyr Kuleshov, and James Tompkin. The GAN is dead; long live the GAN! A modern GAN baseline.arXiv preprint arXiv:2501.05441, 2025a.https://arxiv.org/abs/2501.05441. Zehuan Huang, Yuan-Chen Guo, Haoran Wang, Ran Yi, Lizhuang Ma, Yan-Pei Cao, and Lu Sheng. Mv-adapter: Multi-view consistent image generation made easy. InProc...

  32. [33]

    DeepSpeed Ulysses: System Optimizations for Enabling Training of Extreme Long Sequence Transformer Models

    Sam Ade Jacobs, Masahiro Tanaka, Chengming Zhang, Minjia Zhang, Shuaiwen Leon Song, Samyam Rajbhandari, and Yuxiong He. Deepspeed ulysses: System optimizations for enabling training of extreme long sequence transformer models.arXiv preprint arXiv:2309.14509,

  33. [34]

    Dreamgen: Unlocking generalization in robot learning through video world models.arXiv preprint arXiv:2505.12705, 2025

    Joel Jang, Seonghyeon Ye, Zongyu Lin, Jiannan Xiang, Johan Bjorck, Yu Fang, Fengyuan Hu, Spencer Huang, Kaushil Kundalia, Yen-Chen Lin, et al. Dreamgen: Unlocking generalization in robot learning through video world models. arXiv preprint arXiv:2505.12705,

  34. [35]

    COLE: A hierarchical generation framework for graphic design.arXiv preprint arXiv:2311.16974,

    Peidong Jia, Chenxuan Li, Zeyu Liu, Yichao Shen, Xingru Chen, Yuhui Yuan, Yinglin Zheng, Dong Chen, Ji Li, Xiaodong Xie, et al. COLE: A hierarchical generation framework for graphic design.arXiv preprint arXiv:2311.16974,

  35. [36]

    Lego-edit: A general image editing framework with model-level bricks and mllm builder.ArXiv, abs/2509.12883, 2025.https://api.semanticscholar.org/CorpusID:281325583

    Qifei Jia, Yu Liu, Yajie Chai, Xintong Yao, Qiming Lu, Yasen Zhang, Runyu Shi, Ying Huang, and Guoquan Zhang. Lego-edit: A general image editing framework with model-level bricks and mllm builder.ArXiv, abs/2509.12883, 2025.https://api.semanticscholar.org/CorpusID:281325583. Bowen Jiang, Yuan Yuan, Xinyi Bai, Zhuoqun Hao, Alyson Yin, Yaojie Hu, Wenyu Liao...

  36. [37]

    pdf, accessed 2026-04-22

    Available athttps://joyai-image.s3.cn-north-1.jdcloud-oss.com/JoyAI-Image. pdf, accessed 2026-04-22. Minguk Kang, Jun-Yan Zhu, Richard Zhang, Jaesik Park, Eli Shechtman, Sylvain Paris, and Taesung Park. Scaling up gans for text-to-image synthesis. InCVPR,

  37. [38]

    A style-based generator architecture for generative adversarial networks

    Tero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture for generative adversarial networks, 2019.https://arxiv.org/abs/1812.04948. 114 Tero Karras, Miika Aittala, Timo Aila, and Samuli Laine. Elucidating the design space of diffusion-based generative models. InAdvances in Neural Information Processing Systems, volume 35,

  38. [39]

    Gen2sim: Scaling up robot learning in simulation with generative models.arXiv preprint arXiv:2310.18308,

    Pushkal Katara, Zhou Xian, and Katerina Fragkiadaki. Gen2sim: Scaling up robot learning in simulation with generative models.arXiv preprint arXiv:2310.18308,

  39. [40]

    DROID: A Large-Scale In-The-Wild Robot Manipulation Dataset

    Alexander Khazatsky, Karl Pertsch, Suraj Nair, Ashwin Balakrishna, Sudeep Dasari, Siddharth Karamcheti, Soroush Nasiriany, Mohan Kumar Srirama, Lawrence Yunliang Chen, Kirsty Ellis, et al. Droid: A large-scale in-the-wild robot manipulation dataset.arXiv preprint arXiv:2403.12945,

  40. [41]

    Tenenbaum

    Po-Chen Ko, Jiayuan Mao, Yilun Du, Shao-Hua Sun, and Joshua B Tenenbaum. Learning to act from actionless videos through dense correspondences.arXiv preprint arXiv:2310.08576,

  41. [42]

    FLUX.1 Kontext: Flow Matching for In-Context Image Generation and Editing in Latent Space

    Black Forest Labs, Stephen Batifol, Andreas Blattmann, Frederic Boesel, Saksham Consul, Cyril Diagne, Tim Dockhorn, Jack English, Zion English, Patrick Esser, Sumith Kulal, Kyle Lacey, Yam Levi, Cheng Li, Dominik Lorenz, Jonas Müller, Dustin Podell, Robin Rombach, Harry Saini, Axel Sauer, and Luke Smith. Flux.1 kontext: Flow matching for in-context image ...

  42. [43]

    arXiv preprint arXiv:2508.09976 (2025)

    Marion Lepert, Jiaying Fang, and Jeannette Bohg. Masquerade: Learning from in-the-wild human videos using data-editing.arXiv preprint arXiv:2508.09976, 2025a. Marion Lepert, Jiaying Fang, and Jeannette Bohg. Phantom: Training robots without robots using only human videos. arXiv preprint arXiv:2503.00779, 2025b. Guangrun Li, Yaoxu Lyu, Zhuoyang Liu, Chengk...

  43. [44]

    Jarvisart: Liberating human artistic creativity via an intelligent photo retouching agent

    Yunlong Lin, Zixu Lin, Kunjie Lin, Jinbin Bai, Panwang Pan, Chenxin Li, Haoyu Chen, Zhongdao Wang, Xinghao Ding, Wenbo Li, et al. Jarvisart: Liberating human artistic creativity via an intelligent photo retouching agent. arXiv preprint arXiv:2506.17612,

  44. [45]

    Yaron Lipman, Ricky T. Q. Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. Flow matching for generative modeling, 2023. https://arxiv.org/abs/2210.02747. Dongyang Liu, Shitian Zhao, Le Zhuo, Weifeng Lin, Yi Xin, Xinyue Li, Qi Qin, Yu Qiao, Hongsheng Li, and Peng Gao. Lumina-mgpt: Illuminate flexible photorealistic text-to-image generation with multimod...

  45. [46]

    Simplifying, Stabilizing and Scaling Continuous-Time Consistency Models

    Cheng Lu and Yang Song. Simplifying, stabilizing and scaling continuous-time consistency models. arXiv preprint arXiv:2410.11081.

  46. [47]

    Dpm-solver: A fast ode solver for diffusion probabilistic model sampling in around 10 steps

    Cheng Lu, Yuhao Zhou, Fan Bao, Jianfei Chen, Chongxuan Li, and Jun Zhu. Dpm-solver: A fast ode solver for diffusion probabilistic model sampling in around 10 steps. In Advances in Neural Information Processing Systems, volume 35, 2022a. Cheng Lu, Yuhao Zhou, Fan Bao, Jianfei Chen, Chongxuan Li, and Jun Zhu. Dpm-solver++: Fast solver for guided sampling of ...

  47. [48]

    Beyond flat text: Dual self-inherited guidance for visual text generation

    Minxing Luo, Zixun Xia, Liaojun Chen, Zhenhang Li, Weichao Zeng, Jianye Wang, Wentao Cheng, Yaxing Wang, Yu ZHOU, and Jian Yang. Beyond flat text: Dual self-inherited guidance for visual text generation. ArXiv, abs/2501.05892, 2025a. https://api.semanticscholar.org/CorpusID:275458598. Xin Luo, Jiahao Wang...

  48. [49]

    Glyphdraw2: Automatic generation of complex glyph posters with diffusion models and large language models

    Jian Ma, Yonglin Deng, Chen Chen, Nanyang Du, Haonan Lu, and Zhenyu Yang. Glyphdraw2: Automatic generation of complex glyph posters with diffusion models and large language models. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 5955–5963, 2025a. Jiancang Ma, Xujie Zhu, Zihao Pan, Qirong Peng, Xu Guo, Chen Chen, and H...

  49. [50]

    Shapesplat: A large-scale dataset of gaussian splats and their self-supervised pretraining

    Qi Ma, Yue Li, Bin Ren, Nicu Sebe, Ender Konukoglu, Theo Gevers, Luc Van Gool, and Danda Pani Paudel. Shapesplat: A large-scale dataset of gaussian splats and their self-supervised pretraining. arXiv preprint arXiv:2408.10906, 2024b. Qianli Ma, Yaowei Zheng, Zhelun Shi, Zhongkai Zhao, Bin Jia, Ziyue Huang, Zhiqi Lin, Youjie Li, Jiacheng Yang, Yanghua Peng,...

  50. [51]

    Visual autoregressive modeling for instruction-guided image editing

    Qingyang Mao, Qi Cai, Yehao Li, Yingwei Pan, Mingyue Cheng, Ting Yao, Qi Liu, and Tao Mei. Visual autoregressive modeling for instruction-guided image editing. ArXiv, abs/2508.15772, 2025. https://api.semanticscholar.org/CorpusID:280700028. Xudong Mao, Qing Li, Haoran Xie, Raymond Y. K. Lau, Zhen Wang, and Stephen Paul Smolley. Least squares generative adv...

  51. [52]

    Video generation models in robotics: Applications, research challenges, future directions

    Zhiting Mei, Tenny Yin, Ola Shorinwa, Apurva Badithela, Zhonghe Zheng, Joseph Bruno, Madison Bland, Lihan Zha, Asher Hancock, Jaime Fernández Fisac, Philip Dames, and Anirudha Majumdar. Video generation models in robotics: Applications, research challenges, future directions. arXiv preprint arXiv:2601.07823.

  52. [53]

    Phybench: A physical commonsense benchmark for evaluating text-to-image models

    Fanqing Meng, Wenqi Shao, Lixin Luo, Yahong Wang, Yiran Chen, Quanfeng Lu, Yue Yang, Tianshuo Yang, Kaipeng Zhang, Yu Qiao, et al. Phybench: A physical commonsense benchmark for evaluating text-to-image models. arXiv preprint arXiv:2406.11802.

  53. [54]

    Spectral Normalization for Generative Adversarial Networks

    Takeru Miyato, Toshiki Kataoka, Masanori Koyama, and Yuichi Yoshida. Spectral normalization for generative adversarial networks, 2018. https://arxiv.org/abs/1802.05957. MMagic Contributors. MMagic: OpenMMLab multimodal advanced, generative, and intelligent creation toolbox. https://github.com/open-mmlab/mmagic.

  54. [55]

    Dreamo: A unified framework for image customization

    Chong Mou, Yanze Wu, Wenxu Wu, Zinan Guo, Pengze Zhang, Yufeng Cheng, Yiming Luo, Fei Ding, Shiwen Zhang, Xinghui Li, et al. Dreamo: A unified framework for image customization. In Proceedings of the SIGGRAPH Asia 2025 Conference Papers, pages 1–12.

  55. [56]

    Transition matching distillation for fast video generation

    Weili Nie, Julius Berner, Nanye Ma, Chao Liu, Saining Xie, and Arash Vahdat. Transition matching distillation for fast video generation. arXiv preprint arXiv:2601.09881, 2026.

  56. [57]

    Wise: A world knowledge-informed semantic evaluation for text-to-image generation

    Yuwei Niu, Munan Ning, Mengren Zheng, Weiyang Jin, Bin Lin, Peng Jin, Jiaqi Liao, Chaoran Feng, Kunpeng Ning, Bin Zhu, et al. Wise: A world knowledge-informed semantic evaluation for text-to-image generation. arXiv preprint arXiv:2503.07265.

  57. [58]

    Improving robotic manipulation robustness via NICE scene surgery

    Sajjad Pakdamansavoji, Mozhgan Pourkeshavarz, Adam Sigal, Zhiyuan Li, Rui Heng Yang, and Amir Rasouli. Improving robotic manipulation robustness via NICE scene surgery. arXiv preprint arXiv:2511.22777, 2025a. Sajjad Pakdamansavoji, Mozhgan Pourkeshavarz, Adam Sigal, Zhiyuan Li, Rui Heng Yang, and Amir Rasouli. Improving robotic manipulation robustness via ...

  58. [59]

    Scalable Diffusion Models with Transformers

    William Peebles and Saining Xie. Scalable diffusion models with transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2023. https://arxiv.org/abs/2212.09748. Xu Peng, Junwei Zhu, Boyuan Jiang, Ying Tai, Donghao Luo, Jiangning Zh...

  59. [60]

    Flow-factory: A unified framework for reinforcement learning in flow-matching models

    Bowen Ping, Chengyou Jia, Minnan Luo, Hangwei Qian, and Ivor Tsang. Flow-factory: A unified framework for reinforcement learning in flow-matching models. arXiv preprint arXiv:2602.12529.

  60. [61]

    Aligning text-to-image diffusion models with reward backpropagation

    Mihir Prabhudesai, Anirudh Goyal, Deepak Pathak, and Katerina Fragkiadaki. Aligning text-to-image diffusion models with reward backpropagation. arXiv preprint arXiv:2310.03739, 2023.

  61. [62]

    Art: Anonymous region transformer for variable multi-layer transparent image generation

    Yifan Pu, Yiming Zhao, Zhicong Tang, Ruihong Yin, Haoxing Ye, Yuhui Yuan, Dong Chen, Jianmin Bao, Sirui Zhang, Yanbin Wang, Lin Liang, Lijuan Wang, Ji Li, Xiu Li, Zhouhui Lian, Gao Huang, and Baining Guo. Art: Anonymous region transformer for variable multi-layer transparent image generation. arXiv preprint arXiv:2502.18364, 2025a. Yifan Pu, Yiming Zha...

  62. [63]

    TokenFlow: Unified image tokenizer for multimodal understanding and generation

    Liao Qu, Huichao Zhang, Yiheng Liu, Xu Wang, Yi Jiang, Yiming Gao, Hu Ye, Daniel K. Du, Zehuan Yuan, and Xinglong Wu. TokenFlow: Unified image tokenizer for multimodal understanding and generation. arXiv preprint arXiv:2412.03069, 2024. https://arxiv.org/abs/2412.03069. Julian Quevedo, Quinn McIntyre, Spruce Campbell, Xinlei Chen, and Robert Wachen. Oasis: ...

  63. [64]

    Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks

    Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks, 2016. https://arxiv.org/abs/1511.06434. Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, and Ilya Sutskever. Zero-shot text-to-image generation. In International conference...

  64. [65]

    High-Resolution Image Synthesis with Latent Diffusion Models

    Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models, 2022. https://arxiv.org/abs/2112.10752. Nataniel Ruiz, Yuanzhen Li, Varun Jampani, Yael Pritch, Michael Rubinstein, and Kfir Aberman. Dreambooth: Fine tuning text-to-image diffusion models for subject-driven genera...

  65. [66]

    Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding

    Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily Denton, Seyed Kamyar Seyed Ghasemipour, Burcu Karagol Ayan, S. Sara Mahdavi, Rapha Gontijo Lopes, Tim Salimans, Jonathan Ho, David J Fleet, and Mohammad Norouzi. Photorealistic text-to-image diffusion models with deep language understanding, 2022. https://arxiv.org/abs/2205.11487. Tim...

  66. [67]

    Imagenworld: Stress-testing image generation models with explainable human evaluation on open-ended real-world tasks

    Samin Mahdizadeh Sani, Max Ku, Nima Jamali, Matina Mahdizadeh Sani, Paria Khoshtab, Wei-Chieh Sun, Parnian Fazel, Zhi Rui Tam, Thomas Chong, Edisy Kin Wai Chan, et al. Imagenworld: Stress-testing image generation models with explainable human evaluation on open-ended real-world tasks. In The Fourteenth International Con...

  67. [68]

    Proximal Policy Optimization Algorithms

    John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347.

  68. [69]

    Diff2flow: Training flow matching models via diffusion model alignment

    Johannes Schusterbauer, Ming Gui, Frank Fundel, and Björn Ommer. Diff2flow: Training flow matching models via diffusion model alignment, 2025. https://arxiv.org/abs/2506.02221. Team Seedream, Yunpeng Chen, Yu Gao, Lixue Gong, Meng Guo, Qiushan Guo, Zhiyao Guo, Xiaoxia Hou, Weilin Huang, Yixuan Huang, et al. Seedream 4.0: Toward next-generation multimod...

  69. [70]

    Post-training quantization on diffusion models

    Yuzhang Shang, Zhihang Yuan, Bin Xie, Bingzhe Wu, and Yan Yan. Post-training quantization on diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 1972–1981.

  70. [71]

    DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

    Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300.

  71. [72]

    Imagharmony: Controllable image editing with consistent object quantity and layout

    Fei Shen, Xiaoyu Du, Yutong Gao, Jian Yu, Yushe Cao, Xing Lei, and Jinhui Tang. Imagharmony: Controllable image editing with consistent object quantity and layout. ArXiv, abs/2506.01949, 2025a. https://api.semanticscholar.org/CorpusID:279119734. Yang Shen, Xiu-Shen Wei, Yifan Sun, Yuxin Song, Tao Yuan, Jian Jin, Heyang Xu, Yazhou Yao, and Errui Ding. Expla...

  72. [73]

    Videovla: Video generators can be generalizable robot manipulators

    Yichao Shen, Fangyun Wei, Zhiying Du, Yaobo Liang, Yan Lu, Jiaolong Yang, Nanning Zheng, and Baining Guo. Videovla: Video generators can be generalizable robot manipulators. arXiv preprint arXiv:2512.06963, 2025b. Jing Shi, Wei Xiong, Zhe Lin, and Hyun Joon Jung. Instantbooth: Personalized text-to-image generation without test-time finetuning. In Proceeding...

  73. [74]

    Latent diffusion model without variational autoencoder

    Minglei Shi, Haolin Wang, Wenzhao Zheng, Ziyang Yuan, Xiaoshi Wu, Xintao Wang, Pengfei Wan, Jie Zhou, and Jiwen Lu. Latent diffusion model without variational autoencoder. arXiv preprint arXiv:2510.15301, 2025a. https://arxiv.org/abs/2510.15301. Wenda Shi, Yiren Song, Zihan Rao, Dengming Zhang, Jiaming Liu, and Xingxing Zou. Wordcon: Word-level typography ...

  74. [75]

    Chimera: Compositional image generation using part-based concepting

    Shivam Singh, Yiming Chen, Agneet Chatterjee, Amit Raj, James Hays, Yezhou Yang, and Chitta Baral. Chimera: Compositional image generation using part-based concepting, 2025. https://arxiv.org/abs/2510.18083. Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. In International Conference on Learning Representations, 2021a. Yan...

  75. [76]

    Mitty: Diffusion-based human-to-robot video generation

    Yiren Song, Cheng Liu, Weijia Mao, and Mike Zheng Shou. Mitty: Diffusion-based human-to-robot video generation. arXiv preprint arXiv:2512.17253, 2025a. Yuxin Song, Wenkai Dong, Shizun Wang, Qi Zhang, Song Xue, Tao Yuan, Hu Yang, Haocheng Feng, Hang Zhou, Xinyan Xiao, et al. Query-kontext: An unified multimodal model for image generation and editing. arXiv ...

  76. [78]

    Autoregressive Model Beats Diffusion: Llama for Scalable Image Generation

    https://arxiv.org/abs/2406.06525. Peter Sushko, Ayana Bharadwaj, Zhi Yang Lim, Vasily Ilin, Ben Caffee, Dongping Chen, Mohammadreza Salehi, Cheng-Yu Hsieh, and Ranjay Krishna. Realedit: Reddit edits as a large-scale empirical dataset for image transformations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), p...

  77. [79]

    Understanding generative ai capabilities in everyday image editing tasks

    Mohammad Reza Taesiri, Brandon Collins, Logan Bolton, Viet Dac Lai, Franck Dernoncourt, Trung Bui, and Anh Totti Nguyen. Understanding generative ai capabilities in everyday image editing tasks. arXiv preprint arXiv:2505.16181.

  78. [80]

    Instantcharacter: Personalize any characters with a scalable diffusion transformer framework

    Jiale Tao, Yanbing Zhang, Qixun Wang, Yiji Cheng, Haofan Wang, Xu Bai, Zhengguang Zhou, Ruihuang Li, Linqing Wang, Chunyu Wang, Qin Lin, and Qinglin Lu. Instantcharacter: Personalize any characters with a scalable diffusion transformer framework. ArXiv, abs/2504.12395, 2025a. https://api.semanticscholar.org/CorpusID:277856764. Tang Tao, Likui Zhang, Youpen...

  79. [81]

    Diffusion models are real-time game engines

    Dani Valevski, Yaniv Leviathan, Moab Arar, and Shlomi Fruchter. Diffusion models are real-time game engines. arXiv preprint arXiv:2408.14837, 2024.

  80. [82]

    DataVisT5: A pre-trained language model for jointly understanding text and data visualization

    Zhuoyue Wan, Yuanfeng Song, Shuaimin Li, Chen Jason Zhang, and Raymond Chi-Wing Wong. DataVisT5: A pre-trained language model for jointly understanding text and data visualization. In 41st IEEE International Conference on Data Engineering, ICDE 2025, pages 1704–1717. IEEE.

Showing first 80 references.