
arxiv: 2605.30248 · v2 · pith:MWJEYFE6new · submitted 2026-05-28 · 💻 cs.CV

GenClaw: Code-Driven Agentic Image Generation

Pith reviewed 2026-06-29 07:35 UTC · model grok-4.3

classification 💻 cs.CV
keywords agentic image generation · code-driven generation · controllable visual synthesis · SVG code rendering · multimodal agents · staged image creation · interpretable generation

The pith

GenClaw turns AI image generation into a staged process by inserting executable code sketches between reasoning and pixel synthesis.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces GenClaw as a workflow in which an agent first builds conceptual context through search and reasoning, then writes code such as SVG, HTML, or ThreeJS to produce an executable sketch, and finally applies an image model to add textures and realism. This inserts code as a controllable intermediate layer that lets the agent directly manipulate structure instead of cycling through prompt revisions alone. The approach aims to make generation more like human creation, where planning, sketching, and coloring occur in sequence. A sympathetic reader would care because it addresses the lack of direct canvas control in current multimodal agents that depend entirely on black-box image models.

Core claim

GenClaw is a code-driven agentic image generation paradigm that empowers the agent to create like a human artist: first conceptualizing, then sketching, and finally coloring. Specifically, the agent first constructs the conceptual knowledge and context through search and reasoning. It then utilizes code (e.g., SVG, HTML, ThreeJS) to render executable visual sketches. Finally, it employs an image generation model to supplement textures, materials, and photorealism. In this workflow, code serves as a controllable intermediate canvas bridging linguistic reasoning and pixel synthesis.
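
The three stages described above can be sketched as a small pipeline. This is a minimal illustration, not the authors' implementation: the names `Concept`, `conceptualize`, `sketch_svg`, `build_refinement_request`, and `run_pipeline` are hypothetical, the reasoning stage is a stub with hard-coded facts, and the final stage only packages a request instead of calling any real image model.

```python
from dataclasses import dataclass, field
from typing import Dict, List
from xml.sax.saxutils import escape


@dataclass
class Concept:
    """Stage 1 output (hypothetical structure): what the agent decided to draw."""
    subject: str
    facts: List[str] = field(default_factory=list)


def conceptualize(request: str) -> Concept:
    # Stub for search and reasoning. A real agent would call an LLM and a
    # search tool here; the facts below are hard-coded for illustration.
    return Concept(subject=request,
                   facts=["sun is a yellow disc", "horizon sits at mid-height"])


def sketch_svg(concept: Concept, width: int = 200, height: int = 100) -> str:
    # Stage 2: emit an executable sketch as SVG. The geometry is fixed here;
    # in the proposed workflow the agent would write this code itself.
    horizon = height // 2
    return (
        f'<svg xmlns="http://www.w3.org/2000/svg" width="{width}" height="{height}">'
        f"<title>{escape(concept.subject)}</title>"
        f'<rect id="sky" x="0" y="0" width="{width}" height="{horizon}" fill="lightblue"/>'
        f'<rect id="ground" x="0" y="{horizon}" width="{width}" '
        f'height="{height - horizon}" fill="green"/>'
        f'<circle id="sun" cx="{width // 4}" cy="{horizon // 2}" r="10" fill="yellow"/>'
        "</svg>"
    )


def build_refinement_request(concept: Concept, svg: str) -> Dict[str, str]:
    # Stage 3: package the sketch and a prompt for an image model.
    # No model is called; the dictionary layout is an assumption.
    prompt = (f"Photorealistic rendering of {concept.subject}; "
              "keep the layout of the attached sketch.")
    return {"prompt": prompt, "sketch_svg": svg}


def run_pipeline(request: str) -> Dict[str, str]:
    concept = conceptualize(request)
    svg = sketch_svg(concept)
    return build_refinement_request(concept, svg)


result = run_pipeline("a sunrise over a field")
print(result["prompt"])
```

The point of the sketch is the data flow: the SVG string is an inspectable artifact that sits between the reasoning output and the request handed to the image model.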

What carries the argument

Code as a controllable intermediate canvas that bridges the agent's linguistic reasoning and downstream pixel synthesis in a three-stage workflow.

If this is right

  • Agents gain direct editability over visual structure by modifying the code sketch rather than relying solely on repeated prompt adjustments.
  • The generation process gains interpretability because the code layer exposes the agent's reasoning in a readable and revisable form.
  • Programmatic control over layout and elements can be combined with the strengths of generative models for photorealistic output.
  • The workflow reduces dependence on black-box refinement loops by providing an explicit intermediate representation.
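
To make the editability point concrete, here is a small sketch of changing layout by editing the code layer directly instead of rewriting a prompt. The SVG snippet, the element ids, and the helper `set_attributes` are invented for illustration; only Python's standard `xml.etree.ElementTree` module is used.

```python
import xml.etree.ElementTree as ET

SVG_NS = "http://www.w3.org/2000/svg"
# Serialize SVG elements without an "ns0:" prefix.
ET.register_namespace("", SVG_NS)

# Hypothetical sketch as an agent might have written it.
SKETCH = (
    '<svg xmlns="http://www.w3.org/2000/svg" width="200" height="100">'
    '<rect id="ground" x="0" y="50" width="200" height="50" fill="green"/>'
    '<circle id="sun" cx="50" cy="25" r="10" fill="yellow"/>'
    "</svg>"
)


def set_attributes(svg_text: str, element_id: str, **attrs) -> str:
    """Return a copy of the sketch with attributes changed on one element."""
    root = ET.fromstring(svg_text)
    for node in root.iter():
        if node.get("id") == element_id:
            for name, value in attrs.items():
                node.set(name, str(value))
            return ET.tostring(root, encoding="unicode")
    raise KeyError(f"no element with id={element_id!r}")


# Move the sun from the left of the canvas to the right; nothing else changes.
edited = set_attributes(SKETCH, "sun", cx=150)
print(edited)
```

The edit is local and deterministic: only the `cx` attribute of the `sun` element changes. Whether a downstream image model then respects that geometry is a separate question that this sketch does not address.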

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If code-generation reliability increases, the same staged approach could apply to domains such as 3D scene creation or animation where intermediate representations already exist.
  • The separation of structure via code from appearance via image models could reduce the need to retrain large image generators for better structural fidelity.
  • Inspecting and editing the code canvas might offer a practical debugging path for generation failures that current prompt-only systems lack.

Load-bearing premise

That large language models can generate accurate, executable code that correctly captures the agent's intended visual concept, and that this code combines cleanly with image models.

What would settle it

An experiment in which the code produced by the agent repeatedly fails to match the planned concept or introduces visual errors that the final image model cannot resolve.
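
A first, shallow version of such an experiment could be scripted as below. This is a sketch under stated assumptions: the sample sketches and the required ids are invented, the helpers `check_sketch` and `failure_rate` are hypothetical, and the checks cover only XML well-formedness and the presence of planned elements. Judging whether a sketch actually matches the intended concept would need human or model-based evaluation, which is not shown.

```python
import xml.etree.ElementTree as ET
from typing import Iterable, List

SVG_TAG = "{http://www.w3.org/2000/svg}svg"


def check_sketch(svg_text: str, required_ids: Iterable[str]) -> List[str]:
    """Return a list of problems; an empty list means these shallow checks passed."""
    try:
        root = ET.fromstring(svg_text)
    except ET.ParseError as exc:
        return [f"not well-formed XML: {exc}"]
    problems = []
    if root.tag not in ("svg", SVG_TAG):
        problems.append(f"root element is {root.tag!r}, not svg")
    found = {node.get("id") for node in root.iter()}
    for missing in sorted(set(required_ids) - found):
        problems.append(f"planned element missing: {missing}")
    return problems


def failure_rate(samples: List[str], required_ids: Iterable[str]) -> float:
    """Fraction of sketches with at least one problem."""
    if not samples:
        raise ValueError("need at least one sample")
    required = list(required_ids)
    failures = sum(1 for text in samples if check_sketch(text, required))
    return failures / len(samples)


# Invented samples: one valid, one missing an element, one malformed, one not SVG.
SAMPLES = [
    '<svg xmlns="http://www.w3.org/2000/svg"><circle id="sun" r="5"/><rect id="ground"/></svg>',
    '<svg xmlns="http://www.w3.org/2000/svg"><circle id="sun" r="5"/></svg>',
    '<svg xmlns="http://www.w3.org/2000/svg"><circle id="sun" r="5"></svg>',
    "<html><body/></html>",
]

rate = failure_rate(SAMPLES, ["sun", "ground"])
print(rate)  # 0.75 for the four samples above
```

A persistently high rate on checks this basic would already count against the premise; a low rate would not confirm it, since well-formed code can still depict the wrong thing.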

Original abstract

Image generation models have evolved from text-conditioned pixel synthesis toward multimodal agents endowed with visual comprehension and tool invocation capabilities. Yet, existing agents remain at the mercy of underlying black-box image models. Their workflow is trapped in a repetitive cycle of prompt rewriting for generation refinement, leaving them with no mechanism to directly manipulate the canvas. In essence, the potential of LLMs to serve as a genuine "brush" for precise visual construction remains largely untapped. In this paper, we propose GenClaw, a code-driven agentic image generation paradigm that empowers the agent to create like a human artist: first conceptualizing, then sketching, and finally coloring. Specifically, the agent first constructs the conceptual knowledge and context through search and reasoning. It then utilizes code (e.g., SVG, HTML, ThreeJS) to render executable visual sketches. Finally, it employs an image generation model to supplement textures, materials, and photorealism. In this workflow, code serves as a controllable intermediate canvas bridging linguistic reasoning and pixel synthesis, seamlessly integrating programmatic logic with the visual expressiveness of generative models. By transforming image generation from a black-box paradigm into a staged process akin to authentic human creation, GenClaw offers a step toward for highly controllable and interpretable visual generation systems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The manuscript proposes GenClaw, a code-driven agentic image generation paradigm in which an LLM agent first builds conceptual knowledge via search and reasoning, then renders an executable visual sketch using code (SVG, HTML, or ThreeJS), and finally applies an image generation model to add textures, materials, and photorealism. Code is positioned as a controllable intermediate canvas that bridges linguistic reasoning and pixel synthesis, transforming black-box generation into a staged, human-like process for greater controllability and interpretability.

Significance. If the workflow can be shown to work reliably, the staged code-mediated approach could meaningfully improve controllability and interpretability in agentic image generation by allowing direct programmatic manipulation of visual structure before photorealistic refinement. This would represent a conceptual advance over purely prompt-based black-box agents.

major comments (3)
  1. Abstract: The central claim that GenClaw 'empowers the agent to create like a human artist' and offers 'a step toward highly controllable and interpretable visual generation systems' is unsupported because the manuscript contains no experiments, ablation studies, user evaluations, quantitative metrics, or even implementation details demonstrating that the proposed workflow achieves these benefits.
  2. Abstract: The proposal rests on the untested assumption that current LLMs can reliably emit correct, intent-preserving executable code (SVG/HTML/ThreeJS) that accurately captures conceptual reasoning; no failure-mode analysis, error rates, or comparison against black-box baselines is provided to substantiate this.
  3. Abstract: No evidence or discussion is given on whether the code-to-image handoff preserves control or introduces new artifacts, which is load-bearing for the claim that code serves as a 'seamlessly integrating' controllable intermediate representation.
minor comments (1)
  1. Abstract: The sentence 'offers a step toward for highly controllable' contains a grammatical error and should be corrected.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments on our manuscript. The work presents a conceptual proposal for a code-mediated workflow rather than an empirical study. We will revise the abstract and add a dedicated limitations and future work section to ensure claims are appropriately scoped and to outline evaluation directions.

Point-by-point responses
  1. Referee: Abstract: The central claim that GenClaw 'empowers the agent to create like a human artist' and offers 'a step toward highly controllable and interpretable visual generation systems' is unsupported because the manuscript contains no experiments, ablation studies, user evaluations, quantitative metrics, or even implementation details demonstrating that the proposed workflow achieves these benefits.

    Authors: We agree that the current abstract language overstates demonstrated outcomes. The manuscript introduces the staged workflow as a conceptual paradigm; the quoted phrases describe intended properties of the approach. In revision we will replace these with more measured wording (e.g., 'aims to empower' and 'potentially offers a step toward') and add an explicit statement that empirical validation remains future work. revision: yes

  2. Referee: Abstract: The proposal rests on the untested assumption that current LLMs can reliably emit correct, intent-preserving executable code (SVG/HTML/ThreeJS) that accurately captures conceptual reasoning; no failure-mode analysis, error rates, or comparison against black-box baselines is provided to substantiate this.

    Authors: The manuscript does rely on the premise that LLMs can produce usable code sketches, drawing from observed capabilities in related literature, but provides no dedicated analysis of failure modes. We will add a new subsection under Limitations that enumerates known risks (syntax errors, semantic drift, style mismatch) and sketches how future controlled studies could quantify them against direct image-generation baselines. revision: yes

  3. Referee: Abstract: No evidence or discussion is given on whether the code-to-image handoff preserves control or introduces new artifacts, which is load-bearing for the claim that code serves as a 'seamlessly integrating' controllable intermediate representation.

    Authors: We concur that the handoff step is central and currently undiscussed. Revision will include a short analysis of the transition, noting that the image model receives both the rendered sketch and a textual prompt derived from the same reasoning trace, and will flag potential artifacts (e.g., loss of precise geometry, texture hallucination). We will also qualify the term 'seamlessly integrating' to 'intended to integrate'. revision: yes

Circularity Check

0 steps flagged

No circularity: conceptual paradigm proposal without derivations or self-referential reductions

Full rationale

The manuscript is a forward-looking proposal for a code-driven agentic workflow (conceptualization → code sketch → image refinement) with no equations, fitted parameters, predictions, or load-bearing self-citations. No step reduces by construction to its own inputs, as there are no quantitative claims, uniqueness theorems, or ansatzes to inspect. The central claim remains a descriptive suggestion rather than a derived result.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the untested domain assumption that LLMs can produce reliable executable visual code and that this code meaningfully improves controllability over direct pixel synthesis.

axioms (1)
  • domain assumption: LLMs can reliably generate correct executable code (SVG, HTML, ThreeJS) that captures conceptual knowledge for visual sketches.
    The workflow in the abstract depends on this capability for the sketching stage to function as a controllable intermediate.

pith-pipeline@v0.9.1-grok · 5767 in / 1325 out tokens · 44744 ms · 2026-06-29T07:35:01.736989+00:00 · methodology

