pith. machine review for the scientific record.

arxiv: 2604.19632 · v1 · submitted 2026-04-21 · 💻 cs.CV

Recognition: unknown

CreatiParser: Generative Image Parsing of Raster Graphic Designs into Editable Layers

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 03:16 UTC · model grok-4.3

classification 💻 cs.CV
keywords graphic design parsing · raster to layers · generative decomposition · diffusion models · vision-language models · editable layers · layer extraction

The pith

A hybrid generative model parses flat graphic designs into separately editable text, background, and sticker layers.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper seeks to replace multi-stage pipelines for breaking down raster design images with a single generative process that outputs editable layers. Existing approaches suffer from accumulated errors when predicting layouts then filling in details separately. The new framework handles text through a vision-language model that produces a re-editable rendering protocol while using diffusion branches to create transparent background and sticker layers. A preference reward model trained with group relative policy optimization steers the outputs toward human-preferred designs. If the approach works as claimed, designers would gain direct control to change text or rearrange elements in existing images without regenerating the whole composition from scratch.
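The paper's protocol format is not reproduced here, so the following is a minimal sketch of what such a layered parse result could look like. The schema and names (`TextElement`, `ParsedDesign`, `edit_headline`) are hypothetical, chosen only to make the editability claim concrete:

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

import numpy as np


@dataclass
class TextElement:
    """One entry in a text rendering protocol: enough to re-render the glyphs."""
    content: str                             # the string to draw
    font_family: str                         # e.g. "Roboto-Bold"
    font_size_px: int
    color_rgba: Tuple[int, int, int, int]    # channels in 0-255
    bbox_xywh: Tuple[int, int, int, int]     # position and size on the canvas


@dataclass
class ParsedDesign:
    """Result of a raster-to-layer parse: editable protocol plus transparent layers."""
    text_protocol: List[TextElement] = field(default_factory=list)
    background: Optional[np.ndarray] = None                    # H x W x 4 RGBA array
    stickers: List[np.ndarray] = field(default_factory=list)  # each H x W x 4 RGBA


def edit_headline(design: ParsedDesign, new_text: str) -> ParsedDesign:
    """Change only one text element; background and stickers stay untouched."""
    design.text_protocol[0].content = new_text
    return design
```

Editing a protocol entry and re-rendering only the text layer is what would let a designer change a headline without touching the diffusion-generated background or stickers.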

Core claim

We propose a hybrid generative framework that decomposes a raster graphic design image into an editable text layer via a vision-language model outputting a text rendering protocol, plus background and sticker layers via a multi-branch diffusion architecture with RGBA channels. ParserReward is introduced and combined with Group Relative Policy Optimization to align the generated layers with human design preferences, yielding better results than prior methods on the Parser-40K and Crello datasets.

What carries the argument

The hybrid generative parsing architecture that routes text regions through a vision-language model to extract a re-usable rendering protocol while generating background and sticker layers in a multi-branch diffusion model with RGBA support, all refined by ParserReward scoring under Group Relative Policy Optimization.

If this is right

  • Text elements can be altered by editing the rendering protocol while leaving background and sticker layers untouched.
  • Background and sticker elements are produced with explicit transparency so they can be moved or removed independently (a compositing sketch follows this list).
  • The single generative pass avoids the error buildup that occurs when layout prediction, matting, and inpainting run in sequence.
  • Preference alignment through the reward model reduces artifacts that commonly appear in layer decompositions.
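The independence claims above rest on ordinary alpha compositing. This is not code from the paper, just the standard Porter-Duff "over" operator, sketched to show why explicit RGBA layers make moving or removing an element a purely local edit:

```python
import numpy as np


def over(fg: np.ndarray, bg: np.ndarray) -> np.ndarray:
    """Porter-Duff 'over': composite one RGBA layer onto another.

    Both arrays are H x W x 4 floats in [0, 1] with straight (non-premultiplied) alpha.
    """
    fg_rgb, fg_a = fg[..., :3], fg[..., 3:4]
    bg_rgb, bg_a = bg[..., :3], bg[..., 3:4]
    out_a = fg_a + bg_a * (1.0 - fg_a)
    # Guard against division by zero where both layers are fully transparent.
    out_rgb = (fg_rgb * fg_a + bg_rgb * bg_a * (1.0 - fg_a)) / np.maximum(out_a, 1e-8)
    return np.concatenate([out_rgb, out_a], axis=-1)


def flatten(layers: list) -> np.ndarray:
    """Composite back-to-front: [background, sticker_1, ..., text_layer].

    Removing a sticker is just deleting one entry from this list before flattening.
    """
    result = layers[0]
    for layer in layers[1:]:
        result = over(layer, result)
    return result
```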

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same layer-separation logic could be applied to user-interface screenshots or illustration files to support automated editing workflows.
  • Retraining the reward model on designer feedback from a specific software tool might improve results for that tool's typical output styles.
  • Once layers are extracted, the parsed text protocol could feed directly into vector graphics editors for further refinement (a serialization sketch follows this list).
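On the last point, a text rendering protocol maps naturally onto SVG. A toy serializer for the hypothetical `TextElement` schema sketched earlier (baseline placement is deliberately crude):

```python
from xml.sax.saxutils import escape


def text_element_to_svg(el) -> str:
    """Serialize one protocol entry as an SVG <text> node.

    Assumes the hypothetical TextElement fields from the earlier sketch;
    y + h approximates the text baseline from the bounding box.
    """
    x, y, _, h = el.bbox_xywh
    r, g, b, a = el.color_rgba
    return (
        f'<text x="{x}" y="{y + h}" '
        f'font-family="{el.font_family}" font-size="{el.font_size_px}" '
        f'fill="rgb({r},{g},{b})" fill-opacity="{a / 255:.2f}">'
        f'{escape(el.content)}</text>'
    )
```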

Load-bearing premise

The ParserReward model together with Group Relative Policy Optimization reliably encodes human design preferences, and the chosen datasets represent the full variety of graphic designs without major unseen styles.
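The GRPO half of this premise is at least mechanically simple: sample K candidate outputs per input, score each with the reward model, and normalize rewards within the group. A minimal sketch, with ParserReward stood in for by arbitrary scalar scores (the reward model's internals are not specified in the abstract):

```python
import torch


def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Group-relative advantages at the core of GRPO.

    rewards: (num_designs, K) scores, e.g. from a preference reward model,
    over K candidate layer decompositions per input design.
    """
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)


# Toy example: 2 designs, K = 4 candidate decompositions each.
rewards = torch.tensor([[0.2, 0.8, 0.5, 0.5],
                        [0.9, 0.1, 0.4, 0.6]])
advantages = grpo_advantages(rewards)  # positive = better than its group's average
```

Whether this machinery reliably encodes human design preferences is exactly what the sketch cannot show; it depends entirely on what ParserReward was trained on.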

What would settle it

A new test collection of graphic designs drawn from sources outside the Parser-40K and Crello datasets, on which the hybrid method produces layers that are harder to edit or visually worse than the outputs of existing layout-plus-matting pipelines.
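Operationally, that test is a small harness: sample designs from sources neither dataset covers, run every parser, and tabulate the same metrics. A sketch with placeholder names only, not the paper's API:

```python
def evaluate_out_of_distribution(methods, ood_designs, metrics):
    """Score each parsing method on designs outside Parser-40K/Crello.

    methods: {name: callable(design) -> layers}
    metrics: {metric_name: callable(design, layers) -> float}
    """
    scores = {name: {m: [] for m in metrics} for name in methods}
    for design in ood_designs:
        for name, parse in methods.items():
            layers = parse(design)
            for m, metric_fn in metrics.items():
                scores[name][m].append(metric_fn(design, layers))
    return scores
```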

Figures

Figures reproduced from arXiv: 2604.19632 by Dexiang Hong, Lei Zhang, Weidong Chen, Xinyan Liu, Yongdong Zhang, Yutao Cheng, Zhendong Mao.

Figure 1. Illustration of graphic design image parsing. Given a…

Figure 2. Overview of the proposed CreatiParser framework. The framework comprises three components: (a) the VLM-based Text Layer Parsing module (upper left), where a QwenLM-based multimodal decoder with LoRA generates text rendering protocols from the input graphic design, which a render engine then rasterizes into the text layer; (b) the Multi-branch Diffusion module (lower left), where three SDXL U-Net branches with Layer Token Attention…

Figure 3. Illustration of Layer Token Attention (LTA). Tokens…

Figure 4. From left to right column: (a) the input poster image; (b) the reconstructed design by compositing the parsed…

Figure 5. Comparison of background layer generation quality by…

Figure 7. Generalization comparison between CreatiParser and LayerD…

Figure 8. Effect of GRPO group size K on performance. The dashed line denotes the baseline without GRPO (K = 1).

Figure 9. Effect of LoRA rank in the multi-branch diffusion…

Figure 10. Effect of LoRA rank in Qwen3-VL on text parsing…
Original abstract

Graphic design images consist of multiple editable layers, such as text, background, and decorative elements, while most generative models produce rasterized outputs without explicit layer structures, limiting downstream editing. Existing graphic design parsing methods typically rely on multi-stage pipelines combining layout prediction, matting, and inpainting, which suffer from error accumulation and limited controllability. We propose a hybrid generative framework for raster-to-layer graphic design parsing that decomposes a design image into editable text, background, and sticker layers. Text regions are parsed using a vision-language model into a text rendering protocol, enabling faithful reconstruction and flexible re-editing, while background and sticker layers are generated using a multi-branch diffusion architecture with RGBA support. We further introduce ParserReward and integrate it with Group Relative Policy Optimization to align generation quality with human design preferences. Extensive experiments on two challenging datasets, i.e., the Parser-40K and Crello datasets, demonstrate superior performance over existing methods, e.g., achieving an overall average improvement of 23.7% across all metrics.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

1 major / 0 minor

Summary. The manuscript presents CreatiParser, a hybrid generative framework for raster-to-layer parsing of graphic designs. It uses a vision-language model to parse text regions into a rendering protocol for editable text, a multi-branch diffusion model with RGBA support for background and sticker layers, and introduces ParserReward combined with Group Relative Policy Optimization (GRPO) to align the generation with human design preferences. The approach is evaluated on the Parser-40K and Crello datasets, claiming an overall average improvement of 23.7% across all metrics compared to existing methods.

Significance. If the empirical results hold under proper controls, this work could meaningfully advance controllable generative parsing for graphic design by producing explicitly editable layer decompositions instead of flat raster outputs. The hybrid VLM-plus-diffusion pipeline augmented by a custom reward model for preference alignment is a timely direction that addresses error accumulation in multi-stage baselines and limited editability in standard generative models.

major comments (1)
  1. Abstract: The headline claim of an 'overall average improvement of 23.7% across all metrics' on Parser-40K and Crello is load-bearing for the paper's contribution, yet the abstract supplies no concrete metric definitions, baseline names, per-metric scores, or ablation results isolating ParserReward + GRPO. Without these details it is impossible to determine whether the reported gains arise from the new alignment step or from the base VLM/diffusion architecture, directly undermining the central empirical assertion.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the constructive feedback on the abstract's presentation of results. We address the major comment below and will revise the manuscript accordingly to improve clarity.

Point-by-point responses
  1. Referee: [—] Abstract: The headline claim of an 'overall average improvement of 23.7% across all metrics' on Parser-40K and Crello is load-bearing for the paper's contribution, yet the abstract supplies no concrete metric definitions, baseline names, per-metric scores, or ablation results isolating ParserReward + GRPO. Without these details it is impossible to determine whether the reported gains arise from the new alignment step or from the base VLM/diffusion architecture, directly undermining the central empirical assertion.

    Authors: We agree that the abstract presents the 23.7% average improvement at a high level without sufficient context. The full experimental section provides the requested details: Table 2 reports per-metric scores (PSNR, SSIM, LPIPS, FID, and editability metrics) for each layer type on both Parser-40K and Crello; the baselines are the prior state-of-the-art graphic design parsing methods enumerated in Section 4.1; and Table 5 contains the ablation study isolating ParserReward + GRPO, showing that the alignment components contribute an additional 8-12% relative gain beyond the base VLM-diffusion pipeline. The 23.7% figure is the mean relative improvement across all metrics and datasets. To address the concern directly, we will revise the abstract to (1) name the primary baselines, (2) briefly categorize the metrics, and (3) note that ablations confirm the alignment step's contribution. This keeps the abstract concise while allowing readers to evaluate the source of the gains without immediately consulting the full results. revision: yes
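For readers wondering how a single 23.7% emerges from mixed metric families, the rebuttal's definition ("mean relative improvement across all metrics and datasets") implies an aggregation like the one below. The numbers are invented purely for illustration; error-style metrics (LPIPS, FID) must be sign-flipped so that reductions count as gains:

```python
def mean_relative_improvement(ours: dict, baseline: dict,
                              lower_is_better=("LPIPS", "FID")) -> float:
    """Average per-metric relative gain of `ours` over `baseline`, in percent."""
    gains = []
    for name, base in baseline.items():
        delta = (ours[name] - base) / abs(base)
        if name in lower_is_better:
            delta = -delta  # a drop in an error metric counts as an improvement
        gains.append(delta)
    return 100.0 * sum(gains) / len(gains)


# Hypothetical scores, for illustration only (not from the paper).
print(mean_relative_improvement(
    ours={"PSNR": 28.4, "SSIM": 0.91, "LPIPS": 0.08, "FID": 11.2},
    baseline={"PSNR": 24.1, "SSIM": 0.84, "LPIPS": 0.12, "FID": 16.0},
))  # ~22.4
```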

Circularity Check

0 steps flagged

No circularity in the empirical pipeline

full rationale

The paper presents an empirical hybrid generative framework that decomposes raster designs into editable layers via a VLM for text rendering and multi-branch diffusion for RGBA layers, with ParserReward integrated via Group Relative Policy Optimization for preference alignment. No equations, derivations, or first-principles results are described that reduce by construction to fitted parameters, self-definitions, or self-citation chains. Performance claims rest on experiments against external benchmarks (Parser-40K and Crello), making the approach self-contained without any load-bearing circular reductions.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the domain assumption that graphic designs decompose cleanly into text/background/sticker layers and on the effectiveness of the new ParserReward component; the abstract specifies no free parameters.

axioms (1)
  • domain assumption Graphic designs can be usefully decomposed into independent text, background, and sticker layers
    This decomposition is the foundation of the entire parsing framework described in the abstract.
invented entities (1)
  • ParserReward no independent evidence
    purpose: To score and align generated layers with human design preferences via GRPO
    Newly introduced reward model whose details and validation are not provided in the abstract.

pith-pipeline@v0.9.0 · 5504 in / 1388 out tokens · 45817 ms · 2026-05-10T03:16:04.834439+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

46 extracted references · 7 canonical work pages · 5 internal anchors

  1. [1]

    Seedream 4.0: Toward Next-generation Multimodal Image Generation

    T. Seedream, Y. Chen, Y. Gao, L. Gong, M. Guo, Q. Guo, Z. Guo, X. Hou, W. Huang, Y. Huang et al., "Seedream 4.0: Toward next-generation multimodal image generation," arXiv preprint arXiv:2509.20427, 2025.

  2. [2]

    Z-Image: An Efficient Image Generation Foundation Model with Single-Stream Diffusion Transformer

    H. Cai, S. Cao, R. Du, P. Gao, S. Hoi, S. Huang, Z. Hou, D. Jiang, X. Jin, L. Li et al., "Z-Image: An efficient image generation foundation model with single-stream diffusion transformer," arXiv preprint arXiv:2511.22699, 2025.

  3. [3]

    CookGALIP: Recipe controllable generative adversarial CLIPs with sequential ingredient prompts for food image generation

    M. Xu, J. Wang, M. Tao, B.-K. Bao, and C. Xu, "CookGALIP: Recipe controllable generative adversarial CLIPs with sequential ingredient prompts for food image generation," IEEE Transactions on Multimedia, 2024.

  4. [4]

    MGDefect: A mask-guided high-quality defect image generation method for improving defect inspection

    X. Jiang, Y. Li, F. Yan, Y. Lu, C. Xu, and M. Xu, "MGDefect: A mask-guided high-quality defect image generation method for improving defect inspection," IEEE Transactions on Multimedia, 2025.

  5. [5]

    Semantic distance adversarial learning for text-to-image synthesis

    B. Yuan, Y. Sheng, B.-K. Bao, Y.-P. P. Chen, and C. Xu, "Semantic distance adversarial learning for text-to-image synthesis," IEEE Transactions on Multimedia, vol. 26, pp. 1255–1266, 2023.

  6. [6]

    Few-shot generative model adaptation via style-guided prompt

    S. Pan, Z. Zhang, K. Wei, X. Yang, and C. Deng, "Few-shot generative model adaptation via style-guided prompt," IEEE Transactions on Multimedia, vol. 26, pp. 7661–7672, 2024.

  7. [7]

    Decomposition of graphic design with unified multimodal model

    H. Nie, Z. Zhang, Y. Cheng, M. Yang, G. Shi, Q. Xie, J. Shao, and X. Wu, "Decomposition of graphic design with unified multimodal model," in Forty-second International Conference on Machine Learning, 2025.

  8. [8]

    Rethinking layered graphic design generation with a top-down approach

    J. Chen, Z. Wang, N. Zhao, L. Zhang, D. Liu, J. Yang, and Q. Chen, "Rethinking layered graphic design generation with a top-down approach," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2025, pp. 16861–16870.

  9. [9]

    LayerD: Decomposing raster graphic designs into layers

    T. Suzuki, K.-J. Liu, N. Inoue, and K. Yamaguchi, "LayerD: Decomposing raster graphic designs into layers," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2025, pp. 17783–17792.

  10. [10]

    Resolution-robust large mask inpainting with Fourier convolutions

    R. Suvorov, E. Logacheva et al., "Resolution-robust large mask inpainting with Fourier convolutions," in WACV, 2022.

  11. [11]

    Mutual dual-task generator with adaptive attention fusion for image inpainting

    Y. Zhang, Y. Liu, R. Hu, Q. Wu, and J. Zhang, "Mutual dual-task generator with adaptive attention fusion for image inpainting," IEEE Transactions on Multimedia, vol. 26, pp. 1539–1550, 2023.

  12. [12]

    DreamLayer: Simultaneous multi-layer generation via diffusion model

    J. Huang, P. Yan, J. Cai, J. Liu, Z. Wang, Y. Wang, X. Wu, and G. Li, "DreamLayer: Simultaneous multi-layer generation via diffusion model," in Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2025.

  13. [13]

    MuLAn: A multi layer annotated dataset for controllable text-to-image generation

    P.-D. Tudosiu, Y. Yang, S. Zhang, F. Chen, S. McDonagh, G. Lampouras, I. Iacobacci, and S. Parisot, "MuLAn: A multi layer annotated dataset for controllable text-to-image generation," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 22413–22422.

  14. [14]

    Generative image layer decomposition with visual effects

    J. Yang, Q. Liu, Y. Li, S. Y. Kim, D. Pakhomov, M. Ren, J. Zhang, Z. Lin, C. Xie, and Y. Zhou, "Generative image layer decomposition with visual effects," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025.

  15. [15]

    Decomposing images into layers via RGB-space geometry

    J. Tan, J.-M. Lien, and Y. Gingold, "Decomposing images into layers via RGB-space geometry," ACM Transactions on Graphics, vol. 36, no. 1, 2016.

  16. [16]

    Efficient palette-based decomposition and recoloring of images via RGBXY-space geometry

    J. Tan, J. Echevarria, and Y. Gingold, "Efficient palette-based decomposition and recoloring of images via RGBXY-space geometry," ACM Transactions on Graphics, vol. 37, no. 6, 2018.

  17. [17]

    Unmixing-based soft color segmentation for image manipulation

    Y. Aksoy, T. O. Aydin, A. Smolić, and M. Pollefeys, "Unmixing-based soft color segmentation for image manipulation," ACM Transactions on Graphics, vol. 36, no. 2, 2017.

  18. [18]

    Fast soft color segmentation

    N. Akimoto, H. Zhu, Y. Jin, and Y. Aoki, "Fast soft color segmentation," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020.

  19. [19]

    Decomposing images into layers with advanced color blending

    Y. Koyama and M. Goto, "Decomposing images into layers with advanced color blending," Computer Graphics Forum, vol. 37, no. 7, 2018.

  20. [20]

    Fast nonlinear image unblending

    D. Horita, K. Aizawa, R. Suzuki, T. Yonetsuji, and H. Zhu, "Fast nonlinear image unblending," in Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 2022.

  21. [21]

    DetCLIPv2: Scalable open-vocabulary object detection pre-training via word-region alignment

    L. Yao, J. Han, X. Liang, D. Xu, W. Zhang, Z. Li, and H. Xu, "DetCLIPv2: Scalable open-vocabulary object detection pre-training via word-region alignment," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023.

  22. [22]

    Segment anything

    A. Kirillov, E. Mintun, N. Ravi, H. Mao, C. Rolland, L. Gustafson, T. Xiao, S. Whitehead, A. C. Berg, W.-Y. Lo et al., "Segment anything," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023.

  23. [23]

    Towards robust monocular depth estimation: Mixing datasets for zero-shot cross-dataset transfer

    R. Ranftl, K. Lasinger, D. Hafner, K. Schindler, and V. Koltun, "Towards robust monocular depth estimation: Mixing datasets for zero-shot cross-dataset transfer," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 44, no. 3, 2022.

  24. [24]

    Instance-wise occlusion and depth orders in natural scenes

    H. Lee and J. Park, "Instance-wise occlusion and depth orders in natural scenes," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022.

  25. [25]

    Learning trimaps via clicks for image matting

    C. Zhang, Y. Hu, H. Ding, H. Shi, Y. Zhao, and Y. Wei, "Learning trimaps via clicks for image matting," IEEE Transactions on Multimedia, 2025.

  26. [26]

    Neural discrete representation learning

    A. van den Oord, O. Vinyals, and K. Kavukcuoglu, "Neural discrete representation learning," in Advances in Neural Information Processing Systems, vol. 30, 2017.

  27. [27]

    Taming transformers for high-resolution image synthesis

    P. Esser, R. Rombach, and B. Ommer, "Taming transformers for high-resolution image synthesis," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 12873–12883.

  28. [28]

    LoRA: Low-rank adaptation of large language models

    E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, W. Chen et al., "LoRA: Low-rank adaptation of large language models," ICLR, vol. 1, no. 2, p. 3, 2022.

  29. [29]

    Parameter-efficient fine-tuning of large-scale pre-trained language models

    N. Ding, Y. Qin, G. Yang, F. Wei, Z. Yang, Y. Su, S. Hu, Y. Chen, C.-M. Chan, W. Chen et al., "Parameter-efficient fine-tuning of large-scale pre-trained language models," Nature Machine Intelligence, vol. 5, no. 3, pp. 220–235, 2023.

  30. [30]

    Aligning large multimodal models with factually augmented RLHF

    Z. Sun, S. Shen, S. Cao, H. Liu, C. Li, Y. Shen, C. Gan, L.-Y. Gui, Y.-X. Wang, Y. Yang, K. Keutzer, and T. Darrell, "Aligning large multimodal models with factually augmented RLHF," in Findings of the Association for Computational Linguistics: ACL 2024, 2024, pp. 13088–13110.

  31. [31]

    RLHF-V: Towards trustworthy MLLMs via behavior alignment from fine-grained correctional human feedback

    T. Yu, Y. Yao, H. Zhang, T. He, Y. Han, G. Cui, J. Hu, Z. Liu, H.-T. Zheng, M. Sun, and T.-S. Chua, "RLHF-V: Towards trustworthy MLLMs via behavior alignment from fine-grained correctional human feedback," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 13807–13816.

  32. [32]

    RLAIF-V: Open-source AI feedback leads to super GPT-4V trustworthiness

    T. Yu, H. Zhang, Q. Li, Q. Xu, Y. Yao, D. Chen, X. Lu, G. Cui, Y. Dang, T. He, X. Feng, J. Song, B. Zheng, Z. Liu, T.-S. Chua, and M. Sun, "RLAIF-V: Open-source AI feedback leads to super GPT-4V trustworthiness," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025, pp. 19985–19995.

  33. [33]

    MM-RLHF: The next step forward in multimodal LLM alignment

    Y.-F. Zhang, T. Yu, H. Tian, C. Fu, P. Li, J. Zeng, W. Xie, Y. Shi, H. Zhang, J. Wu, X. Wang, Y. Hu, B. Wen, F. Yang, Z. Zhang, T. Gao, D. Zhang, L. Wang, R. Jin, and T. Tan, "MM-RLHF: The next step forward in multimodal LLM alignment," in Proceedings of the 42nd International Conference on Machine Learning, ser. Proceedings of Machine Learning Research...

  34. [34]

    Perception-R1: Pioneering perception policy with reinforcement learning

    E. Yu, K. Lin, L. Zhao, J. Yin, Y. Wei, Y. Peng, H. Wei, J. Sun, C. Han, Z. Ge, X. Zhang, D. Jiang, J. Wang, and W. Tao, "Perception-R1: Pioneering perception policy with reinforcement learning," in The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025.

  35. [35]

    Ground-R1: Incentivizing grounded visual reasoning via reinforcement learning

    M. Cao, H. Zhao, C. Zhang, X. Chang, I. Reid, and X. Liang, "Ground-R1: Incentivizing grounded visual reasoning via reinforcement learning," arXiv preprint arXiv:2505.20272, 2025.

  36. [36]

    VideoChat-R1: Enhancing spatio-temporal perception via reinforcement fine-tuning

    X. Li, Z. Yan, D. Meng, L. Dong, X. Zeng, Y. He, Y. Wang, Y. Qiao, Y. Wang, and L. Wang, "VideoChat-R1: Enhancing spatio-temporal perception via reinforcement fine-tuning," in The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025.

  37. [37]

    Video-R1: Reinforcing video reasoning in MLLMs

    K. Feng, K. Gong, B. Li, Z. Guo, Y. Wang, T. Peng, J. Wu, X. Zhang, B. Wang, and X. Yue, "Video-R1: Reinforcing video reasoning in MLLMs," in The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025. [Online]. Available: https://openreview.net/forum?id=a2JTVVvcEl

  38. [38]

    OpenThinkIMG: Learning to think with images via visual tool reinforcement learning

    Z. Su, L. Li, M. Song, Y. Hao, Z. Yang, J. Zhang, G. Chen, J. Gu, J. Li, X. Qu, and Y. Cheng, "OpenThinkIMG: Learning to think with images via visual tool reinforcement learning," arXiv preprint arXiv:2505.08617, 2025.

  39. [39]

    AnimeDiff: Customized image generation of anime characters using diffusion model

    Y. Jiang, Q. Liu, D. Chen, L. Yuan, and Y. Fu, "AnimeDiff: Customized image generation of anime characters using diffusion model," IEEE Transactions on Multimedia, vol. 26, pp. 10559–10572, 2024.

  40. [40]

    SGDM: An adaptive style-guided diffusion model for personalized text to image generation

    Y. Xu, X. Xu, H. Gao, and F. Xiao, "SGDM: An adaptive style-guided diffusion model for personalized text to image generation," IEEE Transactions on Multimedia, vol. 26, pp. 9804–9813, 2024.

  41. [41]

    SDXL: Improving latent diffusion models for high-resolution image synthesis

    D. Podell, Z. English, K. Lacey, A. Blattmann, T. Dockhorn, J. Müller, J. Penna, and R. Rombach, "SDXL: Improving latent diffusion models for high-resolution image synthesis," in International Conference on Learning Representations, 2024.

  42. [42]

    Transparent image layer diffusion using latent transparency

    L. Zhang and M. Agrawala, "Transparent image layer diffusion using latent transparency," ACM Transactions on Graphics, vol. 43, no. 4, pp. 1–15, 2024.

  43. [43]

    Qwen3 Technical Report

    A. Yang, A. Li, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Gao, C. Huang, C. Lv et al., "Qwen3 technical report," arXiv preprint arXiv:2505.09388, 2025.

  44. [44]

    CanvasVAE: Learning to generate vector graphic documents

    K. Yamaguchi, "CanvasVAE: Learning to generate vector graphic documents," ICCV, 2021.

  45. [45]

    Grounded SAM: Assembling Open-World Models for Diverse Visual Tasks

    T. Ren, S. Liu, A. Zeng, J. Lin, K. Li, H. Cao, J. Chen, X. Huang, Y. Chen, F. Yan et al., "Grounded SAM: Assembling open-world models for diverse visual tasks," arXiv preprint arXiv:2401.14159, 2024.

  46. [46]

    GPT-4 Technical Report

    OpenAI, "GPT-4 technical report," arXiv preprint arXiv:2303.08774, 2023.