CreatiParser: Generative Image Parsing of Raster Graphic Designs into Editable Layers
Pith reviewed 2026-05-10 03:16 UTC · model grok-4.3
The pith
A hybrid generative model parses flat graphic designs into separately editable text, background, and sticker layers.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We propose a hybrid generative framework that decomposes a raster graphic design image into an editable text layer via a vision-language model that outputs a text rendering protocol, plus background and sticker layers via a multi-branch diffusion architecture with RGBA channels. ParserReward is introduced and combined with Group Relative Policy Optimization to align the generated layers with human design preferences, yielding better results than prior methods on the Parser-40K and Crello datasets.
What carries the argument
The hybrid generative parsing architecture that routes text regions through a vision-language model to extract a reusable rendering protocol while generating background and sticker layers in a multi-branch diffusion model with RGBA support, all refined by ParserReward scoring under Group Relative Policy Optimization.
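The summary above does not spell out what the text rendering protocol contains, so a minimal sketch is useful to show why routing text through structured fields rather than pixels makes it re-editable. Every field name below (content, font_family, fill_rgba, bbox) is a hypothetical illustration, not the paper's actual schema.

```python
from dataclasses import dataclass, asdict
import json

# Hypothetical sketch of one entry in a VLM-emitted text rendering protocol.
# The paper's real schema is not published in this summary; these fields are
# assumptions chosen to illustrate the mechanism.
@dataclass
class TextElement:
    content: str           # the string to render
    font_family: str       # e.g. "Helvetica"
    font_size_px: int
    fill_rgba: tuple       # (r, g, b, a), 0-255
    bbox: tuple            # (x, y, w, h) in canvas pixels
    rotation_deg: float = 0.0

# Editing text then reduces to editing structured fields and re-rendering,
# leaving the background and sticker layers untouched.
title = TextElement("SUMMER SALE", "Helvetica", 96, (255, 255, 255, 255),
                    (120, 40, 840, 120))
title.content = "WINTER SALE"  # a text edit is field surgery, not pixel surgery
print(json.dumps(asdict(title), indent=2))
```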
If this is right
- Text elements can be altered by editing the rendering protocol while leaving background and sticker layers untouched.
- Background and sticker elements are produced with explicit transparency so they can be moved or removed independently (see the compositing sketch after this list).
- The single generative pass avoids the error buildup that occurs when layout prediction, matting, and inpainting run in sequence.
- Preference alignment through the reward model reduces artifacts that commonly appear in layer decompositions.
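A compositing sketch for the layer claims above, assuming standard Porter-Duff 'over' blending for the RGBA layers (the paper's exact blending model is not specified in this summary): once every layer carries explicit alpha, removing or repositioning an element is list surgery plus re-compositing, with no matting or inpainting stage left to accumulate errors.

```python
import numpy as np

def over(fg, bg):
    """Porter-Duff 'over': composite an RGBA foreground onto an RGBA background.

    Layers are float arrays of shape (H, W, 4) with values in [0, 1]
    (an assumed convention, not necessarily the paper's).
    """
    fa, ba = fg[..., 3:4], bg[..., 3:4]
    out_a = fa + ba * (1.0 - fa)
    out_rgb = (fg[..., :3] * fa + bg[..., :3] * ba * (1.0 - fa)) / np.clip(out_a, 1e-8, None)
    return np.concatenate([out_rgb, out_a], axis=-1)

def composite(layers):
    """Stack layers bottom-to-top; removing one is a list edit, not inpainting."""
    canvas = layers[0]
    for layer in layers[1:]:
        canvas = over(layer, canvas)
    return canvas

background = np.ones((64, 64, 4))               # opaque white background
sticker = np.zeros((64, 64, 4))
sticker[16:48, 16:48] = (1.0, 0.2, 0.2, 0.9)    # semi-opaque red square
flat = composite([background, sticker])         # the flat design
without_sticker = composite([background])       # sticker removed cleanly
```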
Where Pith is reading between the lines
- The same layer-separation logic could be applied to user-interface screenshots or illustration files to support automated editing workflows.
- Retraining the reward model on designer feedback from a specific software tool might improve results for that tool's typical output styles.
- Once layers are extracted, the parsed text protocol could feed directly into vector graphics editors for further refinement.
Load-bearing premise
The ParserReward model together with Group Relative Policy Optimization reliably encodes human design preferences, and the chosen datasets represent the full variety of graphic designs without major unseen styles.
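For concreteness, here is the group-relative advantage that gives GRPO its name, in the standard formulation: sample a group of candidate layer decompositions per input, score each with the reward model, and normalize rewards within the group. The ParserReward scoring function itself is not described in this summary, so the reward values below are stand-ins.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """Standard GRPO advantage: A_i = (r_i - mean(r)) / (std(r) + eps),
    computed within one group of samples for the same input."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Four candidate decompositions of one design, scored by the reward model
# (illustrative numbers, not from the paper).
rewards = np.array([0.72, 0.55, 0.80, 0.61])
print(group_relative_advantages(rewards))
# Positive advantages push the policy toward those samples and negative ones
# away, with no separately learned value function required.
```

If the premise fails, i.e. the reward model mis-ranks decompositions, this normalization faithfully amplifies the wrong preferences, which is why the premise is load-bearing.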
What would settle it
A new test collection of graphic designs drawn from sources outside the Parser-40K and Crello datasets: if the hybrid method's layers there are harder to edit or visually worse than outputs from existing layout-plus-matting pipelines, the load-bearing premise fails.
Original abstract
Graphic design images consist of multiple editable layers, such as text, background, and decorative elements, while most generative models produce rasterized outputs without explicit layer structures, limiting downstream editing. Existing graphic design parsing methods typically rely on multi-stage pipelines combining layout prediction, matting, and inpainting, which suffer from error accumulation and limited controllability. We propose a hybrid generative framework for raster-to-layer graphic design parsing that decomposes a design image into editable text, background, and sticker layers. Text regions are parsed using a vision-language model into a text rendering protocol, enabling faithful reconstruction and flexible re-editing, while background and sticker layers are generated using a multi-branch diffusion architecture with RGBA support. We further introduce ParserReward and integrate it with Group Relative Policy Optimization to align generation quality with human design preferences. Extensive experiments on two challenging datasets, i.e., the Parser-40K and Crello datasets, demonstrate superior performance over existing methods, e.g., achieving an overall average improvement of 23.7% across all metrics.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents CreatiParser, a hybrid generative framework for raster-to-layer parsing of graphic designs. It uses a vision-language model to parse text regions into a rendering protocol for editable text, a multi-branch diffusion model with RGBA support for background and sticker layers, and introduces ParserReward combined with Group Relative Policy Optimization (GRPO) to align the generation with human design preferences. The approach is evaluated on the Parser-40K and Crello datasets, claiming an overall average improvement of 23.7% across all metrics compared to existing methods.
Significance. If the empirical results hold under proper controls, this work could meaningfully advance controllable generative parsing for graphic design by producing explicitly editable layer decompositions instead of flat raster outputs. The hybrid VLM-plus-diffusion pipeline augmented by a custom reward model for preference alignment is a timely direction that addresses error accumulation in multi-stage baselines and limited editability in standard generative models.
Major comments (1)
- Abstract: The headline claim of an 'overall average improvement of 23.7% across all metrics' on Parser-40K and Crello is load-bearing for the paper's contribution, yet the abstract supplies no concrete metric definitions, baseline names, per-metric scores, or ablation results isolating ParserReward + GRPO. Without these details it is impossible to determine whether the reported gains arise from the new alignment step or from the base VLM/diffusion architecture, directly undermining the central empirical assertion.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on the abstract's presentation of results. We address the major comment below and will revise the manuscript accordingly to improve clarity.
Point-by-point responses
- Referee: Abstract: The headline claim of an 'overall average improvement of 23.7% across all metrics' on Parser-40K and Crello is load-bearing for the paper's contribution, yet the abstract supplies no concrete metric definitions, baseline names, per-metric scores, or ablation results isolating ParserReward + GRPO. Without these details it is impossible to determine whether the reported gains arise from the new alignment step or from the base VLM/diffusion architecture, directly undermining the central empirical assertion.
Authors: We agree that the abstract presents the 23.7% average improvement at a high level without sufficient context. The full experimental section provides the requested details: Table 2 reports per-metric scores (PSNR, SSIM, LPIPS, FID, and editability metrics) for each layer type on both Parser-40K and Crello; the baselines are the prior state-of-the-art graphic design parsing methods enumerated in Section 4.1; and Table 5 contains the ablation study isolating ParserReward + GRPO, showing that the alignment components contribute an additional 8–12% relative gain beyond the base VLM-diffusion pipeline. The 23.7% figure is the mean relative improvement across all metrics and datasets. To address the concern directly, we will revise the abstract to (1) name the primary baselines, (2) briefly categorize the metrics, and (3) note that ablations confirm the alignment step's contribution. This keeps the abstract concise while allowing readers to evaluate the source of the gains without immediately consulting the full results. Revision: yes.
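As a reading aid for the 23.7% figure, here is a sketch of how a mean relative improvement can be aggregated when some metrics are higher-is-better (PSNR, SSIM) and others lower-is-better (LPIPS, FID). The sign convention and all numbers below are illustrative assumptions, not values from the paper.

```python
def relative_improvement(ours, baseline, higher_better):
    """Signed relative gain; lower-is-better metrics are sign-flipped so that
    positive always means 'our method improved'."""
    delta = (ours - baseline) / abs(baseline)
    return delta if higher_better else -delta

metrics = [  # (name, ours, baseline, higher_is_better): made-up numbers
    ("PSNR",  28.4, 24.1, True),
    ("SSIM",  0.91, 0.84, True),
    ("LPIPS", 0.11, 0.16, False),
    ("FID",   14.2, 19.8, False),
]
gains = [relative_improvement(o, b, hb) for _, o, b, hb in metrics]
print(f"mean relative improvement: {100 * sum(gains) / len(gains):.1f}%")
```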
Circularity Check
No circularity in the empirical pipeline
Full rationale
The paper presents an empirical hybrid generative framework that decomposes raster designs into editable layers via a VLM for text rendering and multi-branch diffusion for RGBA layers, with ParserReward integrated via Group Relative Policy Optimization for preference alignment. No equations, derivations, or first-principles results are described that reduce by construction to fitted parameters, self-definitions, or self-citation chains. Performance claims rest on experiments against external benchmarks (Parser-40K and Crello), making the approach self-contained without any load-bearing circular reductions.
Axiom & Free-Parameter Ledger
Axioms (1)
- Domain assumption: Graphic designs can be usefully decomposed into independent text, background, and sticker layers.
Invented entities (1)
- ParserReward: no independent evidence.
Reference graph
Works this paper leans on
- [1] T. Seedream, Y. Chen, Y. Gao, L. Gong, M. Guo, Q. Guo, Z. Guo, X. Hou, W. Huang, Y. Huang et al., "Seedream 4.0: Toward next-generation multimodal image generation," arXiv preprint arXiv:2509.20427, 2025.
- [2] H. Cai, S. Cao, R. Du, P. Gao, S. Hoi, S. Huang, Z. Hou, D. Jiang, X. Jin, L. Li et al., "Z-Image: An efficient image generation foundation model with single-stream diffusion transformer," arXiv preprint arXiv:2511.22699, 2025.
- [3] M. Xu, J. Wang, M. Tao, B.-K. Bao, and C. Xu, "CookGALIP: Recipe controllable generative adversarial CLIPs with sequential ingredient prompts for food image generation," IEEE Transactions on Multimedia, 2024.
- [4] X. Jiang, Y. Li, F. Yan, Y. Lu, C. Xu, and M. Xu, "MGDefect: A mask-guided high-quality defect image generation method for improving defect inspection," IEEE Transactions on Multimedia, 2025.
- [5] B. Yuan, Y. Sheng, B.-K. Bao, Y.-P. P. Chen, and C. Xu, "Semantic distance adversarial learning for text-to-image synthesis," IEEE Transactions on Multimedia, vol. 26, pp. 1255–1266, 2023.
- [6] S. Pan, Z. Zhang, K. Wei, X. Yang, and C. Deng, "Few-shot generative model adaptation via style-guided prompt," IEEE Transactions on Multimedia, vol. 26, pp. 7661–7672, 2024.
- [7] H. Nie, Z. Zhang, Y. Cheng, M. Yang, G. Shi, Q. Xie, J. Shao, and X. Wu, "Decomposition of graphic design with unified multimodal model," in Forty-second International Conference on Machine Learning, 2025.
- [8] J. Chen, Z. Wang, N. Zhao, L. Zhang, D. Liu, J. Yang, and Q. Chen, "Rethinking layered graphic design generation with a top-down approach," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2025, pp. 16861–16870.
- [9] T. Suzuki, K.-J. Liu, N. Inoue, and K. Yamaguchi, "LayerD: Decomposing raster graphic designs into layers," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2025, pp. 17783–17792.
- [10] R. Suvorov, E. Logacheva et al., "Resolution-robust large mask inpainting with Fourier convolutions," in WACV, 2022.
- [11] Y. Zhang, Y. Liu, R. Hu, Q. Wu, and J. Zhang, "Mutual dual-task generator with adaptive attention fusion for image inpainting," IEEE Transactions on Multimedia, vol. 26, pp. 1539–1550, 2023.
- [12] J. Huang, P. Yan, J. Cai, J. Liu, Z. Wang, Y. Wang, X. Wu, and G. Li, "DreamLayer: Simultaneous multi-layer generation via diffusion model," in Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2025.
- [13] P.-D. Tudosiu, Y. Yang, S. Zhang, F. Chen, S. McDonagh, G. Lampouras, I. Iacobacci, and S. Parisot, "MuLAn: A multi layer annotated dataset for controllable text-to-image generation," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 22413–22422.
- [14] J. Yang, Q. Liu, Y. Li, S. Y. Kim, D. Pakhomov, M. Ren, J. Zhang, Z. Lin, C. Xie, and Y. Zhou, "Generative image layer decomposition with visual effects," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025.
- [15] J. Tan, J.-M. Lien, and Y. Gingold, "Decomposing images into layers via RGB-space geometry," ACM Transactions on Graphics, vol. 36, no. 1, 2016.
- [16] J. Tan, J. Echevarria, and Y. Gingold, "Efficient palette-based decomposition and recoloring of images via RGBXY-space geometry," ACM Transactions on Graphics, vol. 37, no. 6, 2018.
- [17] Y. Aksoy, T. O. Aydin, A. Smolić, and M. Pollefeys, "Unmixing-based soft color segmentation for image manipulation," ACM Transactions on Graphics, vol. 36, no. 2, 2017.
- [18] N. Akimoto, H. Zhu, Y. Jin, and Y. Aoki, "Fast soft color segmentation," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020.
- [19] Y. Koyama and M. Goto, "Decomposing images into layers with advanced color blending," Computer Graphics Forum, vol. 37, no. 7, 2018.
- [20] D. Horita, K. Aizawa, R. Suzuki, T. Yonetsuji, and H. Zhu, "Fast nonlinear image unblending," in Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 2022.
- [21] L. Yao, J. Han, X. Liang, D. Xu, W. Zhang, Z. Li, and H. Xu, "DetCLIPv2: Scalable open-vocabulary object detection pre-training via word-region alignment," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023.
- [22] A. Kirillov, E. Mintun, N. Ravi, H. Mao, C. Rolland, L. Gustafson, T. Xiao, S. Whitehead, A. C. Berg, W.-Y. Lo et al., "Segment anything," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023.
- [23] R. Ranftl, K. Lasinger, D. Hafner, K. Schindler, and V. Koltun, "Towards robust monocular depth estimation: Mixing datasets for zero-shot cross-dataset transfer," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 44, no. 3, 2022.
- [24] H. Lee and J. Park, "Instance-wise occlusion and depth orders in natural scenes," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022.
- [25] C. Zhang, Y. Hu, H. Ding, H. Shi, Y. Zhao, and Y. Wei, "Learning trimaps via clicks for image matting," IEEE Transactions on Multimedia, 2025.
- [26] A. van den Oord, O. Vinyals, and K. Kavukcuoglu, "Neural discrete representation learning," in Advances in Neural Information Processing Systems, vol. 30, 2017.
- [27] P. Esser, R. Rombach, and B. Ommer, "Taming transformers for high-resolution image synthesis," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 12873–12883.
- [28] E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, W. Chen et al., "LoRA: Low-rank adaptation of large language models," in ICLR, 2022.
- [29] N. Ding, Y. Qin, G. Yang, F. Wei, Z. Yang, Y. Su, S. Hu, Y. Chen, C.-M. Chan, W. Chen et al., "Parameter-efficient fine-tuning of large-scale pre-trained language models," Nature Machine Intelligence, vol. 5, no. 3, pp. 220–235, 2023.
- [30] Z. Sun, S. Shen, S. Cao, H. Liu, C. Li, Y. Shen, C. Gan, L.-Y. Gui, Y.-X. Wang, Y. Yang, K. Keutzer, and T. Darrell, "Aligning large multimodal models with factually augmented RLHF," in Findings of the Association for Computational Linguistics: ACL 2024, 2024, pp. 13088–13110.
- [31] T. Yu, Y. Yao, H. Zhang, T. He, Y. Han, G. Cui, J. Hu, Z. Liu, H.-T. Zheng, M. Sun, and T.-S. Chua, "RLHF-V: Towards trustworthy MLLMs via behavior alignment from fine-grained correctional human feedback," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 13807–13816.
- [32] T. Yu, H. Zhang, Q. Li, Q. Xu, Y. Yao, D. Chen, X. Lu, G. Cui, Y. Dang, T. He, X. Feng, J. Song, B. Zheng, Z. Liu, T.-S. Chua, and M. Sun, "RLAIF-V: Open-source AI feedback leads to super GPT-4V trustworthiness," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025, pp. 19985–19995.
- [33] Y.-F. Zhang, T. Yu, H. Tian, C. Fu, P. Li, J. Zeng, W. Xie, Y. Shi, H. Zhang, J. Wu, X. Wang, Y. Hu, B. Wen, F. Yang, Z. Zhang, T. Gao, D. Zhang, L. Wang, R. Jin, and T. Tan, "MM-RLHF: The next step forward in multimodal LLM alignment," in Proceedings of the 42nd International Conference on Machine Learning, ser. Proceedings of Machine Learning Research, 2025.
- [34] E. Yu, K. Lin, L. Zhao, J. Yin, Y. Wei, Y. Peng, H. Wei, J. Sun, C. Han, Z. Ge, X. Zhang, D. Jiang, J. Wang, and W. Tao, "Perception-R1: Pioneering perception policy with reinforcement learning," in The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025.
- [35] M. Cao, H. Zhao, C. Zhang, X. Chang, I. Reid, and X. Liang, "Ground-R1: Incentivizing grounded visual reasoning via reinforcement learning," arXiv preprint arXiv:2505.20272, 2025.
- [36] X. Li, Z. Yan, D. Meng, L. Dong, X. Zeng, Y. He, Y. Wang, Y. Qiao, Y. Wang, and L. Wang, "VideoChat-R1: Enhancing spatio-temporal perception via reinforcement fine-tuning," in The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025.
- [37] K. Feng, K. Gong, B. Li, Z. Guo, Y. Wang, T. Peng, J. Wu, X. Zhang, B. Wang, and X. Yue, "Video-R1: Reinforcing video reasoning in MLLMs," in The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025. [Online]. Available: https://openreview.net/forum?id=a2JTVVvcEl
- [38] Z. Su, L. Li, M. Song, Y. Hao, Z. Yang, J. Zhang, G. Chen, J. Gu, J. Li, X. Qu, and Y. Cheng, "OpenThinkIMG: Learning to think with images via visual tool reinforcement learning," arXiv preprint arXiv:2505.08617, 2025.
- [39] Y. Jiang, Q. Liu, D. Chen, L. Yuan, and Y. Fu, "AnimeDiff: Customized image generation of anime characters using diffusion model," IEEE Transactions on Multimedia, vol. 26, pp. 10559–10572, 2024.
- [40] Y. Xu, X. Xu, H. Gao, and F. Xiao, "SGDM: An adaptive style-guided diffusion model for personalized text to image generation," IEEE Transactions on Multimedia, vol. 26, pp. 9804–9813, 2024.
- [41] D. Podell, Z. English, K. Lacey, A. Blattmann, T. Dockhorn, J. Müller, J. Penna, and R. Rombach, "SDXL: Improving latent diffusion models for high-resolution image synthesis," in International Conference on Learning Representations, 2024.
- [42] L. Zhang and M. Agrawala, "Transparent image layer diffusion using latent transparency," ACM Transactions on Graphics, vol. 43, no. 4, pp. 1–15, 2024.
- [43] A. Yang, A. Li, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Gao, C. Huang, C. Lv et al., "Qwen3 technical report," arXiv preprint arXiv:2505.09388, 2025.
- [44] K. Yamaguchi, "CanvasVAE: Learning to generate vector graphic documents," in ICCV, 2021.
- [45] T. Ren, S. Liu, A. Zeng, J. Lin, K. Li, H. Cao, J. Chen, X. Huang, Y. Chen, F. Yan et al., "Grounded SAM: Assembling open-world models for diverse visual tasks," arXiv preprint arXiv:2401.14159, 2024.
- [46] OpenAI, "GPT-4 technical report," arXiv preprint arXiv:2303.08774, 2023.