
arxiv: 2312.16476 · v7 · submitted 2023-12-27 · 💻 cs.CV · cs.AI

SVGDreamer: Text Guided SVG Generation with Diffusion Model

Pith reviewed 2026-05-24 05:26 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords text-guided SVG · diffusion models · vector graphics · score distillation · semantic vectorization · particle-based optimization · editability

The pith

SVGDreamer uses semantic decomposition and particle distillation to generate editable and diverse text-guided SVGs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper seeks to improve text-to-SVG synthesis by addressing poor editability, visual quality, and sample diversity in prior work. It establishes a framework that first vectorizes images semantically into foreground and background components using attention controls, then optimizes the vectors as particle distributions with score distillation and aesthetic rewards. If this holds, it would make text-prompted vector graphics practical for design applications where easy modification is essential.

Core claim

SVGDreamer shows that its SIVE process with attention-based primitive control and attention-mask loss, together with VPSD that models SVGs over control points and colors with reward reweighting, leads to vector outputs that outperform baselines in editability, quality, and diversity.

What carries the argument

Semantic-driven image vectorization (SIVE) that separates foreground objects and background with attention mechanisms, combined with Vectorized Particle-based Score Distillation (VPSD) for distributional optimization of vector parameters.

If this is right

  • Vector elements can be edited independently due to the attention-mask loss and primitive control.
  • Shapes avoid over-smoothing and colors avoid over-saturation through particle-based modeling.
  • Generation converges more quickly when particles are reweighted by a reward model.
  • Diversity of outputs increases because SVGs are treated as distributions rather than single optimizations.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • If the semantic split works well, the method could extend to generating complex multi-object scenes with consistent style.
  • Design tools might incorporate this to allow prompt-based starting points for vector editing sessions.
  • Testing on prompts involving fine details like text in icons could reveal limits of the current attention control.

Load-bearing premise

Attention-based primitive control combined with an attention-mask loss enables fine-grained independent manipulation of individual vector elements without artifacts or loss of global coherence.
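One way to picture an attention-mask loss of this kind is sketched below. It is a minimal illustration under stated assumptions, not the paper's implementation: the min-max normalization, the 0.5 threshold, and the names `normalize`, `binarize`, and `masked_mse` are all introduced here for illustration. The idea shown is that a soft attention map is turned into a hard mask, and the reconstruction error is counted only inside that mask, so each object's paths are fitted against its own region.

```python
from typing import List

Grid = List[List[float]]


def normalize(attn: Grid) -> Grid:
    """Min-max normalize an attention map to the range [0, 1]."""
    flat = [v for row in attn for v in row]
    lo, hi = min(flat), max(flat)
    span = (hi - lo) or 1.0  # avoid division by zero on a constant map
    return [[(v - lo) / span for v in row] for row in attn]


def binarize(attn: Grid, threshold: float = 0.5) -> Grid:
    """Turn a soft attention map into a hard 0/1 mask (threshold is an assumption)."""
    return [[1.0 if v >= threshold else 0.0 for v in row] for row in normalize(attn)]


def masked_mse(rendered: Grid, target: Grid, mask: Grid) -> float:
    """Sum of squared differences, counted only where the mask is 1."""
    total = 0.0
    for r_row, t_row, m_row in zip(rendered, target, mask):
        for r, t, m in zip(r_row, t_row, m_row):
            total += (m * r - m * t) ** 2
    return total


# Toy 2x2 example: attention is high in the right column only.
attn = [[0.1, 0.9], [0.2, 0.8]]
mask = binarize(attn)
loss = masked_mse([[1.0, 1.0], [1.0, 1.0]], [[0.0, 0.5], [0.0, 1.0]], mask)
```

In this toy case the left column is ignored entirely, which is also where the premise could fail in practice: if the soft attention map bleeds across object boundaries, the hard mask inherits that bleed.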

What would settle it

A direct comparison experiment where SVGDreamer SVGs do not score higher on editability measures or diversity metrics than the baselines would falsify the central performance claim.
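A diversity comparison of this kind needs a concrete score. The sketch below shows one simple candidate: the mean per-dimension variance of feature vectors extracted from a set of generated samples. The feature extractor is left out and the function name `feature_variance` is hypothetical; which metric the paper actually reports is not stated in the text above.

```python
from statistics import pvariance
from typing import Sequence


def feature_variance(features: Sequence[Sequence[float]]) -> float:
    """Mean per-dimension population variance across samples.

    `features` holds one feature vector per generated sample; how those
    vectors are obtained (e.g. from an image encoder) is an assumption
    left outside this sketch.
    """
    if len(features) < 2:
        return 0.0
    dims = list(zip(*features))  # transpose: one tuple per feature dimension
    return sum(pvariance(d) for d in dims) / len(dims)


identical = [[1.0, 2.0], [1.0, 2.0]]  # no diversity at all
spread = [[0.0, 0.0], [2.0, 4.0]]     # samples differ in both dimensions
scores = (feature_variance(identical), feature_variance(spread))
```

A higher score for one method over another, on the same prompts and seeds, would count as evidence for the diversity claim; equal or lower scores would count against it.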

Figures

Figures reproduced from arXiv: 2312.16476 by Chuang Wang, Dong Xu, Haitao Zhou, Jing Zhang, Qian Yu, Ximing Xing.

Figure 1: Given a text prompt, SVGDreamer can generate a variety of vector graphics. SVGDreamer is a versatile tool that can work with …
Figure 2: Overview of SVGDreamer. The method consists of two parts: semantic-driven image vectorization (SIVE, Sec. 3.1) and SVG synthesis through VPSD optimization (Sec. 3.2). The result obtained from SIVE can be used as input of VPSD for further refinement.
  nearby text: 3.1.2 Semantic-aware Optimization. In this stage, we utilize an attention-based mask loss to separately optimize the objects in the foreground and background. …
Figure 3: The process of Vectorized Particle-based Score Distillation. VPSD allows k SVGs as input and simultaneously optimizes k sets of SVG parameters.
  nearby text: … estimated by,
  $$\nabla_\theta \mathcal{L}_{\mathrm{SDS}}(\phi, x = \mathcal{R}(\theta)) \triangleq \mathbb{E}_{t,\epsilon,a}\left[ w(t)\,\bigl(\epsilon_\phi(z_t; y, t) - \epsilon\bigr)\,\frac{\partial z}{\partial x_a}\,\frac{\partial x_a}{\partial \theta} \right] \qquad (3)$$
  where $w(t)$ is the weighting function. And noised to form $z_t = \alpha_t x_a + \sigma_t \epsilon$. Unfortunately, SDS-based methods often suffer from issues such as shape over-smoothing, color over-s…
Figure 4: Qualitative comparison of different methods. Note that DiffSketcher was originally designed for vector sketch generation; therefore, we re-implemented it to generate RGB vector graphics.
  nearby text: This style allows for a wide range of compositions while maintaining a minimalistic expression. We utilize closed-form Bézier curves with trainable control points and fill colors. 2) Sketch is a way to convey informati…
Figure 5: Examples of vector assets created by SVGDreamer. We specify foreground content as an SVG asset through a text prompt.
  nearby text: To create assets that fit the SVG style, such as flat polygon vector, we constrain the vector representation via using a different prompt modifier to encourage the appropriate style: * ... on a white background, full body action pose, complete body, concept art, flat 2d vector icon. LIVE (G…
Figure 6: Comparison of LIVE vectorization with SIVE. In the first row, “Foreground 1” and “Foreground 2” refer to Astronaut and Plants, respectively. Glyphs have been added manually and were not produced by our method.
  nearby text: In the LIVE setup, we follow the protocol outlined in VectorFusion [12], which represents a vector image with 128 paths distributed across four layers, with 32 paths in each layer. … hierarchies acros…
Figure 7: Examples showcasing the editability of the results generated by our SVGDreamer.
Figure 8: More results generated by our SVGDreamer. The style is governed by vector primitives.
Figure 9: Comparison of synthetic posters generated by different methods. The input text prompts and glyphs to be added to the posters are displayed on the left side.
  panel text: “An astronaut, the logo, vector art.”; “Bold logo icon in blue, black, white colors for a simplified version of Great Wave of Kanagawa”; glyph “Temple”; “The logo of the Japanese mystery temple,, game art, cartoon, 3d animation style”; “A Starbucks coffee c…
Figure 10: Examples of synthetic icons. Note that the glyphs are manually added.
  panel text: “A man in an astronaut suit walking ……”; “A beautiful photo of the Eiffel Tower”
Figure 11: Visualizations of the LDM cross-attention maps.
  nearby text: …timization stability. Note that SDS-based methods [12, 48] do not work well in such small CFG weights. Instead, our VPSD provides a trade-off option between CFG weight and diversity, and it can generate more diverse results by simply setting a smaller CFG. E.2. Ablation on ReFL. In [45], only selected particles update the LoRA network in each iteration. Howev…
Figure 12: Ablation on how the Classifier-free Guidance (CFG) [7] weight affects the randomness. Smaller CFG provides more diversity. But too small CFG provides less optimization stability. The prompt is “A photograph of an astronaut riding a horse”.
  nearby text: … and analyze how this variation affects the outcomes. The CFG of VPSD is set as 7.5. As shown in …
Figure 13: Effect of the Reward Feedback Learning (ReFL). When employing ReFL, the visual quality of the generated results is significantly enhanced.
Figure 14: Ablation on the number of particles. The diversity of the generated results is slightly larger as the number of particles increases. The quality of generated results is not significantly affected by the number of particles. The prompt is “A photograph of an astronaut riding a horse”.
  nearby text: …erated results is not significantly affected by the number of particles. Considering the high computation overhead associa…
Figure 15: Effect of the number of paths. Adding vector paths can be synthesized to enhance SVG detail.
Figure 16: 2D image synthesis. Comparison of the results from using VPSD and VSD for 2D image synthesis.
  nearby text: … like water reflections. Additionally, VPSD better aligns with text prompts. F. VPSD for 2D Image Synthesis. In this work, VPSD is specifically designed for text-to-SVG generation; however, it can also be adapted for 2D image synthesis. As illustrated in …
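The score-distillation quantities quoted alongside Figure 3 (Equation 3) and the CFG weight discussed with Figure 12 can be made concrete with a small numeric sketch. It is only an illustration: vectors are plain Python lists, the noise predictions are supplied by hand instead of coming from a diffusion model, and the renderer Jacobian from Equation (3) is omitted. The function names are hypothetical. The guidance rule uses the convention common in implementations, where a scale of 1 returns the conditional prediction unchanged.

```python
from typing import List

Vec = List[float]


def add_noise(x: Vec, eps: Vec, alpha_t: float, sigma_t: float) -> Vec:
    """Forward noising: z_t = alpha_t * x + sigma_t * eps."""
    return [alpha_t * xi + sigma_t * ei for xi, ei in zip(x, eps)]


def cfg_noise(eps_uncond: Vec, eps_cond: Vec, scale: float) -> Vec:
    """Classifier-free guidance: eps_uncond + scale * (eps_cond - eps_uncond)."""
    return [u + scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]


def sds_residual(eps_pred: Vec, eps: Vec, w_t: float) -> Vec:
    """Per-element SDS term w(t) * (eps_pred - eps).

    In the full gradient this residual is multiplied by the Jacobian of the
    (differentiable) renderer with respect to the SVG parameters; that step
    is not modeled here.
    """
    return [w_t * (p - e) for p, e in zip(eps_pred, eps)]


# Toy walk-through with hand-picked numbers (not model outputs).
eps = [0.25]
z_t = add_noise([1.0], eps, alpha_t=0.5, sigma_t=0.5)
eps_hat = cfg_noise(eps_uncond=[0.0], eps_cond=[1.0], scale=7.5)
residual = sds_residual(eps_hat, eps, w_t=1.0)
```

With a scale of 7.5 the guided prediction is pushed well past the conditional one, which is consistent with the captions' remark that smaller CFG weights leave more room for diversity.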
Original abstract

Text-guided scalable vector graphics (SVG) synthesis has broad applications in icon and sketch generation. However, existing text-to-SVG methods often suffer from limited editability, suboptimal visual quality, and low sample diversity. To address these challenges, we propose SVGDreamer, a novel framework for text-guided vector graphics synthesis. Our method introduces a semantic-driven image vectorization (SIVE) process, which decomposes the generation procedure into foreground objects and background elements, thereby improving structural controllability and editability. In particular, SIVE incorporates attention-based primitive control and an attention-mask loss to facilitate fine-grained manipulation of individual vector elements. To further improve generation quality and diversity, we propose Vectorized Particle-based Score Distillation (VPSD), which models SVGs as distributions over control points and colors. Compared with existing text-to-SVG optimization methods, VPSD alleviates over-smoothed shapes, over-saturated colors, limited diversity, and slow convergence. Moreover, VPSD leverages a reward model to reweight vector particles, leading to better visual aesthetics and faster convergence. Extensive experiments demonstrate that SVGDreamer consistently outperforms existing baselines in editability, visual quality, and diversity. Project page: https://ximinng.github.io/SVGDreamer-project/
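To make "distributions over control points and colors" more tangible, the sketch below represents each particle as one complete set of SVG parameters and keeps k of them side by side. It is an assumption-laden illustration: the `Path` dataclass, the use of a single cubic Bézier segment per path, and the random initialization in `init_particles` are choices made here, not details taken from the paper.

```python
import random
from dataclasses import dataclass
from typing import List, Tuple

Point = Tuple[float, float]
RGBA = Tuple[float, float, float, float]


@dataclass
class Path:
    control_points: List[Point]  # four points define one cubic Bezier segment
    fill: RGBA                   # fill color, each channel in [0, 1]


def cubic_bezier(p0: Point, p1: Point, p2: Point, p3: Point, t: float) -> Point:
    """Evaluate a cubic Bezier segment at parameter t in [0, 1]."""
    s = 1.0 - t
    b0, b1, b2, b3 = s**3, 3 * s**2 * t, 3 * s * t**2, t**3
    x = b0 * p0[0] + b1 * p1[0] + b2 * p2[0] + b3 * p3[0]
    y = b0 * p0[1] + b1 * p1[1] + b2 * p2[1] + b3 * p3[1]
    return (x, y)


def init_particles(k: int, paths_per_svg: int, seed: int = 0) -> List[List[Path]]:
    """Create k independently initialized SVG parameter sets (the particles)."""
    rng = random.Random(seed)

    def random_path() -> Path:
        pts = [(rng.random(), rng.random()) for _ in range(4)]
        rgba = (rng.random(), rng.random(), rng.random(), 1.0)
        return Path(pts, rgba)

    return [[random_path() for _ in range(paths_per_svg)] for _ in range(k)]


particles = init_particles(k=4, paths_per_svg=8)
midpoint = cubic_bezier((0.0, 0.0), (1.0, 1.0), (2.0, 1.0), (3.0, 0.0), 0.5)
```

Optimizing all k parameter sets together, instead of a single one, is what lets the method treat the output as a sample from a distribution. A closed shape would chain several such segments end to end; one segment per path keeps the example short.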

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript presents SVGDreamer, a framework for text-guided SVG generation. It introduces a semantic-driven image vectorization (SIVE) process that decomposes generation into foreground objects and background elements, incorporating attention-based primitive control and an attention-mask loss to improve structural controllability and editability. It further proposes Vectorized Particle-based Score Distillation (VPSD), which models SVGs as distributions over control points and colors and uses a reward model to reweight particles for improved quality, diversity, and convergence. The central claim is that extensive experiments demonstrate consistent outperformance over existing baselines in editability, visual quality, and diversity.

Significance. If the results hold, the work would advance text-to-SVG synthesis by improving fine-grained editability and sample diversity, with applications in icon and sketch generation. The combination of semantic decomposition via SIVE and particle-based optimization in VPSD is a novel direction that builds directly on external diffusion models without self-referential parameter fitting.

major comments (2)
  1. [SIVE process description] The headline claim of superior editability rests on the SIVE process's attention-based primitive control and attention-mask loss enabling independent manipulation of individual vector elements. The manuscript provides no analysis or evidence that this loss is strong enough to overcome the typically soft, spatially extended nature of diffusion attention maps and prevent cross-talk between primitives while preserving global coherence.
  2. [Experimental results] The abstract asserts outperformance on editability, quality, and diversity, yet the provided text contains no quantitative metrics, baseline comparisons, or ablation results to support these claims; without such data the central experimental superiority cannot be verified.
minor comments (1)
  1. [VPSD optimization] The description of VPSD as modeling SVGs as distributions over control points and colors would benefit from an explicit equation or pseudocode definition to clarify the particle reweighting step.
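As an illustration of what such pseudocode might look like, the sketch below reweights per-particle gradients by a softmax over reward scores. This is an assumption made for illustration only: the text reviewed here does not state the paper's reweighting rule, and the softmax, the `temperature` parameter, and both function names are hypothetical.

```python
import math
from typing import List


def reward_weights(rewards: List[float], temperature: float = 1.0) -> List[float]:
    """Softmax over reward scores; a higher reward yields a larger weight."""
    m = max(rewards)  # subtract the max for numerical stability
    exps = [math.exp((r - m) / temperature) for r in rewards]
    total = sum(exps)
    return [e / total for e in exps]


def reweighted_update(
    grads: List[List[float]], rewards: List[float], temperature: float = 1.0
) -> List[List[float]]:
    """Scale each particle's gradient by k * weight.

    The factor k keeps the update unchanged when all rewards are equal,
    so the reward only redistributes emphasis among particles.
    """
    k = len(grads)
    weights = reward_weights(rewards, temperature)
    return [[k * w * g for g in grad] for w, grad in zip(weights, grads)]


# Two particles with equal rewards: gradients pass through unchanged.
unchanged = reweighted_update([[1.0, -2.0], [3.0, 0.5]], rewards=[0.0, 0.0])
# Three particles with increasing rewards: weights increase accordingly.
weights = reward_weights([1.0, 2.0, 3.0])
```

An explicit definition along these lines in the manuscript would let readers check whether low-reward particles are merely down-weighted or effectively dropped.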

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on SVGDreamer. We address each major comment below with targeted responses and planned revisions to strengthen the manuscript.

Point-by-point responses
  1. Referee: [SIVE process description] The headline claim of superior editability rests on the SIVE process's attention-based primitive control and attention-mask loss enabling independent manipulation of individual vector elements. The manuscript provides no analysis or evidence that this loss is strong enough to overcome the typically soft, spatially extended nature of diffusion attention maps and prevent cross-talk between primitives while preserving global coherence.

    Authors: We acknowledge the absence of a dedicated quantitative analysis of cross-talk in the current manuscript. The attention-mask loss is explicitly designed to align rendered primitive masks with diffusion attention maps, and the semantic decomposition in SIVE further localizes control. Qualitative editing results demonstrate independent manipulation with minimal visible interference. In revision we will add a new analysis subsection with metrics (e.g., mask overlap ratios before/after editing) and discussion of how the loss interacts with soft attention maps while maintaining coherence. revision: yes

  2. Referee: [Experimental results] The abstract asserts outperformance on editability, quality, and diversity, yet the provided text contains no quantitative metrics, baseline comparisons, or ablation results to support these claims; without such data the central experimental superiority cannot be verified.

    Authors: The experiments section (Section 4) of the full manuscript contains quantitative evaluations, including user-study scores for editability, diversity measured via feature variance, and visual-quality comparisons against baselines such as VectorFusion and DiffSketcher, plus ablations on SIVE and VPSD. If these elements were not apparent in the reviewed copy, we will expand the section with additional tables, statistical significance tests, and clearer baseline descriptions to make the supporting data unambiguous. revision: yes
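The mask overlap ratio proposed in the first response could be computed along the following lines. This is a minimal sketch assuming the ratio is an intersection-over-union between the rasterized masks of two vector elements; the function name `overlap_ratio` and the nested-list mask format are hypothetical.

```python
from typing import List

Mask = List[List[int]]


def overlap_ratio(a: Mask, b: Mask) -> float:
    """Intersection-over-union of two binary masks of equal shape."""
    inter = 0
    union = 0
    for row_a, row_b in zip(a, b):
        for va, vb in zip(row_a, row_b):
            inter += 1 if (va and vb) else 0
            union += 1 if (va or vb) else 0
    return inter / union if union else 0.0


# Masks of two elements before an edit, and after moving element B away.
element_a = [[1, 1], [0, 0]]
element_b_before = [[1, 0], [0, 0]]
element_b_after = [[0, 0], [0, 1]]

before = overlap_ratio(element_a, element_b_before)
after = overlap_ratio(element_a, element_b_after)
```

Reporting this ratio for every pair of elements before and after an edit would give the cross-talk question a number: values near zero for unrelated elements support the independence claim.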

Circularity Check

0 steps flagged

No circularity detected in derivation or claims

full rationale

The paper introduces SVGDreamer via two new components (SIVE with attention-based primitive control and attention-mask loss; VPSD with particle-based score distillation and reward reweighting) that are described as novel constructions building on external diffusion models. No equations, fitted parameters, or self-citations are presented that reduce the claimed editability/quality/diversity gains to quantities defined by the authors' own inputs or prior work. The experimental comparisons to baselines are external and falsifiable, leaving the central claims self-contained rather than tautological.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 2 invented entities

The central claims rest on the untested effectiveness of the newly introduced attention-mask loss and particle reweighting; no external benchmarks or prior proofs are cited for these mechanisms.

invented entities (2)
  • SIVE process · no independent evidence
    purpose: Decompose SVG generation into foreground objects and background elements for structural controllability
    Newly proposed decomposition step with no independent prior evidence supplied in the abstract.
  • VPSD optimization · no independent evidence
    purpose: Model SVGs as distributions over control points and colors to reduce over-smoothing and improve diversity
    New particle-based score distillation variant introduced without external validation in the abstract.

pith-pipeline@v0.9.0 · 5773 in / 1185 out tokens · 21597 ms · 2026-05-24T05:26:53.342007+00:00 · methodology


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Voxify3D: Pixel Art Meets Volumetric Rendering

    cs.CV · 2025-12 · unverdicted · novelty 7.0

    Voxify3D generates voxel art from 3D meshes via orthographic pixel supervision, patch-based CLIP alignment, and palette-constrained Gumbel-Softmax quantization, achieving 37.12 CLIP-IQA and 77.90% user preference.

Reference graph

Works this paper leans on

50 extracted references · 50 canonical work pages · cited by 1 Pith paper · 2 internal anchors

  1. [1]

    Deepsvg: A hierarchical generative network for vector graphics animation

    Alexandre Carlier, Martin Danelljan, Alexandre Alahi, and Radu Timofte. Deepsvg: A hierarchical generative network for vector graphics animation. Advances in Neural Information Processing Systems (NIPS), 33:16351–16361, 2020.

  2. [2]

    Textdiffuser: Diffusion models as text painters

    Jingye Chen, Yupan Huang, Tengchao Lv, Lei Cui, Qifeng Chen, and Furu Wei. Textdiffuser: Diffusion models as text painters. arXiv preprint arXiv:2305.10855, 2023.

  3. [3]

    Taming transformers for high-resolution image synthesis

    Patrick Esser, Robin Rombach, and Björn Ommer. Taming transformers for high-resolution image synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 12873–12883, 2021.

  4. [4]

    CLIPDraw: Exploring text-to-drawing synthesis through language-image encoders

    Kevin Frans, Lisa Soros, and Olaf Witkowski. CLIPDraw: Exploring text-to-drawing synthesis through language-image encoders. In Advances in Neural Information Processing Systems (NIPS), 2022.

  5. [5]

    A neural representation of sketch drawings

    David Ha and Douglas Eck. A neural representation of sketch drawings. In International Conference on Learning Representations (ICLR), 2018.

  6. [6]

    Gans trained by a two time-scale update rule converge to a local nash equilibrium

    Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium. Advances in Neural Information Processing Systems (NIPS), 30, 2017.

  7. [7]

    Classifier-Free Diffusion Guidance

    Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance. arXiv preprint arXiv:2207.12598, 2022.

  8. [8]

    Denoising diffusion probabilistic models

    Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. In Advances in Neural Information Processing Systems (NIPS), pages 6840–6851, 2020.

  9. [9]

    Image quality metrics: PSNR vs. SSIM

    Alain Horé and Djemel Ziou. Image quality metrics: PSNR vs. SSIM. In 2010 20th International Conference on Pattern Recognition, pages 2366–2369, 2010.

  10. [10]

    LoRA: Low-rank adaptation of large language models

    Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models. In International Conference on Learning Representations (ICLR),

  11. [11]

    Word-as-image for semantic typography

    Shir Iluz, Yael Vinker, Amir Hertz, Daniel Berio, Daniel Cohen-Or, and Ariel Shamir. Word-as-image for semantic typography. ACM Transactions on Graphics (TOG), 42(4),

  12. [12]

    Vectorfusion: Text-to-svg by abstracting pixel-based diffusion models

    Ajay Jain, Amber Xie, and Pieter Abbeel. Vectorfusion: Text-to-svg by abstracting pixel-based diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023.

  13. [13]

    Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation

    Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In International Conference on Machine Learning (ICML), pages 12888–12900. PMLR, 2022.

  14. [14]

    Differentiable vector graphics rasterization for editing and learning

    Tzu-Mao Li, Michal Lukáč, Michaël Gharbi, and Jonathan Ragan-Kelley. Differentiable vector graphics rasterization for editing and learning. ACM Transactions on Graphics (TOG), 39(6):193:1–193:15, 2020.

  15. [15]

    Magic3d: High-resolution text-to-3d content creation

    Chen-Hsuan Lin, Jun Gao, Luming Tang, Towaki Takikawa, Xiaohui Zeng, Xun Huang, Karsten Kreis, Sanja Fidler, Ming-Yu Liu, and Tsung-Yi Lin. Magic3d: High-resolution text-to-3d content creation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 300–309, 2023.

  16. [16]

    A learned representation for scalable vector graphics

    Raphael Gontijo Lopes, David Ha, Douglas Eck, and Jonathon Shlens. A learned representation for scalable vector graphics. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2019.

  17. [17]

    Towards layer-wise image vectorization

    Xu Ma, Yuqian Zhou, Xingqian Xu, Bin Sun, Valerii Filev, Nikita Orlov, Yun Fu, and Humphrey Shi. Towards layer-wise image vectorization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 16314–16323, 2022.

  18. [18]

    Nerf: Representing scenes as neural radiance fields for view synthesis

    Ben Mildenhall, Pratul P Srinivasan, Matthew Tancik, Jonathan T Barron, Ravi Ramamoorthi, and Ren Ng. Nerf: Representing scenes as neural radiance fields for view synthesis. Communications of the ACM, 65(1):99–106, 2021.

  19. [19]

    Clip-clop: Clip-guided collage and photomontage

    Piotr Mirowski, Dylan Banarse, Mateusz Malinowski, Simon Osindero, and Chrisantha Fernando. Clip-clop: Clip-guided collage and photomontage. arXiv preprint arXiv:2205.03146, 2022.

  20. [20]

    GLIDE: Towards photorealistic image generation and editing with text-guided diffusion models

    Alexander Quinn Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob Mcgrew, Ilya Sutskever, and Mark Chen. GLIDE: Towards photorealistic image generation and editing with text-guided diffusion models. In Proceedings of the 39th International Conference on Machine Learning (ICML), pages 16784–16804, 2022.

  21. [21]

    Do 2d GANs know 3d shape? Unsupervised 3d shape reconstruction from 2d image GANs

    Xingang Pan, Bo Dai, Ziwei Liu, Chen Change Loy, and Ping Luo. Do 2d GANs know 3d shape? Unsupervised 3d shape reconstruction from 2d image GANs. In International Conference on Learning Representations (ICLR), 2021.

  22. [22]

    Dreamfusion: Text-to-3d using 2d diffusion

    Ben Poole, Ajay Jain, Jonathan T. Barron, and Ben Mildenhall. Dreamfusion: Text-to-3d using 2d diffusion. In The Eleventh International Conference on Learning Representations (ICLR), 2023.

  23. [23]

    Learning transferable visual models from natural language supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning (ICML), pages 8748–8763. PMLR, 2021.

  24. [24]

    Hierarchical Text-Conditional Image Generation with CLIP Latents

    Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125,

  25. [25]

    Im2vec: Synthesizing vector graphics without vector supervision

    Pradyumna Reddy, Michael Gharbi, Michal Lukac, and Niloy J Mitra. Im2vec: Synthesizing vector graphics without vector supervision. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 7342–7351, 2021.

  26. [26]

    High-resolution image synthesis with latent diffusion models

    Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 10684–10695, 2022.

  27. [27]

    Photorealistic text-to-image diffusion models with deep language understanding

    Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily L Denton, Kamyar Ghasemipour, Raphael Gontijo Lopes, Burcu Karagol Ayan, Tim Salimans, et al. Photorealistic text-to-image diffusion models with deep language understanding. In Advances in Neural Information Processing Systems (NIPS), pages 36479–36494, 2022.

  28. [28]

    Styleclipdraw: Coupling content and style in text-to-drawing synthesis

    Peter Schaldenbrand, Zhixuan Liu, and Jean Oh. Styleclipdraw: Coupling content and style in text-to-drawing synthesis. arXiv preprint arXiv:2111.03133, 2022.

  29. [29]

    Improved aesthetic predictor

    Christoph Schuhmann. Improved aesthetic predictor. https://github.com/christophschuhmann/improved-aesthetic-predictor, 2022.

  30. [30]

    Clipgen: A deep generative model for clipart vectorization and synthesis

    I-Chao Shen and Bing-Yu Chen. Clipgen: A deep generative model for clipart vectorization and synthesis. IEEE Transactions on Visualization and Computer Graphics, 28(12):4211–4224, 2022.

  31. [31]

    Deep unsupervised learning using nonequilibrium thermodynamics

    Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. In Proceedings of the International Conference on Machine Learning (ICML), pages 2256–2265, 2015.

  32. [32]

    Denoising diffusion implicit models

    Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. In International Conference on Learning Representations (ICLR), 2021.

  33. [33]

    Generative modeling by estimating gradients of the data distribution

    Yang Song and Stefano Ermon. Generative modeling by estimating gradients of the data distribution. In Advances in Neural Information Processing Systems (NIPS), 2019.

  34. [34]

    Clipfont: Text guided vector wordart generation

    Yiren Song and Yuxuan Zhang. Clipfont: Text guided vector wordart generation. In 33rd British Machine Vision Conference 2022, BMVC 2022, London, UK, November 21-24, 2022.

  35. [35]

    Score-based generative modeling through stochastic differential equations

    Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. In International Conference on Learning Representations (ICLR), 2021.

  36. [36]

    Clipvg: Text-guided image manipulation using differentiable vector graphics

    Yiren Song, Xuning Shao, Kang Chen, Weidong Zhang, Zhongliang Jing, and Minzhe Li. Clipvg: Text-guided image manipulation using differentiable vector graphics. In Proceedings of the Conference on Artificial Intelligence (AAAI),

  37. [37]

    If by deepfloyd lab at stabilityai

    StabilityAI. If by deepfloyd lab at stabilityai. https://github.com/deep-floyd/IF, 2023.

  38. [38]

    Marvel: Raster gray-level manga vectorization via primitive-wise deep reinforcement learning

    Hao Su, Xuefeng Liu, Jianwei Niu, Jiahe Cui, Ji Wan, Xinghao Wu, and Nana Wang. Marvel: Raster gray-level manga vectorization via primitive-wise deep reinforcement learning. IEEE Transactions on Circuits and Systems for Video Technology (T-CSVT), 2023.

  39. [39]

    Modern evolution strategies for creativity: Fitting concrete images and abstract concepts

    Yingtao Tian and David Ha. Modern evolution strategies for creativity: Fitting concrete images and abstract concepts. In Artificial Intelligence in Music, Sound, Art and Design, pages 275–291. Springer, 2022.

  40. [40]

    Clipasso: Semantically-aware object sketching

    Yael Vinker, Ehsan Pajouheshgar, Jessica Y Bo, Roman Christian Bachmann, Amit Haim Bermano, Daniel Cohen-Or, Amir Zamir, and Ariel Shamir. Clipasso: Semantically-aware object sketching. ACM Transactions on Graphics (TOG), 41(4):1–11, 2022.

  41. [41]

    Clipascene: Scene sketching with different types and levels of abstraction

    Yael Vinker, Yuval Alaluf, Daniel Cohen-Or, and Ariel Shamir. Clipascene: Scene sketching with different types and levels of abstraction. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 4146–4156, 2023.

  42. [42]

    Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation

    Haochen Wang, Xiaodan Du, Jiahao Li, Raymond A. Yeh, and Greg Shakhnarovich. Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 12619–12629, 2023.

  43. [43]

    Deepvecfont: Synthesizing high-quality vector fonts via dual-modality learning

    Yizhi Wang and Zhouhui Lian. Deepvecfont: Synthesizing high-quality vector fonts via dual-modality learning. ACM Transactions on Graphics (TOG), 40(6), 2021.

  44. [44]

    Aesthetic text logo synthesis via content-aware layout inferring

    Yizhi Wang, Gu Pu, Wenhan Luo, Pengfei Wang, Yexin ans Xiong, Hongwen Kang, Zhonghao Wang, and Zhouhui Lian. Aesthetic text logo synthesis via content-aware layout inferring. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022.

  45. [45]

    Prolificdreamer: High-fidelity and diverse text-to-3d generation with variational score distillation

    Zhengyi Wang, Cheng Lu, Yikai Wang, Fan Bao, Chongxuan Li, Hang Su, and Jun Zhu. Prolificdreamer: High-fidelity and diverse text-to-3d generation with variational score distillation. arXiv preprint arXiv:2305.16213, 2023.

  46. [46]

    Iconshop: Text-based vector icon synthesis with autoregressive transformers

    Ronghuan Wu, Wanchao Su, Kede Ma, and Jing Liao. Iconshop: Text-based vector icon synthesis with autoregressive transformers. arXiv preprint arXiv:2304.14400, 2023.

  47. [47]

    Human preference score: Better aligning text-to-image models with human preference

    Xiaoshi Wu, Keqiang Sun, Feng Zhu, Rui Zhao, and Hongsheng Li. Human preference score: Better aligning text-to-image models with human preference. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 2096–2105, 2023.

  48. [48]

    Diffsketcher: Text guided vector sketch synthesis through latent diffusion models

    Ximing Xing, Chuang Wang, Haitao Zhou, Jing Zhang, Qian Yu, and Dong Xu. Diffsketcher: Text guided vector sketch synthesis through latent diffusion models. In Advances in Neural Information Processing Systems (NIPS), 2023.

  49. [49]

    Imagereward: Learning and evaluating human preferences for text-to-image generation

    Jiazheng Xu, Xiao Liu, Yuchen Wu, Yuxuan Tong, Qinkai Li, Ming Ding, Jie Tang, and Yuxiao Dong. Imagereward: Learning and evaluating human preferences for text-to-image generation, 2023.

  50. [50]

    Glyphcontrol: Glyph conditional control for visual text generation

    Yukang Yang, Dongnan Gui, Yuhui Yuan, Haisong Ding, Han Hu, and Kai Chen. Glyphcontrol: Glyph conditional control for visual text generation. 2023.