pith. machine review for the scientific record. sign in

arxiv: 2604.17054 · v1 · submitted 2026-04-18 · 💻 cs.CV · cs.AI

Recognition: unknown

mEOL: Training-Free Instruction-Guided Multimodal Embedder for Vector Graphics and Image Retrieval

Authors on Pith no claims yet

Pith reviewed 2026-05-10 06:43 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords multimodal embeddingsSVGvector graphicstraining-freeMLLMretrievalinstruction-guided
0
0 comments X

The pith

A training-free method aligns embeddings for text, images, and SVG code by instructing a multimodal LLM to summarize each input as a single token.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces mEOL, a framework that repurposes a multimodal large language model to create embeddings for different modalities without any training. It uses specific instructions to force the model to condense text, image, or SVG inputs into one token, taking the hidden state as the embedding. For SVGs, an additional step rewrites the code by reasoning about the rendered image to add meaningful names and simplify structure. This setup is tested on a new text-to-SVG retrieval benchmark built from VGBench, where it outperforms both encoder-based and training-based methods.

Core claim

The authors claim that modality-specific instructions combined with one-word limitation in an MLLM produce hidden states that serve as aligned embeddings for text, raster images, and SVG code, and that semantic rewriting of SVG using visual reasoning further improves structure awareness, leading to better retrieval performance than trained alternatives on the introduced benchmark.

What carries the argument

The Multimodal Explicit One-word Limitation (mEOL) mechanism, which directs the MLLM to produce a single token summary for any input modality, with its hidden state used as the embedding; supported by a semantic SVG rewriting module that uses visual reasoning to expose geometric cues.

If this is right

  • Embeddings for SVGs can be generated without rasterization or specialized training.
  • Prompt control replaces the need for learned projection layers in multimodal alignment.
  • The first dedicated benchmark for text-to-SVG retrieval is established and solved effectively by this method.
  • Structured code like SVG benefits from explicit rewriting based on visual interpretation.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar single-token prompting might enable zero-shot alignment in other domains such as 3D models or molecular structures.
  • This could make multimodal retrieval more accessible by eliminating the need for large training datasets.
  • The method implies that MLLMs' pre-trained knowledge suffices for cross-modal semantic matching when guided properly.

Load-bearing premise

That the hidden state corresponding to the single summary token generated under modality-specific instructions carries semantic meaning that is consistent and aligned across text, images, and SVG representations without further training.

What would settle it

Demonstrating that the proposed embeddings do not achieve higher retrieval accuracy than the encoder-based and training-based baselines on the text-to-SVG benchmark would falsify the superiority claim.

Figures

Figures reproduced from arXiv: 2604.17054 by Baek Seong-Eun, Kyeong Seon Kim, Lee Jung-Mok, Tae-Hyun Oh.

Figure 1
Figure 1. Figure 1: Overview of Our Training-free Multimodal Embedding Method. (a) The semantic SVG module rewrites raw SVG code by assigning meaningful identifiers and simplifying structure through MLLM-based visual reasoning. (b) Multimodal Explicit One-word Limitation (mEOL) prompts instruct the MLLM to summarize text, images, and SVG code into a single-token embedding, placing all modalities in an aligned space. (c) The r… view at source ↗
Figure 2
Figure 2. Figure 2: Components of Our Training-free Multimodal Embedding pipeline. (a) Given text, image, or SVG inputs, each input is paired with a multimodal EOL (mEOL): Semantic EOL Prompt for text and Visual EOL Prompt for images. Then, our training-free MLLM embedder generates an aligned embedding. The embedding is extracted from the hidden state of the final output token [E] at the penultimate layer. (b) Before SVG embe… view at source ↗
Figure 3
Figure 3. Figure 3: Qualitative Examples of Text-to-Image, Text-to-Image with SVG Code Retrieval. The icon with the yellow border represents the target image. Given a text query, each model retrieves images ranked from 1st to 5th based on cosine similarity. Compared with CLIP and LLaVE, our methods retrieved more semantically aligned with the query. statistics in [PITH_FULL_IMAGE:figures/full_fig_p005_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Self-Cosine Similarity Distributions of Raster Image Embeddings and Embeddings Combining Raster Images with generated SVG Code in Qwen2.5-VL-7B. We find that the em￾beddings combining raster images and generated SVG code tended to have lower cosine similarity scores in all settings. This suggests that adding generated SVG code can lead to more distinct and use￾ful embeddings. 4.4. Ablation Studies Ablation… view at source ↗
Figure 5
Figure 5. Figure 5: Ablation on Layer Index for Raster Image Embed￾ding Extraction. We evaluate retrieval performance across layers when selecting the last token embedding. Performance generally improves in the later layers, with the penultimate layer consistently offering the best trade-off between stability and retrieval quality [PITH_FULL_IMAGE:figures/full_fig_p008_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Self-Cosine Similarity Distributions of Raster Image Embeddings in Qwen2.5-VL-7B, evaluated across layers and pooling strategies. (top) Self-similarity distributions when ex￾tracting the last-token embedding from different layers. Using the penultimate layer shifts the distribution toward lower cosine sim￾ilarity values compared to the last layer, indicating more diverse and expressive representations. (bo… view at source ↗
read the original abstract

Scalable Vector Graphics (SVGs) function both as visual images and as structured code that encode rich geometric and layout information, yet most methods rasterize them and discard this symbolic organization. At the same time, recent sentence embedding methods produce strong text representations but do not naturally extend to visual or structured modalities. We propose a training-free, instruction-guided multimodal embedding framework that uses a Multimodal Large Language Model (MLLM) to map text, raster images, and SVG code into an aligned embedding space. We control the direction of embeddings through modality-specific instructions and structural SVG cues, eliminating the need for learned projection heads or contrastive training. Our method has two key components: (1) Multimodal Explicit One-word Limitation (mEOL), which instructs the MLLM to summarize any multimodal input into a single token whose hidden state serves as a compact semantic embedding. (2) A semantic SVG rewriting module that assigns meaningful identifiers and simplifies nested SVG elements through visual reasoning over the rendered image, exposing geometric and relational cues hidden in raw code. Using a repurposed VGBench, we build the first text-to-SVG retrieval benchmark and show that our training-free embeddings outperform encoder-based and training-based multimodal baselines. These results highlight prompt-level control as an effective alternative to parameter-level training for structure-aware multimodal retrieval. Project page: https://scene-the-ella.github.io/meol/

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes mEOL, a training-free multimodal embedding method that repurposes an off-the-shelf MLLM to map text, raster images, and SVG code into a shared space. Modality-specific instructions plus the Multimodal Explicit One-word Limitation (mEOL) force the model to produce a single-token summary whose hidden state is taken as the embedding; a semantic SVG rewriting module uses visual reasoning on the rendered image to assign meaningful identifiers and simplify nested elements. The authors repurpose VGBench into the first text-to-SVG retrieval benchmark and report that the resulting embeddings outperform both encoder-based and training-based multimodal baselines.

Significance. If the empirical claims hold, the work is significant for demonstrating that prompt-level control and forced single-token summarization can substitute for learned projection heads or contrastive objectives in structure-aware retrieval tasks. The training-free design and the new text-to-SVG benchmark are concrete contributions that could lower barriers for incorporating vector graphics into multimodal systems.

major comments (2)
  1. [§3.1] §3.1 (mEOL definition): The central claim that the hidden state of the forced one-word summary token yields semantically aligned cross-modal embeddings rests on an untested assumption. No analysis, ablation, or visualization is provided showing that cosine similarity between a text prompt and its corresponding SVG code reflects semantic similarity rather than instruction-following artifacts or the MLLM's next-token-prediction geometry. This alignment is load-bearing for attributing outperformance to the method.
  2. [§4.3] §4.3 and Table 2 (retrieval results): The reported outperformance on the repurposed VGBench text-to-SVG task lacks error bars, statistical significance tests, and an ablation isolating the contribution of the SVG rewriting module versus the base mEOL prompting. Without these, it is unclear whether gains are robust or driven by benchmark construction choices.
minor comments (2)
  1. [§3] The acronym mEOL is introduced in the abstract and §3 but its expansion ('Multimodal Explicit One-word Limitation') is not restated at first use in the method section, which may confuse readers.
  2. [Figure 2] Figure 2 (SVG rewriting examples): The before/after code snippets would benefit from explicit highlighting of the geometric cues that the visual-reasoning step is claimed to expose.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback. We address each major comment below and commit to revisions that directly respond to the concerns raised.

read point-by-point responses
  1. Referee: [§3.1] §3.1 (mEOL definition): The central claim that the hidden state of the forced one-word summary token yields semantically aligned cross-modal embeddings rests on an untested assumption. No analysis, ablation, or visualization is provided showing that cosine similarity between a text prompt and its corresponding SVG code reflects semantic similarity rather than instruction-following artifacts or the MLLM's next-token-prediction geometry. This alignment is load-bearing for attributing outperformance to the method.

    Authors: We acknowledge that the current manuscript does not contain explicit ablations, visualizations, or analyses isolating semantic alignment from potential instruction-following or geometric artifacts in the MLLM's hidden states. The performance gains on the text-to-SVG retrieval task provide supporting evidence through comparison to baselines, but we agree this is indirect. In the revised version we will add t-SNE visualizations of cross-modal embeddings and an ablation contrasting mEOL's single-token constraint against standard multi-token prompting to more directly substantiate the alignment claim. revision: yes

  2. Referee: [§4.3] §4.3 and Table 2 (retrieval results): The reported outperformance on the repurposed VGBench text-to-SVG task lacks error bars, statistical significance tests, and an ablation isolating the contribution of the SVG rewriting module versus the base mEOL prompting. Without these, it is unclear whether gains are robust or driven by benchmark construction choices.

    Authors: We agree that the absence of error bars, significance testing, and a targeted ablation limits the strength of the empirical claims. We will recompute the retrieval metrics with multiple runs to include error bars and paired statistical significance tests in Table 2. We will also add a new ablation subsection in §4.3 that reports results for mEOL with and without the semantic SVG rewriting module (holding all other components fixed) to isolate its specific contribution. revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical claims rest on benchmark comparisons rather than self-referential derivations

full rationale

The paper describes a prompting-based method (mEOL single-token extraction plus SVG rewriting via visual reasoning) applied to an off-the-shelf MLLM, then reports retrieval performance on a repurposed VGBench benchmark against encoder and training-based baselines. No equations, fitted parameters, or derivations are presented that reduce to the inputs by construction. No self-citations are invoked as load-bearing uniqueness theorems or ansatzes. The central claims are therefore externally falsifiable through the reported empirical comparisons and do not collapse into self-definition or renaming of known results.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 2 invented entities

The central claim rests on the capability of MLLMs to generate useful embeddings via prompted single-token outputs and the effectiveness of visual reasoning for SVG simplification, without independent evidence provided in the abstract.

axioms (2)
  • domain assumption Multimodal LLMs can effectively summarize inputs from different modalities into a single token whose hidden state serves as a semantic embedding.
    This is the basis for the mEOL component described in the abstract.
  • domain assumption Visual reasoning by the MLLM over rendered SVG images can accurately assign meaningful identifiers and simplify nested elements to expose geometric cues.
    This underpins the semantic SVG rewriting module.
invented entities (2)
  • mEOL (Multimodal Explicit One-word Limitation) no independent evidence
    purpose: To create compact, aligned embeddings from multimodal inputs without training.
    Newly proposed technique in the paper.
  • Semantic SVG rewriting module no independent evidence
    purpose: To preprocess SVG code for better understanding by the MLLM using visual cues.
    Novel preprocessing step introduced.

pith-pipeline@v0.9.0 · 5566 in / 1530 out tokens · 59226 ms · 2026-05-10T06:43:52.547120+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

46 extracted references · 24 canonical work pages · 7 internal anchors

  1. [1]

    Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2. 5-vl technical report.arXiv preprint arXiv:2502.13923, 2025. 5, 6

  2. [2]

    Beyond the highlights: Video retrieval with salient and surrounding contexts

    Jaehun Bang, Moon Ye-Bin, Tae-Hyun Oh, and Kyungdon Joo. Beyond the highlights: Video retrieval with salient and surrounding contexts. InProceedings of the Winter Con- ference on Applications of Computer Vision (WACV), 2026. 2

  3. [3]

    Llm2vec: Large language models are secretly powerful text encoders.arXiv preprint arXiv:2404.05961, 2024

    Parishad BehnamGhader, Vaibhav Adlakha, Marius Mos- bach, Dzmitry Bahdanau, Nicolas Chapados, and Siva Reddy. Llm2vec: Large language models are secretly powerful text encoders.arXiv preprint arXiv:2404.05961, 2024. 2

  4. [4]

    Deepicon: A hierarchical network for layer-wise icon vectorization

    Qi Bing, Chaoyi Zhang, and Weidong Cai. Deepicon: A hierarchical network for layer-wise icon vectorization. In 2024 International Conference on Digital Image Computing: Techniques and Applications (DICTA), pages 70–77. IEEE,

  5. [5]

    Leveraging large language models for scalable vector graphics-driven image understanding.arXiv preprint arXiv:2306.06094, 2023

    Mu Cai, Zeyi Huang, Yuheng Li, Utkarsh Ojha, Haohan Wang, and Yong Jae Lee. Leveraging large language models for scalable vector graphics-driven image understanding.arXiv preprint arXiv:2306.06094, 2023. 2

  6. [6]

    An investigation on llms’ visual under- standing ability using svg for image-text bridging

    Mu Cai, Zeyi Huang, Yuheng Li, Utkarsh Ojha, Haohan Wang, and Yong Jae Lee. An investigation on llms’ visual under- standing ability using svg for image-text bridging. In2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pages 5377–5386. IEEE, 2025. 2

  7. [7]

    Svgformer: Representation learning for continuous vector graphics using transformers

    Defu Cao, Zhaowen Wang, Jose Echevarria, and Yan Liu. Svgformer: Representation learning for continuous vector graphics using transformers. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10093–10102, 2023. 2

  8. [8]

    Deepsvg: A hierarchical generative network for vector graphics animation.Advances in Neural Informa- tion Processing Systems, 33:16351–16361, 2020

    Alexandre Carlier, Martin Danelljan, Alexandre Alahi, and Radu Timofte. Deepsvg: A hierarchical generative network for vector graphics animation.Advances in Neural Informa- tion Processing Systems, 33:16351–16361, 2020. 1

  9. [9]

    Svgenius: Benchmarking llms in svg understanding, editing and generation.arXiv preprint arXiv:2506.03139, 2025

    Siqi Chen, Xinyu Dong, Haolei Xu, Xingyu Wu, Fei Tang, Hang Zhang, Yuchen Yan, Linjuan Wu, Wenqi Zhang, Guiyang Hou, et al. Svgenius: Benchmarking llms in svg understanding, editing and generation.arXiv preprint arXiv:2506.03139, 2025. 2

  10. [10]

    Patch- wise retrieval: A bag of practical techniques for instance- level matching

    Wonseok Choi, Sohwi Lim, Nam Hyeon-Woo, Moon Ye-Bin, Dong-Ju Jeong, Jinyoung Hwang, and Tae-Hyun Oh. Patch- wise retrieval: A bag of practical techniques for instance- level matching. InProceedings of the Winter Conference on Applications of Computer Vision (WACV), 2026. 2

  11. [11]

    Diffcse: Difference-based contrastive learning for sentence embeddings.arXiv preprint arXiv:2204.10298, 2022

    Yung-Sung Chuang, Rumen Dangovski, Hongyin Luo, Yang Zhang, Shiyu Chang, Marin Soljaˇci´c, Shang-Wen Li, Wen-tau Yih, Yoon Kim, and James Glass. Diffcse: Difference-based contrastive learning for sentence embeddings.arXiv preprint arXiv:2204.10298, 2022. 2

  12. [12]

    How contextual are contextualized word representations? comparing the geometry of bert, elmo, and gpt-2 embeddings.arXiv preprint arXiv:1909.00512, 2019

    Kawin Ethayarajh. How contextual are contextualized word representations? comparing the geometry of bert, elmo, and gpt-2 embeddings.arXiv preprint arXiv:1909.00512, 2019. 8

  13. [13]

    iuniverse Bloomington,

    Jon Ferraiolo, Fujisawa Jun, and Dean Jackson.Scalable vec- tor graphics (SVG) 1.0 specification. iuniverse Bloomington,

  14. [14]

    SimCSE: Simple Contrastive Learning of Sentence Embeddings

    Tianyu Gao, Xingcheng Yao, and Danqi Chen. Simcse: Simple contrastive learning of sentence embeddings.arXiv preprint arXiv:2104.08821, 2021. 2

  15. [15]

    The Llama 3 Herd of Models

    Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Ab- hinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The llama 3 herd of models.arXiv preprint arXiv:2407.21783,

  16. [16]

    Vec- torpainter: Advanced stylized vector graphics synthesis using stroke-style priors.arXiv preprint arXiv:2405.02962, 2024

    Juncheng Hu, Ximing Xing, Jing Zhang, and Qian Yu. Vec- torpainter: Advanced stylized vector graphics synthesis using stroke-style priors.arXiv preprint arXiv:2405.02962, 2024. 2

  17. [17]

    Vectorfusion: Text- to-svg by abstracting pixel-based diffusion models

    Ajay Jain, Amber Xie, and Pieter Abbeel. Vectorfusion: Text- to-svg by abstracting pixel-based diffusion models. InPro- ceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1911–1920, 2023. 2

  18. [18]

    Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lu- cile Saulnier, L ´elio Renard Lavaud, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timoth´ee Lacroix, and William El Sayed. Mistral 7b, 2023. 6

  19. [19]

    PromptBERT: Improving BERT Sentence Embeddings with Prompts,

    Ting Jiang, Jian Jiao, Shaohan Huang, Zihan Zhang, Deqing Wang, Fuzhen Zhuang, Furu Wei, Haizhen Huang, Denvy Deng, and Qi Zhang. Promptbert: Improving bert sentence embeddings with prompts.arXiv preprint arXiv:2201.04337,

  20. [20]

    Scaling sentence embeddings with large language models

    Ting Jiang, Shaohan Huang, Zhongzhi Luan, Deqing Wang, and Fuzhen Zhuang. Scaling sentence embeddings with large language models. InFindings of the Association for Compu- tational Linguistics: EMNLP 2024, pages 3182–3196, 2024. 2, 6, 7

  21. [21]

    Jiang, R

    Ziyan Jiang, Rui Meng, Xinyi Yang, Semih Yavuz, Yingbo Zhou, and Wenhu Chen. Vlm2vec: Training vision-language models for massive multimodal embedding tasks.arXiv preprint arXiv:2410.05160, 2024. 3, 5, 6

  22. [22]

    12 EmbeddingGemma: Powerful and Lightweight Text Representations C

    Zhibin Lan, Liqiang Niu, Fandong Meng, Jie Zhou, and Jin- song Su. Llave: Large language and vision embedding models with hardness-weighted contrastive learning.arXiv preprint arXiv:2503.04812, 2025. 3, 5, 6

  23. [23]

    NV-Embed: Improved Techniques for Training LLMs as Generalist Embedding Models

    Chankyu Lee, Rajarshi Roy, Mengyao Xu, Jonathan Raiman, Mohammad Shoeybi, Bryan Catanzaro, and Wei Ping. Nv- embed: Improved techniques for training llms as generalist embedding models.arXiv preprint arXiv:2405.17428, 2024. 2

  24. [24]

    Meta-task prompting elicits em- beddings from large language models.arXiv preprint arXiv:2402.18458, 2024

    Yibin Lei, Di Wu, Tianyi Zhou, Tao Shen, Yu Cao, Chongyang Tao, and Andrew Yates. Meta-task prompting elicits em- beddings from large language models.arXiv preprint arXiv:2402.18458, 2024. 2, 7

  25. [25]

    Align be- fore fuse: Vision and language representation learning with momentum distillation

    Junnan Li, Ramprasaath Selvaraju, Akhilesh Gotmare, Shafiq Joty, Caiming Xiong, and Steven Chu Hong Hoi. Align be- fore fuse: Vision and language representation learning with momentum distillation. InAdvances in Neural Information Processing Systems, pages 9694–9705. Curran Associates, Inc., 2021. 2

  26. [26]

    Blip: Bootstrapping language-image pre-training for unified vision- language understanding and generation

    Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. Blip: Bootstrapping language-image pre-training for unified vision- language understanding and generation. InInternational conference on machine learning, pages 12888–12900. PMLR,

  27. [27]

    A learned representation for scalable vector graphics

    Raphael Gontijo Lopes, David Ha, Douglas Eck, and Jonathon Shlens. A learned representation for scalable vector graphics. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 7930–7939, 2019. 2

  28. [28]

    Sfrembedding-mistral: en- hance text retrieval with transfer learning.Salesforce AI Research Blog, 3:6, 2024

    Rui Meng, Ye Liu, Shafiq Rayhan Joty, Caiming Xiong, Yingbo Zhou, and Semih Yavuz. Sfrembedding-mistral: en- hance text retrieval with transfer learning.Salesforce AI Research Blog, 3:6, 2024. 2

  29. [29]

    MTEB: Massive Text Embedding Benchmark

    Niklas Muennighoff, Nouamane Tazi, Lo¨ıc Magne, and Nils Reimers. Mteb: Massive text embedding benchmark.arXiv preprint arXiv:2210.07316, 2022. 2

  30. [30]

    Gen- erative representational instruction tuning

    Niklas Muennighoff, SU Hongjin, Liang Wang, Nan Yang, Furu Wei, Tao Yu, Amanpreet Singh, and Douwe Kiela. Gen- erative representational instruction tuning. InICLR 2024 Workshop: How Far Are We From AGI, 2024. 2

  31. [31]

    arXiv preprint arXiv:2108.08877 , year=

    Jianmo Ni, Gustavo Hernandez Abrego, Noah Constant, Ji Ma, Keith B Hall, Daniel Cer, and Yinfei Yang. Sentence- t5: Scalable sentence encoders from pre-trained text-to-text models.arXiv preprint arXiv:2108.08877, 2021. 2

  32. [32]

    Can large language models understand symbolic graphics programs? arXiv preprint arXiv:2408.08313, 2024

    Zeju Qiu, Weiyang Liu, Haiwen Feng, Zhen Liu, Tim Z Xiao, Katherine M Collins, Joshua B Tenenbaum, Adrian Weller, Michael J Black, and Bernhard Sch¨olkopf. Can large language models understand symbolic graphics programs?(2024).URL https://arxiv. org/abs/2408.08313, 57. 2

  33. [33]

    Learning transferable visual models from natural language supervi- sion

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervi- sion. InInternational conference on machine learning, pages 8748–8763. PmLR, 2021. 2, 5

  34. [34]

    Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks

    Nils Reimers and Iryna Gurevych. Sentence-bert: Sentence embeddings using siamese bert-networks.arXiv preprint arXiv:1908.10084, 2019. 2

  35. [35]

    Starvector: Generating scalable vector graphics code from images.CVPR, 2025

    Juan A Rodriguez, Shubham Agarwal, Issam H Laradji, Pau Rodriguez, David Vazquez, Christopher Pal, and Marco Ped- ersoli. Starvector: Generating scalable vector graphics code from images.CVPR, 2025. 2

  36. [36]

    Layertracer: Cognitive-aligned layered svg synthesis via diffusion transformer.arXiv preprint arXiv:2502.01105, 2025

    Yiren Song, Danze Chen, and Mike Zheng Shou. Layer- tracer: Cognitive-aligned layered svg synthesis via diffusion transformer.arXiv preprint arXiv:2502.01105, 2025. 2

  37. [37]

    Strokenuwa: Tokeniz- ing strokes for vector graphic synthesis.arXiv preprint arXiv:2401.17093, 2024

    Zecheng Tang, Chenfei Wu, Zekai Zhang, Mingheng Ni, Shengming Yin, Yu Liu, Zhengyuan Yang, Lijuan Wang, Zicheng Liu, Juntao Li, et al. Strokenuwa: Tokeniz- ing strokes for vector graphic synthesis.arXiv preprint arXiv:2401.17093, 2024. 2

  38. [38]

    Geneol: Harnessing the generative power of llms for training-free sen- tence embeddings.arXiv preprint arXiv:2410.14635, 2024

    Raghuveer Thirukovalluru and Bhuwan Dhingra. Geneol: Harnessing the generative power of llms for training-free sen- tence embeddings.arXiv preprint arXiv:2410.14635, 2024. 2, 4, 7

  39. [39]

    Visually de- scriptive language model for vector graphics reasoning.arXiv preprint arXiv:2404.06479, 2024

    Zhenhailong Wang, Joy Hsu, Xingyao Wang, Kuan-Hao Huang, Manling Li, Jiajun Wu, and Heng Ji. Visually de- scriptive language model for vector graphics reasoning.arXiv preprint arXiv:2404.06479, 2024. 2, 4

  40. [40]

    Chat2svg: Vector graphics generation with large language models and image diffusion models.CVPR, 2024

    Ronghuan Wu, Wanchao Su, and Jing Liao. Chat2svg: Vector graphics generation with large language models and image diffusion models.CVPR, 2024. 2

  41. [41]

    SVGFusion: A VAE-Diffusion Transformer for Vector Graphic Generation

    Ximing Xing, Juncheng Hu, Jing Zhang, Dong Xu, and Qian Yu. Svgfusion: Scalable text-to-svg generation via vector space diffusion.arXiv preprint arXiv:2412.10437, 2024. 2

  42. [42]

    Svgdreamer: Text guided svg generation with diffusion model

    Ximing Xing, Haitao Zhou, Chuang Wang, Jing Zhang, Dong Xu, and Qian Yu. Svgdreamer: Text guided svg generation with diffusion model. InProceedings of the IEEE/CVF Con- ference on Computer Vision and Pattern Recognition, pages 4546–4555, 2024. 2

  43. [43]

    Omnisvg: A unified scalable vector graphics generation model.arXiv preprint arXiv:2504.06263, 2025

    Yiying Yang, Wei Cheng, Sijin Chen, Xianfang Zeng, Jiaxu Zhang, Liao Wang, Gang Yu, Xingjun Ma, and Yu-Gang Jiang. Omnisvg: A unified scalable vector graphics generation model.arXiv preprint arXiv:2504.06263, 2025. 2

  44. [44]

    Sigmoid loss for language image pre-training

    Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, and Lucas Beyer. Sigmoid loss for language image pre-training. InProceedings of the IEEE/CVF international conference on computer vision, pages 11975–11986, 2023. 2, 5

  45. [45]

    Simple tech- niques for enhancing sentence embeddings in generative lan- guage models

    Bowen Zhang, Kehua Chang, and Chunping Li. Simple tech- niques for enhancing sentence embeddings in generative lan- guage models. InInternational Conference on Intelligent Computing, pages 52–64. Springer, 2024. 2, 4, 6, 7

  46. [46]

    Vg- bench: Evaluating large language models on vector graphics understanding and generation.CoRR, 2024

    Bocheng Zou, Mu Cai, Jianrui Zhang, and Yong Jae Lee. Vg- bench: Evaluating large language models on vector graphics understanding and generation.CoRR, 2024. 2