Recognition: unknown
mEOL: Training-Free Instruction-Guided Multimodal Embedder for Vector Graphics and Image Retrieval
Pith reviewed 2026-05-10 06:43 UTC · model grok-4.3
The pith
A training-free method aligns embeddings for text, images, and SVG code by instructing a multimodal LLM to summarize each input as a single token.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors claim that modality-specific instructions, combined with an explicit one-word output limitation in an MLLM, produce hidden states that serve as aligned embeddings for text, raster images, and SVG code, and that semantic rewriting of SVG code using visual reasoning further improves structure awareness, leading to better retrieval performance than trained alternatives on the introduced benchmark.
What carries the argument
The Multimodal Explicit One-word Limitation (mEOL) mechanism, which directs the MLLM to produce a single-token summary for any input modality, with that token's hidden state used as the embedding; this is supported by a semantic SVG rewriting module that uses visual reasoning to expose geometric cues.
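To make the mechanism concrete, here is a minimal sketch of the text/SVG-code branch, assuming a HuggingFace-style instruction-tuned causal LM. The checkpoint name and prompt wording are illustrative assumptions rather than the paper's actual choices, and the image branch (which needs a vision-capable MLLM) is omitted.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "Qwen/Qwen2.5-0.5B-Instruct"  # illustrative stand-in, not the paper's model
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL)
model.eval()

def meol_embed(content: str, modality: str) -> torch.Tensor:
    # Modality-specific instruction with an explicit one-word limitation
    # (assumed wording, not quoted from the paper).
    prompt = f'Summarize this {modality} in exactly one word: "{content}"\nThe one word is:'
    inputs = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    # Final-layer hidden state at the position where the single summary token
    # is produced; L2-normalize so dot products are cosine similarities.
    h = out.hidden_states[-1][0, -1, :]
    return h / h.norm()

svg = '<svg xmlns="http://www.w3.org/2000/svg"><circle cx="5" cy="5" r="4" fill="red"/></svg>'
sim = torch.dot(meol_embed(svg, "SVG code"), meol_embed("a red circle", "text"))
print(f"cosine similarity: {sim.item():.3f}")
```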
If this is right
- Embeddings for SVGs can be generated without rasterization or specialized training.
- Prompt control replaces the need for learned projection layers in multimodal alignment.
- The first dedicated benchmark for text-to-SVG retrieval is established and handled effectively by this method.
- Structured code like SVG benefits from explicit rewriting based on visual interpretation.
Where Pith is reading between the lines
- Similar single-token prompting might enable zero-shot alignment in other domains such as 3D models or molecular structures.
- This could make multimodal retrieval more accessible by eliminating the need for large training datasets.
- The method implies that MLLMs' pre-trained knowledge suffices for cross-modal semantic matching when guided properly.
Load-bearing premise
That the hidden state corresponding to the single summary token generated under modality-specific instructions carries semantic meaning that is consistent and aligned across text, images, and SVG representations without further training.
What would settle it
Demonstrating that the proposed embeddings do not achieve higher retrieval accuracy than the encoder-based and training-based baselines on the text-to-SVG benchmark would falsify the superiority claim.
Original abstract
Scalable Vector Graphics (SVGs) function both as visual images and as structured code that encodes rich geometric and layout information, yet most methods rasterize them and discard this symbolic organization. At the same time, recent sentence embedding methods produce strong text representations but do not naturally extend to visual or structured modalities. We propose a training-free, instruction-guided multimodal embedding framework that uses a Multimodal Large Language Model (MLLM) to map text, raster images, and SVG code into an aligned embedding space. We control the direction of embeddings through modality-specific instructions and structural SVG cues, eliminating the need for learned projection heads or contrastive training. Our method has two key components: (1) Multimodal Explicit One-word Limitation (mEOL), which instructs the MLLM to summarize any multimodal input into a single token whose hidden state serves as a compact semantic embedding. (2) A semantic SVG rewriting module that assigns meaningful identifiers and simplifies nested SVG elements through visual reasoning over the rendered image, exposing geometric and relational cues hidden in raw code. Using a repurposed VGBench, we build the first text-to-SVG retrieval benchmark and show that our training-free embeddings outperform encoder-based and training-based multimodal baselines. These results highlight prompt-level control as an effective alternative to parameter-level training for structure-aware multimodal retrieval. Project page: https://scene-the-ella.github.io/meol/
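To ground the abstract's second component, the following is a minimal sketch of what a semantic SVG rewriting step could look like. `mllm_chat`, the instruction wording, and the control flow are assumptions for illustration, not the paper's implementation; only `cairosvg.svg2png` is a real library call.

```python
import cairosvg  # pip install cairosvg; renders SVG source to PNG bytes

# Instruction paraphrasing the rewriting module's stated goal (assumed wording).
REWRITE_INSTRUCTION = (
    "You are given a rendered image and the SVG source that produced it. "
    "Rewrite the SVG so that every element id names what it depicts "
    "(e.g. id='sun', id='left_wheel') and flatten redundant nested <g> "
    "groups. Do not change any geometry."
)

def rewrite_svg(svg_code: str, mllm_chat) -> str:
    # Render the code so the MLLM can reason over the image, not just tokens.
    png = cairosvg.svg2png(bytestring=svg_code.encode("utf-8"))
    # mllm_chat is a hypothetical callable wrapping any vision-language chat
    # API, with signature (image_png_bytes, text_prompt) -> response string.
    return mllm_chat(png, f"{REWRITE_INSTRUCTION}\n\n{svg_code}")
```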
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes mEOL, a training-free multimodal embedding method that repurposes an off-the-shelf MLLM to map text, raster images, and SVG code into a shared space. Modality-specific instructions plus the Multimodal Explicit One-word Limitation (mEOL) force the model to produce a single-token summary whose hidden state is taken as the embedding; a semantic SVG rewriting module uses visual reasoning on the rendered image to assign meaningful identifiers and simplify nested elements. The authors repurpose VGBench into the first text-to-SVG retrieval benchmark and report that the resulting embeddings outperform both encoder-based and training-based multimodal baselines.
Significance. If the empirical claims hold, the work is significant for demonstrating that prompt-level control and forced single-token summarization can substitute for learned projection heads or contrastive objectives in structure-aware retrieval tasks. The training-free design and the new text-to-SVG benchmark are concrete contributions that could lower barriers for incorporating vector graphics into multimodal systems.
Major comments (2)
- [§3.1] mEOL definition: The central claim that the hidden state of the forced one-word summary token yields semantically aligned cross-modal embeddings rests on an untested assumption. No analysis, ablation, or visualization is provided showing that cosine similarity between a text prompt and its corresponding SVG code reflects semantic similarity rather than instruction-following artifacts or the MLLM's next-token-prediction geometry. This alignment is load-bearing for attributing outperformance to the method.
- [§4.3] Table 2 (retrieval results): The reported outperformance on the repurposed VGBench text-to-SVG task lacks error bars, statistical significance tests, and an ablation isolating the contribution of the SVG rewriting module versus the base mEOL prompting. Without these, it is unclear whether gains are robust or driven by benchmark construction choices. (A minimal sketch of the underlying Recall@k evaluation follows this list.)
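The retrieval metric at issue reduces to Recall@k over cosine similarities between query and gallery embeddings. A minimal sketch, assuming precomputed, L2-normalized mEOL embeddings with one ground-truth SVG per text query (all names here are illustrative):

```python
import torch

def recall_at_k(query_embs: torch.Tensor, gallery_embs: torch.Tensor, k: int) -> float:
    """Recall@k for 1:1 text-to-SVG retrieval, where query i's true match is
    gallery item i. Inputs are (N, d) and assumed L2-normalized, so the
    matrix product gives cosine similarities."""
    sims = query_embs @ gallery_embs.T     # (N, N) similarity matrix
    topk = sims.topk(k, dim=1).indices     # k nearest gallery items per query
    targets = torch.arange(len(query_embs)).unsqueeze(1)
    return (topk == targets).any(dim=1).float().mean().item()

# Hypothetical usage with precomputed mEOL embeddings:
# print(recall_at_k(text_embs, svg_embs, k=1))
```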
Minor comments (2)
- [§3] The acronym mEOL is introduced in the abstract and §3 but its expansion ('Multimodal Explicit One-word Limitation') is not restated at first use in the method section, which may confuse readers.
- [Figure 2] Figure 2 (SVG rewriting examples): The before/after code snippets would benefit from explicit highlighting of the geometric cues that the visual-reasoning step is claimed to expose.
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed feedback. We address each major comment below and commit to revisions that directly respond to the concerns raised.
Point-by-point responses
- Referee: [§3.1] mEOL definition: The central claim that the hidden state of the forced one-word summary token yields semantically aligned cross-modal embeddings rests on an untested assumption. No analysis, ablation, or visualization is provided showing that cosine similarity between a text prompt and its corresponding SVG code reflects semantic similarity rather than instruction-following artifacts or the MLLM's next-token-prediction geometry. This alignment is load-bearing for attributing outperformance to the method.
  Authors: We acknowledge that the current manuscript does not contain explicit ablations, visualizations, or analyses isolating semantic alignment from potential instruction-following or geometric artifacts in the MLLM's hidden states. The performance gains on the text-to-SVG retrieval task provide supporting evidence through comparison to baselines, but we agree this is indirect. In the revised version we will add t-SNE visualizations of cross-modal embeddings and an ablation contrasting mEOL's single-token constraint against standard multi-token prompting to more directly substantiate the alignment claim (a minimal sketch of such a visualization follows). Revision: yes.
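The t-SNE check the authors commit to could look like the following sketch, with random arrays standing in for real mEOL embeddings of the same concepts in three modalities; array shapes and names are illustrative assumptions. Under the alignment claim, points for the same concept should cluster across modalities.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# Stand-ins for precomputed mEOL embeddings of 50 concepts per modality;
# real embeddings would replace these random arrays.
rng = np.random.default_rng(0)
emb_text, emb_image, emb_svg = [rng.normal(size=(50, 64)) for _ in range(3)]

X = np.vstack([emb_text, emb_image, emb_svg])
points = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X)

for i, (name, color) in enumerate([("text", "tab:blue"),
                                   ("image", "tab:orange"),
                                   ("svg", "tab:green")]):
    block = points[i * 50:(i + 1) * 50]
    plt.scatter(block[:, 0], block[:, 1], s=10, c=color, label=name)
plt.legend()
plt.savefig("tsne_meol.png")
```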
- Referee: [§4.3] Table 2 (retrieval results): The reported outperformance on the repurposed VGBench text-to-SVG task lacks error bars, statistical significance tests, and an ablation isolating the contribution of the SVG rewriting module versus the base mEOL prompting. Without these, it is unclear whether gains are robust or driven by benchmark construction choices.
  Authors: We agree that the absence of error bars, significance testing, and a targeted ablation limits the strength of the empirical claims. We will recompute the retrieval metrics over multiple runs to include error bars and paired statistical significance tests in Table 2 (a minimal sketch of such a test follows). We will also add an ablation subsection in §4.3 that reports results for mEOL with and without the semantic SVG rewriting module, holding all other components fixed, to isolate its specific contribution. Revision: yes.
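The committed error bars and paired tests could be produced along these lines. A minimal sketch with synthetic per-query Recall@1 hits standing in for real evaluation outputs; for binary paired outcomes, McNemar's test would be the stricter choice than the paired t-test shown here.

```python
import numpy as np
from scipy import stats

# Per-query hit indicators (0/1) for mEOL and a baseline on the same 500
# queries; synthetic stand-ins for real evaluation outputs.
rng = np.random.default_rng(1)
hits_meol = rng.binomial(1, 0.62, size=500)
hits_base = rng.binomial(1, 0.55, size=500)

# Paired t-test on per-query outcomes.
t_stat, p_value = stats.ttest_rel(hits_meol, hits_base)
print(f"Recall@1 {hits_meol.mean():.3f} vs {hits_base.mean():.3f}, p={p_value:.4f}")

# Bootstrap 95% confidence interval on the Recall@1 difference.
idx = rng.integers(0, 500, size=(2000, 500))  # resampled query indices
diffs = hits_meol[idx].mean(axis=1) - hits_base[idx].mean(axis=1)
print("95% CI for the difference:", np.percentile(diffs, [2.5, 97.5]))
```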
Circularity Check
No significant circularity; empirical claims rest on benchmark comparisons rather than self-referential derivations
Full rationale
The paper describes a prompting-based method (mEOL single-token extraction plus SVG rewriting via visual reasoning) applied to an off-the-shelf MLLM, then reports retrieval performance on a repurposed VGBench benchmark against encoder and training-based baselines. No equations, fitted parameters, or derivations are presented that reduce to the inputs by construction. No self-citations are invoked as load-bearing uniqueness theorems or ansatzes. The central claims are therefore externally falsifiable through the reported empirical comparisons and do not collapse into self-definition or renaming of known results.
Axiom & Free-Parameter Ledger
Axioms (2)
- Domain assumption: Multimodal LLMs can effectively summarize inputs from different modalities into a single token whose hidden state serves as a semantic embedding.
- Domain assumption: Visual reasoning by the MLLM over rendered SVG images can accurately assign meaningful identifiers and simplify nested elements to expose geometric cues.
Invented entities (2)
- mEOL (Multimodal Explicit One-word Limitation): no independent evidence
- Semantic SVG rewriting module: no independent evidence
Reference graph
Works this paper leans on
- [1] Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2.5-VL technical report. arXiv preprint arXiv:2502.13923, 2025.
- [2] Jaehun Bang, Moon Ye-Bin, Tae-Hyun Oh, and Kyungdon Joo. Beyond the highlights: Video retrieval with salient and surrounding contexts. In Proceedings of the Winter Conference on Applications of Computer Vision (WACV), 2026.
- [3] Parishad BehnamGhader, Vaibhav Adlakha, Marius Mosbach, Dzmitry Bahdanau, Nicolas Chapados, and Siva Reddy. LLM2Vec: Large language models are secretly powerful text encoders. arXiv preprint arXiv:2404.05961, 2024.
- [4] Qi Bing, Chaoyi Zhang, and Weidong Cai. DeepIcon: A hierarchical network for layer-wise icon vectorization. In 2024 International Conference on Digital Image Computing: Techniques and Applications (DICTA), pages 70–77. IEEE, 2024.
- [5] Mu Cai, Zeyi Huang, Yuheng Li, Utkarsh Ojha, Haohan Wang, and Yong Jae Lee. Leveraging large language models for scalable vector graphics-driven image understanding. arXiv preprint arXiv:2306.06094, 2023.
- [6] Mu Cai, Zeyi Huang, Yuheng Li, Utkarsh Ojha, Haohan Wang, and Yong Jae Lee. An investigation on LLMs' visual understanding ability using SVG for image-text bridging. In 2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pages 5377–5386. IEEE, 2025.
- [7] Defu Cao, Zhaowen Wang, Jose Echevarria, and Yan Liu. SVGformer: Representation learning for continuous vector graphics using transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10093–10102, 2023.
- [8] Alexandre Carlier, Martin Danelljan, Alexandre Alahi, and Radu Timofte. DeepSVG: A hierarchical generative network for vector graphics animation. Advances in Neural Information Processing Systems, 33:16351–16361, 2020.
- [9] Siqi Chen, Xinyu Dong, Haolei Xu, Xingyu Wu, Fei Tang, Hang Zhang, Yuchen Yan, Linjuan Wu, Wenqi Zhang, Guiyang Hou, et al. SVGenius: Benchmarking LLMs in SVG understanding, editing and generation. arXiv preprint arXiv:2506.03139, 2025.
- [10] Wonseok Choi, Sohwi Lim, Nam Hyeon-Woo, Moon Ye-Bin, Dong-Ju Jeong, Jinyoung Hwang, and Tae-Hyun Oh. Patch-wise retrieval: A bag of practical techniques for instance-level matching. In Proceedings of the Winter Conference on Applications of Computer Vision (WACV), 2026.
- [11] Yung-Sung Chuang, Rumen Dangovski, Hongyin Luo, Yang Zhang, Shiyu Chang, Marin Soljačić, Shang-Wen Li, Wen-tau Yih, Yoon Kim, and James Glass. DiffCSE: Difference-based contrastive learning for sentence embeddings. arXiv preprint arXiv:2204.10298, 2022.
- [12] Kawin Ethayarajh. How contextual are contextualized word representations? Comparing the geometry of BERT, ELMo, and GPT-2 embeddings. arXiv preprint arXiv:1909.00512, 2019.
- [13] Jon Ferraiolo, Fujisawa Jun, and Dean Jackson. Scalable Vector Graphics (SVG) 1.0 specification. iUniverse Bloomington, 2000.
- [14] Tianyu Gao, Xingcheng Yao, and Danqi Chen. SimCSE: Simple contrastive learning of sentence embeddings. arXiv preprint arXiv:2104.08821, 2021.
- [15] Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The Llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024.
- [16] Juncheng Hu, Ximing Xing, Jing Zhang, and Qian Yu. VectorPainter: Advanced stylized vector graphics synthesis using stroke-style priors. arXiv preprint arXiv:2405.02962, 2024.
- [17] Ajay Jain, Amber Xie, and Pieter Abbeel. VectorFusion: Text-to-SVG by abstracting pixel-based diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1911–1920, 2023.
- [18] Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, Lélio Renard Lavaud, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, and William El Sayed. Mistral 7B, 2023.
- [19] Ting Jiang, Jian Jiao, Shaohan Huang, Zihan Zhang, Deqing Wang, Fuzhen Zhuang, Furu Wei, Haizhen Huang, Denvy Deng, and Qi Zhang. PromptBERT: Improving BERT sentence embeddings with prompts. arXiv preprint arXiv:2201.04337, 2022.
- [20] Ting Jiang, Shaohan Huang, Zhongzhi Luan, Deqing Wang, and Fuzhen Zhuang. Scaling sentence embeddings with large language models. In Findings of the Association for Computational Linguistics: EMNLP 2024, pages 3182–3196, 2024.
- [21] EmbeddingGemma: Powerful and Lightweight Text Representations.
- [22] Zhibin Lan, Liqiang Niu, Fandong Meng, Jie Zhou, and Jinsong Su. LLaVE: Large language and vision embedding models with hardness-weighted contrastive learning. arXiv preprint arXiv:2503.04812, 2025.
- [23] Chankyu Lee, Rajarshi Roy, Mengyao Xu, Jonathan Raiman, Mohammad Shoeybi, Bryan Catanzaro, and Wei Ping. NV-Embed: Improved techniques for training LLMs as generalist embedding models. arXiv preprint arXiv:2405.17428, 2024.
- [24] Yibin Lei, Di Wu, Tianyi Zhou, Tao Shen, Yu Cao, Chongyang Tao, and Andrew Yates. Meta-task prompting elicits embeddings from large language models. arXiv preprint arXiv:2402.18458, 2024.
- [25] Junnan Li, Ramprasaath Selvaraju, Akhilesh Gotmare, Shafiq Joty, Caiming Xiong, and Steven Chu Hong Hoi. Align before fuse: Vision and language representation learning with momentum distillation. In Advances in Neural Information Processing Systems, pages 9694–9705. Curran Associates, Inc., 2021.
- [26] Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In International Conference on Machine Learning, pages 12888–12900. PMLR, 2022.
- [27] Raphael Gontijo Lopes, David Ha, Douglas Eck, and Jonathon Shlens. A learned representation for scalable vector graphics. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 7930–7939, 2019.
- [28] Rui Meng, Ye Liu, Shafiq Rayhan Joty, Caiming Xiong, Yingbo Zhou, and Semih Yavuz. SFR-Embedding-Mistral: Enhance text retrieval with transfer learning. Salesforce AI Research Blog, 3:6, 2024.
- [29] Niklas Muennighoff, Nouamane Tazi, Loïc Magne, and Nils Reimers. MTEB: Massive text embedding benchmark. arXiv preprint arXiv:2210.07316, 2022.
- [30] Niklas Muennighoff, Hongjin Su, Liang Wang, Nan Yang, Furu Wei, Tao Yu, Amanpreet Singh, and Douwe Kiela. Generative representational instruction tuning. In ICLR 2024 Workshop: How Far Are We From AGI, 2024.
- [31] Jianmo Ni, Gustavo Hernandez Abrego, Noah Constant, Ji Ma, Keith B Hall, Daniel Cer, and Yinfei Yang. Sentence-T5: Scalable sentence encoders from pre-trained text-to-text models. arXiv preprint arXiv:2108.08877, 2021.
- [32] Zeju Qiu, Weiyang Liu, Haiwen Feng, Zhen Liu, Tim Z Xiao, Katherine M Collins, Joshua B Tenenbaum, Adrian Weller, Michael J Black, and Bernhard Schölkopf. Can large language models understand symbolic graphics programs? arXiv preprint arXiv:2408.08313, 2024.
- [33] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763. PMLR, 2021.
- [34] Nils Reimers and Iryna Gurevych. Sentence-BERT: Sentence embeddings using Siamese BERT-networks. arXiv preprint arXiv:1908.10084, 2019.
- [35] Juan A Rodriguez, Shubham Agarwal, Issam H Laradji, Pau Rodriguez, David Vazquez, Christopher Pal, and Marco Pedersoli. StarVector: Generating scalable vector graphics code from images. CVPR, 2025.
- [36] Yiren Song, Danze Chen, and Mike Zheng Shou. LayerTracer: Cognitive-aligned layered SVG synthesis via diffusion transformer. arXiv preprint arXiv:2502.01105, 2025.
- [37] Zecheng Tang, Chenfei Wu, Zekai Zhang, Mingheng Ni, Shengming Yin, Yu Liu, Zhengyuan Yang, Lijuan Wang, Zicheng Liu, Juntao Li, et al. StrokeNUWA: Tokenizing strokes for vector graphic synthesis. arXiv preprint arXiv:2401.17093, 2024.
- [38] Raghuveer Thirukovalluru and Bhuwan Dhingra. GenEOL: Harnessing the generative power of LLMs for training-free sentence embeddings. arXiv preprint arXiv:2410.14635, 2024.
- [39] Zhenhailong Wang, Joy Hsu, Xingyao Wang, Kuan-Hao Huang, Manling Li, Jiajun Wu, and Heng Ji. Visually descriptive language model for vector graphics reasoning. arXiv preprint arXiv:2404.06479, 2024.
- [40] Ronghuan Wu, Wanchao Su, and Jing Liao. Chat2SVG: Vector graphics generation with large language models and image diffusion models. CVPR, 2024.
- [41] Ximing Xing, Juncheng Hu, Jing Zhang, Dong Xu, and Qian Yu. SVGFusion: Scalable text-to-SVG generation via vector space diffusion. arXiv preprint arXiv:2412.10437, 2024.
- [42] Ximing Xing, Haitao Zhou, Chuang Wang, Jing Zhang, Dong Xu, and Qian Yu. SVGDreamer: Text guided SVG generation with diffusion model. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4546–4555, 2024.
- [43] Yiying Yang, Wei Cheng, Sijin Chen, Xianfang Zeng, Jiaxu Zhang, Liao Wang, Gang Yu, Xingjun Ma, and Yu-Gang Jiang. OmniSVG: A unified scalable vector graphics generation model. arXiv preprint arXiv:2504.06263, 2025.
- [44] Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, and Lucas Beyer. Sigmoid loss for language image pre-training. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 11975–11986, 2023.
- [45] Bowen Zhang, Kehua Chang, and Chunping Li. Simple techniques for enhancing sentence embeddings in generative language models. In International Conference on Intelligent Computing, pages 52–64. Springer, 2024.
- [46] Bocheng Zou, Mu Cai, Jianrui Zhang, and Yong Jae Lee. VGBench: Evaluating large language models on vector graphics understanding and generation. CoRR, 2024.