Recognition: unknown
mEOL: Training-Free Instruction-Guided Multimodal Embedder for Vector Graphics and Image Retrieval
Pith reviewed 2026-05-10 06:43 UTC · model grok-4.3
The pith
A training-free method aligns embeddings for text, images, and SVG code by instructing a multimodal LLM to summarize each input as a single token.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors claim that modality-specific instructions, combined with an explicit one-word output limitation in an MLLM, produce hidden states that serve as aligned embeddings for text, raster images, and SVG code, and that semantic rewriting of SVG code using visual reasoning further improves structure awareness, leading to better retrieval performance than trained alternatives on the introduced benchmark.
What carries the argument
The Multimodal Explicit One-word Limitation (mEOL) mechanism, which directs the MLLM to produce a single-token summary for any input modality, with that token's hidden state used as the embedding; this is supported by a semantic SVG rewriting module that uses visual reasoning to expose geometric cues.
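To make the mechanism concrete, here is a minimal sketch of the text/SVG-code branch, assuming a HuggingFace-style instruction-tuned causal LM. The checkpoint name and prompt wording are illustrative assumptions rather than the paper's actual choices, and the image branch (which needs a vision-capable MLLM) is omitted.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "Qwen/Qwen2.5-0.5B-Instruct"  # illustrative stand-in, not the paper's model
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL)
model.eval()

def meol_embed(content: str, modality: str) -> torch.Tensor:
    # Modality-specific instruction with an explicit one-word limitation
    # (assumed wording, not quoted from the paper).
    prompt = f'Summarize this {modality} in exactly one word: "{content}"\nThe one word is:'
    inputs = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    # Final-layer hidden state at the position where the single summary token
    # is produced; L2-normalize so dot products are cosine similarities.
    h = out.hidden_states[-1][0, -1, :]
    return h / h.norm()

svg = '<svg xmlns="http://www.w3.org/2000/svg"><circle cx="5" cy="5" r="4" fill="red"/></svg>'
sim = torch.dot(meol_embed(svg, "SVG code"), meol_embed("a red circle", "text"))
print(f"cosine similarity: {sim.item():.3f}")
```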
If this is right
- Embeddings for SVGs can be generated without rasterization or specialized training.
- Prompt control replaces the need for learned projection layers in multimodal alignment.
- The first dedicated benchmark for text-to-SVG retrieval is established and handled effectively by this method.
- Structured code like SVG benefits from explicit rewriting based on visual interpretation.
Where Pith is reading between the lines
- Similar single-token prompting might enable zero-shot alignment in other domains such as 3D models or molecular structures.
- This could make multimodal retrieval more accessible by eliminating the need for large training datasets.
- The method implies that MLLMs' pre-trained knowledge suffices for cross-modal semantic matching when guided properly.
Load-bearing premise
That the hidden state corresponding to the single summary token generated under modality-specific instructions carries semantic meaning that is consistent and aligned across text, images, and SVG representations without further training.
What would settle it
Demonstrating that the proposed embeddings do not achieve higher retrieval accuracy than the encoder-based and training-based baselines on the text-to-SVG benchmark would falsify the superiority claim.
Original abstract
Scalable Vector Graphics (SVGs) function both as visual images and as structured code that encodes rich geometric and layout information, yet most methods rasterize them and discard this symbolic organization. At the same time, recent sentence embedding methods produce strong text representations but do not naturally extend to visual or structured modalities. We propose a training-free, instruction-guided multimodal embedding framework that uses a Multimodal Large Language Model (MLLM) to map text, raster images, and SVG code into an aligned embedding space. We control the direction of embeddings through modality-specific instructions and structural SVG cues, eliminating the need for learned projection heads or contrastive training. Our method has two key components: (1) Multimodal Explicit One-word Limitation (mEOL), which instructs the MLLM to summarize any multimodal input into a single token whose hidden state serves as a compact semantic embedding. (2) A semantic SVG rewriting module that assigns meaningful identifiers and simplifies nested SVG elements through visual reasoning over the rendered image, exposing geometric and relational cues hidden in raw code. Using a repurposed VGBench, we build the first text-to-SVG retrieval benchmark and show that our training-free embeddings outperform encoder-based and training-based multimodal baselines. These results highlight prompt-level control as an effective alternative to parameter-level training for structure-aware multimodal retrieval. Project page: https://scene-the-ella.github.io/meol/
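To ground the abstract's second component, the following is a minimal sketch of what a semantic SVG rewriting step could look like. `mllm_chat`, the instruction wording, and the control flow are assumptions for illustration, not the paper's implementation; only `cairosvg.svg2png` is a real library call.

```python
import cairosvg  # pip install cairosvg; renders SVG source to PNG bytes

# Instruction paraphrasing the rewriting module's stated goal (assumed wording).
REWRITE_INSTRUCTION = (
    "You are given a rendered image and the SVG source that produced it. "
    "Rewrite the SVG so that every element id names what it depicts "
    "(e.g. id='sun', id='left_wheel') and flatten redundant nested <g> "
    "groups. Do not change any geometry."
)

def rewrite_svg(svg_code: str, mllm_chat) -> str:
    # Render the code so the MLLM can reason over the image, not just tokens.
    png = cairosvg.svg2png(bytestring=svg_code.encode("utf-8"))
    # mllm_chat is a hypothetical callable wrapping any vision-language chat
    # API, with signature (image_png_bytes, text_prompt) -> response string.
    return mllm_chat(png, f"{REWRITE_INSTRUCTION}\n\n{svg_code}")
```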
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes mEOL, a training-free multimodal embedding method that repurposes an off-the-shelf MLLM to map text, raster images, and SVG code into a shared space. Modality-specific instructions plus the Multimodal Explicit One-word Limitation (mEOL) force the model to produce a single-token summary whose hidden state is taken as the embedding; a semantic SVG rewriting module uses visual reasoning on the rendered image to assign meaningful identifiers and simplify nested elements. The authors repurpose VGBench into the first text-to-SVG retrieval benchmark and report that the resulting embeddings outperform both encoder-based and training-based multimodal baselines.
Significance. If the empirical claims hold, the work is significant for demonstrating that prompt-level control and forced single-token summarization can substitute for learned projection heads or contrastive objectives in structure-aware retrieval tasks. The training-free design and the new text-to-SVG benchmark are concrete contributions that could lower barriers for incorporating vector graphics into multimodal systems.
Major comments (2)
- [§3.1] mEOL definition: The central claim that the hidden state of the forced one-word summary token yields semantically aligned cross-modal embeddings rests on an untested assumption. No analysis, ablation, or visualization is provided showing that cosine similarity between a text prompt and its corresponding SVG code reflects semantic similarity rather than instruction-following artifacts or the MLLM's next-token-prediction geometry. This alignment is load-bearing for attributing outperformance to the method.
- [§4.3] Table 2 (retrieval results): The reported outperformance on the repurposed VGBench text-to-SVG task lacks error bars, statistical significance tests, and an ablation isolating the contribution of the SVG rewriting module versus the base mEOL prompting. Without these, it is unclear whether gains are robust or driven by benchmark construction choices. (A minimal sketch of the underlying Recall@k evaluation follows this list.)
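The retrieval metric at issue reduces to Recall@k over cosine similarities between query and gallery embeddings. A minimal sketch, assuming precomputed, L2-normalized mEOL embeddings with one ground-truth SVG per text query (all names here are illustrative):

```python
import torch

def recall_at_k(query_embs: torch.Tensor, gallery_embs: torch.Tensor, k: int) -> float:
    """Recall@k for 1:1 text-to-SVG retrieval, where query i's true match is
    gallery item i. Inputs are (N, d) and assumed L2-normalized, so the
    matrix product gives cosine similarities."""
    sims = query_embs @ gallery_embs.T     # (N, N) similarity matrix
    topk = sims.topk(k, dim=1).indices     # k nearest gallery items per query
    targets = torch.arange(len(query_embs)).unsqueeze(1)
    return (topk == targets).any(dim=1).float().mean().item()

# Hypothetical usage with precomputed mEOL embeddings:
# print(recall_at_k(text_embs, svg_embs, k=1))
```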
Minor comments (2)
- [§3] The acronym mEOL is introduced in the abstract and §3 but its expansion ('Multimodal Explicit One-word Limitation') is not restated at first use in the method section, which may confuse readers.
- [Figure 2] Figure 2 (SVG rewriting examples): The before/after code snippets would benefit from explicit highlighting of the geometric cues that the visual-reasoning step is claimed to expose.
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed feedback. We address each major comment below and commit to revisions that directly respond to the concerns raised.
Point-by-point responses
- Referee: [§3.1] mEOL definition: The central claim that the hidden state of the forced one-word summary token yields semantically aligned cross-modal embeddings rests on an untested assumption. No analysis, ablation, or visualization is provided showing that cosine similarity between a text prompt and its corresponding SVG code reflects semantic similarity rather than instruction-following artifacts or the MLLM's next-token-prediction geometry. This alignment is load-bearing for attributing outperformance to the method.
  Authors: We acknowledge that the current manuscript does not contain explicit ablations, visualizations, or analyses isolating semantic alignment from potential instruction-following or geometric artifacts in the MLLM's hidden states. The performance gains on the text-to-SVG retrieval task provide supporting evidence through comparison to baselines, but we agree this is indirect. In the revised version we will add t-SNE visualizations of cross-modal embeddings and an ablation contrasting mEOL's single-token constraint against standard multi-token prompting to more directly substantiate the alignment claim (a minimal sketch of such a visualization follows). Revision: yes.
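The t-SNE check the authors commit to could look like the following sketch, with random arrays standing in for real mEOL embeddings of the same concepts in three modalities; array shapes and names are illustrative assumptions. Under the alignment claim, points for the same concept should cluster across modalities.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# Stand-ins for precomputed mEOL embeddings of 50 concepts per modality;
# real embeddings would replace these random arrays.
rng = np.random.default_rng(0)
emb_text, emb_image, emb_svg = [rng.normal(size=(50, 64)) for _ in range(3)]

X = np.vstack([emb_text, emb_image, emb_svg])
points = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X)

for i, (name, color) in enumerate([("text", "tab:blue"),
                                   ("image", "tab:orange"),
                                   ("svg", "tab:green")]):
    block = points[i * 50:(i + 1) * 50]
    plt.scatter(block[:, 0], block[:, 1], s=10, c=color, label=name)
plt.legend()
plt.savefig("tsne_meol.png")
```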
- Referee: [§4.3] Table 2 (retrieval results): The reported outperformance on the repurposed VGBench text-to-SVG task lacks error bars, statistical significance tests, and an ablation isolating the contribution of the SVG rewriting module versus the base mEOL prompting. Without these, it is unclear whether gains are robust or driven by benchmark construction choices.
  Authors: We agree that the absence of error bars, significance testing, and a targeted ablation limits the strength of the empirical claims. We will recompute the retrieval metrics over multiple runs to include error bars and paired statistical significance tests in Table 2 (a minimal sketch of such a test follows). We will also add an ablation subsection in §4.3 that reports results for mEOL with and without the semantic SVG rewriting module, holding all other components fixed, to isolate its specific contribution. Revision: yes.
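The committed error bars and paired tests could be produced along these lines. A minimal sketch with synthetic per-query Recall@1 hits standing in for real evaluation outputs; for binary paired outcomes, McNemar's test would be the stricter choice than the paired t-test shown here.

```python
import numpy as np
from scipy import stats

# Per-query hit indicators (0/1) for mEOL and a baseline on the same 500
# queries; synthetic stand-ins for real evaluation outputs.
rng = np.random.default_rng(1)
hits_meol = rng.binomial(1, 0.62, size=500)
hits_base = rng.binomial(1, 0.55, size=500)

# Paired t-test on per-query outcomes.
t_stat, p_value = stats.ttest_rel(hits_meol, hits_base)
print(f"Recall@1 {hits_meol.mean():.3f} vs {hits_base.mean():.3f}, p={p_value:.4f}")

# Bootstrap 95% confidence interval on the Recall@1 difference.
idx = rng.integers(0, 500, size=(2000, 500))  # resampled query indices
diffs = hits_meol[idx].mean(axis=1) - hits_base[idx].mean(axis=1)
print("95% CI for the difference:", np.percentile(diffs, [2.5, 97.5]))
```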
Circularity Check
No significant circularity; empirical claims rest on benchmark comparisons rather than self-referential derivations
Full rationale
The paper describes a prompting-based method (mEOL single-token extraction plus SVG rewriting via visual reasoning) applied to an off-the-shelf MLLM, then reports retrieval performance on a repurposed VGBench benchmark against encoder and training-based baselines. No equations, fitted parameters, or derivations are presented that reduce to the inputs by construction. No self-citations are invoked as load-bearing uniqueness theorems or ansatzes. The central claims are therefore externally falsifiable through the reported empirical comparisons and do not collapse into self-definition or renaming of known results.
Axiom & Free-Parameter Ledger
Axioms (2)
- Domain assumption: Multimodal LLMs can effectively summarize inputs from different modalities into a single token whose hidden state serves as a semantic embedding.
- Domain assumption: Visual reasoning by the MLLM over rendered SVG images can accurately assign meaningful identifiers and simplify nested elements to expose geometric cues.
Invented entities (2)
- mEOL (Multimodal Explicit One-word Limitation): no independent evidence
- Semantic SVG rewriting module: no independent evidence
Reference graph
Works this paper leans on
- [1] Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2.5-VL technical report. arXiv preprint arXiv:2502.13923, 2025.
- [2] Jaehun Bang, Moon Ye-Bin, Tae-Hyun Oh, and Kyungdon Joo. Beyond the highlights: Video retrieval with salient and surrounding contexts. In Proceedings of the Winter Conference on Applications of Computer Vision (WACV), 2026.
- [3] Parishad BehnamGhader, Vaibhav Adlakha, Marius Mosbach, Dzmitry Bahdanau, Nicolas Chapados, and Siva Reddy. LLM2Vec: Large language models are secretly powerful text encoders. arXiv preprint arXiv:2404.05961, 2024.
- [4] Qi Bing, Chaoyi Zhang, and Weidong Cai. DeepIcon: A hierarchical network for layer-wise icon vectorization. In 2024 International Conference on Digital Image Computing: Techniques and Applications (DICTA), pages 70–77. IEEE, 2024.
- [5] Mu Cai, Zeyi Huang, Yuheng Li, Utkarsh Ojha, Haohan Wang, and Yong Jae Lee. Leveraging large language models for scalable vector graphics-driven image understanding. arXiv preprint arXiv:2306.06094, 2023.
- [6] Mu Cai, Zeyi Huang, Yuheng Li, Utkarsh Ojha, Haohan Wang, and Yong Jae Lee. An investigation on LLMs' visual understanding ability using SVG for image-text bridging. In 2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pages 5377–5386. IEEE, 2025.
- [7] Defu Cao, Zhaowen Wang, Jose Echevarria, and Yan Liu. SVGformer: Representation learning for continuous vector graphics using transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10093–10102, 2023.
- [8] Alexandre Carlier, Martin Danelljan, Alexandre Alahi, and Radu Timofte. DeepSVG: A hierarchical generative network for vector graphics animation. Advances in Neural Information Processing Systems, 33:16351–16361, 2020.
- [9] Siqi Chen, Xinyu Dong, Haolei Xu, Xingyu Wu, Fei Tang, Hang Zhang, Yuchen Yan, Linjuan Wu, Wenqi Zhang, Guiyang Hou, et al. SVGenius: Benchmarking LLMs in SVG understanding, editing and generation. arXiv preprint arXiv:2506.03139, 2025.
- [10] Wonseok Choi, Sohwi Lim, Nam Hyeon-Woo, Moon Ye-Bin, Dong-Ju Jeong, Jinyoung Hwang, and Tae-Hyun Oh. Patch-wise retrieval: A bag of practical techniques for instance-level matching. In Proceedings of the Winter Conference on Applications of Computer Vision (WACV), 2026.
- [11] Yung-Sung Chuang, Rumen Dangovski, Hongyin Luo, Yang Zhang, Shiyu Chang, Marin Soljačić, Shang-Wen Li, Wen-tau Yih, Yoon Kim, and James Glass. DiffCSE: Difference-based contrastive learning for sentence embeddings. arXiv preprint arXiv:2204.10298, 2022.
- [12] Kawin Ethayarajh. How contextual are contextualized word representations? Comparing the geometry of BERT, ELMo, and GPT-2 embeddings. arXiv preprint arXiv:1909.00512, 2019.
- [13] Jon Ferraiolo, Fujisawa Jun, and Dean Jackson. Scalable Vector Graphics (SVG) 1.0 specification. iUniverse Bloomington, 2000.
- [14] Tianyu Gao, Xingcheng Yao, and Danqi Chen. SimCSE: Simple contrastive learning of sentence embeddings. arXiv preprint arXiv:2104.08821, 2021.
- [15] Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The Llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024.
- [16] Juncheng Hu, Ximing Xing, Jing Zhang, and Qian Yu. VectorPainter: Advanced stylized vector graphics synthesis using stroke-style priors. arXiv preprint arXiv:2405.02962, 2024.
- [17] Ajay Jain, Amber Xie, and Pieter Abbeel. VectorFusion: Text-to-SVG by abstracting pixel-based diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1911–1920, 2023.
- [18] Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, Lélio Renard Lavaud, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, and William El Sayed. Mistral 7B, 2023.
- [19] Ting Jiang, Jian Jiao, Shaohan Huang, Zihan Zhang, Deqing Wang, Fuzhen Zhuang, Furu Wei, Haizhen Huang, Denvy Deng, and Qi Zhang. PromptBERT: Improving BERT sentence embeddings with prompts. arXiv preprint arXiv:2201.04337, 2022.
- [20] Ting Jiang, Shaohan Huang, Zhongzhi Luan, Deqing Wang, and Fuzhen Zhuang. Scaling sentence embeddings with large language models. In Findings of the Association for Computational Linguistics: EMNLP 2024, pages 3182–3196, 2024.
- [21] EmbeddingGemma: Powerful and Lightweight Text Representations.
- [22] Zhibin Lan, Liqiang Niu, Fandong Meng, Jie Zhou, and Jinsong Su. LLaVE: Large language and vision embedding models with hardness-weighted contrastive learning. arXiv preprint arXiv:2503.04812, 2025.
- [23] Chankyu Lee, Rajarshi Roy, Mengyao Xu, Jonathan Raiman, Mohammad Shoeybi, Bryan Catanzaro, and Wei Ping. NV-Embed: Improved techniques for training LLMs as generalist embedding models. arXiv preprint arXiv:2405.17428, 2024.
- [24] Yibin Lei, Di Wu, Tianyi Zhou, Tao Shen, Yu Cao, Chongyang Tao, and Andrew Yates. Meta-task prompting elicits embeddings from large language models. arXiv preprint arXiv:2402.18458, 2024.
- [25] Junnan Li, Ramprasaath Selvaraju, Akhilesh Gotmare, Shafiq Joty, Caiming Xiong, and Steven Chu Hong Hoi. Align before fuse: Vision and language representation learning with momentum distillation. In Advances in Neural Information Processing Systems, pages 9694–9705. Curran Associates, Inc., 2021.
- [26] Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In International Conference on Machine Learning, pages 12888–12900. PMLR, 2022.
- [27] Raphael Gontijo Lopes, David Ha, Douglas Eck, and Jonathon Shlens. A learned representation for scalable vector graphics. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 7930–7939, 2019.
- [28] Rui Meng, Ye Liu, Shafiq Rayhan Joty, Caiming Xiong, Yingbo Zhou, and Semih Yavuz. SFR-Embedding-Mistral: Enhance text retrieval with transfer learning. Salesforce AI Research Blog, 3:6, 2024.
- [29] Niklas Muennighoff, Nouamane Tazi, Loïc Magne, and Nils Reimers. MTEB: Massive text embedding benchmark. arXiv preprint arXiv:2210.07316, 2022.
- [30] Niklas Muennighoff, Hongjin Su, Liang Wang, Nan Yang, Furu Wei, Tao Yu, Amanpreet Singh, and Douwe Kiela. Generative representational instruction tuning. In ICLR 2024 Workshop: How Far Are We From AGI, 2024.
- [31] Jianmo Ni, Gustavo Hernandez Abrego, Noah Constant, Ji Ma, Keith B Hall, Daniel Cer, and Yinfei Yang. Sentence-T5: Scalable sentence encoders from pre-trained text-to-text models. arXiv preprint arXiv:2108.08877, 2021.
- [32] Zeju Qiu, Weiyang Liu, Haiwen Feng, Zhen Liu, Tim Z Xiao, Katherine M Collins, Joshua B Tenenbaum, Adrian Weller, Michael J Black, and Bernhard Schölkopf. Can large language models understand symbolic graphics programs? arXiv preprint arXiv:2408.08313, 2024.
- [33] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763. PMLR, 2021.
- [34] Nils Reimers and Iryna Gurevych. Sentence-BERT: Sentence embeddings using Siamese BERT-networks. arXiv preprint arXiv:1908.10084, 2019.
- [35] Juan A Rodriguez, Shubham Agarwal, Issam H Laradji, Pau Rodriguez, David Vazquez, Christopher Pal, and Marco Pedersoli. StarVector: Generating scalable vector graphics code from images. CVPR, 2025.
- [36] Yiren Song, Danze Chen, and Mike Zheng Shou. LayerTracer: Cognitive-aligned layered SVG synthesis via diffusion transformer. arXiv preprint arXiv:2502.01105, 2025.
- [37] Zecheng Tang, Chenfei Wu, Zekai Zhang, Mingheng Ni, Shengming Yin, Yu Liu, Zhengyuan Yang, Lijuan Wang, Zicheng Liu, Juntao Li, et al. StrokeNUWA: Tokenizing strokes for vector graphic synthesis. arXiv preprint arXiv:2401.17093, 2024.
- [38] Raghuveer Thirukovalluru and Bhuwan Dhingra. GenEOL: Harnessing the generative power of LLMs for training-free sentence embeddings. arXiv preprint arXiv:2410.14635, 2024.
- [39] Zhenhailong Wang, Joy Hsu, Xingyao Wang, Kuan-Hao Huang, Manling Li, Jiajun Wu, and Heng Ji. Visually descriptive language model for vector graphics reasoning. arXiv preprint arXiv:2404.06479, 2024.
- [40] Ronghuan Wu, Wanchao Su, and Jing Liao. Chat2SVG: Vector graphics generation with large language models and image diffusion models. CVPR, 2024.
- [41] Ximing Xing, Juncheng Hu, Jing Zhang, Dong Xu, and Qian Yu. SVGFusion: Scalable text-to-SVG generation via vector space diffusion. arXiv preprint arXiv:2412.10437, 2024.
- [42] Ximing Xing, Haitao Zhou, Chuang Wang, Jing Zhang, Dong Xu, and Qian Yu. SVGDreamer: Text guided SVG generation with diffusion model. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4546–4555, 2024.
- [43] Yiying Yang, Wei Cheng, Sijin Chen, Xianfang Zeng, Jiaxu Zhang, Liao Wang, Gang Yu, Xingjun Ma, and Yu-Gang Jiang. OmniSVG: A unified scalable vector graphics generation model. arXiv preprint arXiv:2504.06263, 2025.
- [44] Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, and Lucas Beyer. Sigmoid loss for language image pre-training. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 11975–11986, 2023.
- [45] Bowen Zhang, Kehua Chang, and Chunping Li. Simple techniques for enhancing sentence embeddings in generative language models. In International Conference on Intelligent Computing, pages 52–64. Springer, 2024.
- [46] Bocheng Zou, Mu Cai, Jianrui Zhang, and Yong Jae Lee. VGBench: Evaluating large language models on vector graphics understanding and generation. CoRR, 2024.