arxiv: 2605.11307 · v1 · submitted 2026-05-11 · 💻 cs.CV · cs.LG

Recognition: no theorem link

Vision2Code: A Multi-Domain Benchmark for Evaluating Image-to-Code Generation

Ajay Vikram Periasami , Junlin Wang , Bhuwan Dhingra

Authors on Pith no claims yet

Pith reviewed 2026-05-13 05:53 UTC · model grok-4.3

classification 💻 cs.CV cs.LG

keywords image-to-code generationvision-language modelsbenchmarkmulti-domain evaluationcode generationVLM raterexecutable codereconstruction quality

0 comments

The pith

Vision2Code benchmark shows image-to-code performance depends on visual domain.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Vision2Code, a benchmark with 2,169 examples drawn from 15 domains for testing whether vision-language models can generate executable code from images without any reference code. It renders the generated code and scores the output against the original image using a VLM rater equipped with dataset-specific rubrics plus guardrails against semantic failures. This protocol matches human judgments more closely than generic visual rubrics or embedding-similarity baselines. Evaluation of nine models shows strong results on charts and graphs but clear weaknesses on spatial scenes, chemistry, documents, and circuit diagrams. The same filtered outputs can be reused as training data to raise model scores on the benchmark.

Core claim

Vision2Code contains 2,169 test examples from 15 source datasets spanning charts and plots, geometry, graphs, scientific imagery, documents, and 3D spatial scenes. Models generate executable programs that are rendered and scored against the source image by a VLM rater using dataset-specific rubrics and deterministic guardrails for severe semantic failures. Human validation shows this evaluation protocol aligns better with human judgments than generic visual rubrics or embedding-similarity baselines. Across nine open-weight and proprietary models, image-to-code performance is domain-dependent: leading models perform well on regular chart- and graph-like visuals but remain weak on spatial 3D,

What carries the argument

The Vision2Code evaluation framework that renders model-generated code and scores reconstruction quality with a VLM rater using dataset-specific rubrics and deterministic guardrails.

If this is right

Image-to-code performance varies by domain, with chart-like visuals easier than spatial scenes, documents, chemistry, or circuit diagrams.
A VLM rater with custom rubrics and guardrails produces scores that match human judgments more closely than generic rubrics or embedding similarity.
Model outputs that pass the evaluator can be filtered and reused as training data to raise image-to-code scores without paired reference programs.
Render-success diagnostics separate execution failures from reconstruction quality, allowing targeted diagnosis of model weaknesses.
The benchmark supplies a reproducible testbed for measuring and improving image-to-code generation across multiple visual domains.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same reference-free scoring approach could be applied to other multimodal generation tasks such as diagram-to-text or scene-to-code.
Domain-specific gaps suggest that targeted training on underrepresented visuals like 3D scenes and circuits would be more efficient than uniform scaling.
Embedding the VLM rater directly into a training loop might enable iterative self-improvement without additional human labels.
Future benchmarks might add temporal or interactive code outputs to test whether models can handle dynamic rather than static visuals.

Load-bearing premise

A VLM rater equipped with dataset-specific rubrics and guardrails can serve as a reliable proxy for human judgment of reconstruction quality across all 15 domains.

What would settle it

A fresh human rating study on several hundred model outputs where the VLM rater scores show low or negative correlation with human scores, or where the reported alignment advantage over generic rubrics disappears.

Figures

Figures reproduced from arXiv: 2605.11307 by Ajay Vikram Periasami, Bhuwan Dhingra, Junlin Wang.

**Figure 1.** Figure 1: Vision2Code spans 15 source datasets across six domains: charts and plots, geometry, graphs, science, documents, and spatial scenes. The domains target complementary reconstruction challenges, including axes and graphical marks, label-to-object binding, topology and directionality, domain-specific notation, dense text layout, and 3D spatial relations. Image-to-Code Benchmarks. Existing image-to-code benchm… view at source ↗

**Figure 2.** Figure 2: Vision2Code statistics. Left: per-dataset counts for test-mini and test. Right: domain-level composition. Source data and sampling. We instantiate the six domains with 15 public visual datasets, harmonized under a common per-example schema. When a source dataset provides an official test or validation split, we sample from that held-out split; otherwise, we create a deterministic held-out pool with a fixed… view at source ↗

**Figure 3.** Figure 3: Vision2Code evaluation pipeline. Generated code is rendered into an image, evaluated against the source image with a dataset-specific rubric, and aggregated with deterministic caps to produce the final benchmark score. embedding-similarity baselines compare Qwen3-VL-Embedding-8B embeddings of the source image and rendered output. In the image-only variant, each image is embedded directly. In the image+text… view at source ↗

**Figure 4.** Figure 4: Rater-filtered self-training. First-stage image-code pairs are used for SFT when the render scores well against the source (𝑅1 ≥ 𝛼) but is not stably reconstructed in the second stage (𝑅2 < 𝑅1). Execution failures are model-specific. Qwen3.5-9B fails most often through syntax or truncation errors, suggesting basic code-generation instability. Gemini-Flash-Lite-Preview is dominated by hallucinated APIs and… view at source ↗

**Figure 5.** Figure 5: Rater web interface: general rating instructions shown to human annotators [PITH_FULL_IMAGE:figures/full_fig_p024_5.png] view at source ↗

**Figure 6.** Figure 6: Rater web interface: source image (left) and candidate render (right) with the 0–5 rating controls. 24 [PITH_FULL_IMAGE:figures/full_fig_p024_6.png] view at source ↗

**Figure 7.** Figure 7: Qualitative examples before and after rater-filtered self-training. Rows show source images, off-the-shelf Qwen3.5-9B renders, and renders after fine-tuning on the (𝑅1 ≥ 𝛼, 𝑅2 < 𝑅1) subset [PITH_FULL_IMAGE:figures/full_fig_p027_7.png] view at source ↗

**Figure 8.** Figure 8: Source and rendered-output comparisons for the nine benchmarked models on the leaderboard. 29 [PITH_FULL_IMAGE:figures/full_fig_p029_8.png] view at source ↗

**Figure 9.** Figure 9: Additional source and rendered-output comparisons for the nine benchmarked models. 30 [PITH_FULL_IMAGE:figures/full_fig_p030_9.png] view at source ↗

**Figure 10.** Figure 10: Additional source and rendered-output comparisons for the nine benchmarked models. 31 [PITH_FULL_IMAGE:figures/full_fig_p031_10.png] view at source ↗

**Figure 11.** Figure 11: Additional source and rendered-output comparisons for the nine benchmarked models. J. Qualitative Rating Examples We include representative test-mini examples showing the source image, model-rendered outputs, and the score assigned by the Qwen3.5-122B-A10B-GPTQ-Int4 rater. These examples illustrate how the rater distinguishes high-fidelity recreations from outputs with missing structure, incorrect visual … view at source ↗

**Figure 12.** Figure 12: ChartQA rated renders by the Qwen3.5-122B-A10B-GPTQ-Int4 rater on test-mini examples [PITH_FULL_IMAGE:figures/full_fig_p033_12.png] view at source ↗

**Figure 13.** Figure 13: ChemVQA-2K rated renders by the Qwen3.5-122B-A10B-GPTQ-Int4 rater on test-mini examples [PITH_FULL_IMAGE:figures/full_fig_p033_13.png] view at source ↗

**Figure 14.** Figure 14: DVQA rated renders by the Qwen3.5-122B-A10B-GPTQ-Int4 rater on test-mini examples. 33 [PITH_FULL_IMAGE:figures/full_fig_p033_14.png] view at source ↗

**Figure 15.** Figure 15: EEE-Bench rated renders by the Qwen3.5-122B-A10B-GPTQ-Int4 rater on test-mini examples [PITH_FULL_IMAGE:figures/full_fig_p034_15.png] view at source ↗

**Figure 16.** Figure 16: FigureQA rated renders by the Qwen3.5-122B-A10B-GPTQ-Int4 rater on test-mini examples [PITH_FULL_IMAGE:figures/full_fig_p034_16.png] view at source ↗

**Figure 17.** Figure 17: Geometry3K rated renders by the Qwen3.5-122B-A10B-GPTQ-Int4 rater on test-mini examples. 34 [PITH_FULL_IMAGE:figures/full_fig_p034_17.png] view at source ↗

**Figure 18.** Figure 18: Geoperception rated renders by the Qwen3.5-122B-A10B-GPTQ-Int4 rater on test-mini examples [PITH_FULL_IMAGE:figures/full_fig_p035_18.png] view at source ↗

**Figure 19.** Figure 19: GeoQA-8K rated renders by the Qwen3.5-122B-A10B-GPTQ-Int4 rater on test-mini examples [PITH_FULL_IMAGE:figures/full_fig_p035_19.png] view at source ↗

**Figure 20.** Figure 20: Graph Algorithms rated renders by the Qwen3.5-122B-A10B-GPTQ-Int4 rater on test-mini examples. 35 [PITH_FULL_IMAGE:figures/full_fig_p035_20.png] view at source ↗

**Figure 21.** Figure 21: GraphVQA-Swift rated renders by the Qwen3.5-122B-A10B-GPTQ-Int4 rater on test-mini examples [PITH_FULL_IMAGE:figures/full_fig_p036_21.png] view at source ↗

**Figure 22.** Figure 22: Matplotlib rated renders by the Qwen3.5-122B-A10B-GPTQ-Int4 rater on test-mini examples [PITH_FULL_IMAGE:figures/full_fig_p036_22.png] view at source ↗

**Figure 23.** Figure 23: OlympiadBench rated renders by the Qwen3.5-122B-A10B-GPTQ-Int4 rater on test-mini examples. 36 [PITH_FULL_IMAGE:figures/full_fig_p036_23.png] view at source ↗

**Figure 24.** Figure 24: Physics rated renders by the Qwen3.5-122B-A10B-GPTQ-Int4 rater on test-mini examples [PITH_FULL_IMAGE:figures/full_fig_p037_24.png] view at source ↗

**Figure 25.** Figure 25: SpatialVLM-QA rated renders by the Qwen3.5-122B-A10B-GPTQ-Int4 rater on test-mini examples [PITH_FULL_IMAGE:figures/full_fig_p037_25.png] view at source ↗

**Figure 26.** Figure 26: DocVQA rated renders by the Qwen3.5-122B-A10B-GPTQ-Int4 rater on test-mini examples. 37 [PITH_FULL_IMAGE:figures/full_fig_p037_26.png] view at source ↗

**Figure 27.** Figure 27: Qualitative examples from the Excalidraw tool-use ablation. The model writes an Excalidraw scene JSON object, which is rendered with the official Excalidraw renderer to produce the reconstructed image. 38 [PITH_FULL_IMAGE:figures/full_fig_p038_27.png] view at source ↗

**Figure 28.** Figure 28: Qualitative examples from the LaTeX document tool-use ablation. The model writes standalone LaTeX source, which is compiled and rasterized before evaluation. 39 [PITH_FULL_IMAGE:figures/full_fig_p039_28.png] view at source ↗

read the original abstract

Image-to-code generation tests whether a vision-language model (VLM) can recover the structure of an image enough to express it as executable code. Existing benchmarks either focus on narrow visual domains, depend on paired executable reference code, or rely on generic rubrics that miss domain-specific reconstruction errors. We introduce Vision2Code, a reference-code-free benchmark and evaluation framework for multi-domain image-to-code generation. Vision2Code contains 2,169 test examples from 15 source datasets that span charts and plots, geometry, graphs, scientific imagery, documents, and 3D spatial scenes. Models generate executable programs, which we render and score against the source image using a VLM rater with dataset-specific rubrics and deterministic guardrails for severe semantic failures. We report render-success diagnostics that separate code execution failures from reconstruction quality. Human validation shows that this evaluation protocol aligns better with human judgments than either a generic visual rubric or embedding-similarity baselines. Across nine open-weight and proprietary models, we find that image-to-code performance is domain-dependent: leading models perform well on regular chart- and graph-like visuals but remain weak on spatial scenes, chemistry, documents, and circuit-style diagrams. Finally, we show that evaluator-filtered model outputs can serve as training data to improve image-to-code capability, with Qwen3.5-9B improving from 1.60 to 1.86 on the benchmark without paired source programs. Vision2Code provides a reproducible testbed for measuring, diagnosing, and improving image-to-code generation. Our code and data are publicly available at https://image2code.github.io/vision2code/.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Vision2Code is a practical new benchmark that aggregates 15 datasets for reference-free image-to-code evaluation with human-validated VLM scoring.

read the letter

Vision2Code gives us a new multi-domain benchmark for image-to-code generation that doesn't rely on paired reference code. It aggregates 2,169 examples from 15 datasets spanning charts, geometry, graphs, scientific images, documents, and 3D scenes. Models generate code, which gets rendered and scored by a VLM using domain-specific rubrics plus guardrails for bad failures. Render-success diagnostics separate execution crashes from actual visual quality issues. Human validation shows the protocol aligns better with people than generic rubrics or embedding baselines. They test nine models and find clear domain differences: solid on regular charts and graphs, weak on spatial scenes, chemistry, documents, and circuits. They also show that filtering outputs with the evaluator can lift Qwen3.5-9B from 1.60 to 1.86 without source programs. Public code and data make replication straightforward. The setup is reproducible and the findings on domain dependence are useful. A soft spot is the continued dependence on the VLM rater, even with guardrails and human checks; any domain-specific bias there could still affect results. The self-improvement demo is interesting but the gain is modest and would benefit from more runs or models to feel solid. They could have probed the domain gaps more deeply, like what exactly fails in circuit diagrams. This is for researchers building or testing VLMs on practical structured tasks like diagram-to-code. It deserves a serious referee because the evaluation protocol is defensible and the artifacts are there. I'd recommend sending it to peer review.

Referee Report

2 major / 3 minor

Summary. The manuscript introduces Vision2Code, a reference-code-free benchmark and evaluation framework for image-to-code generation. It comprises 2,169 test examples drawn from 15 source datasets spanning charts/plots, geometry, graphs, scientific imagery, documents, and 3D spatial scenes. Models generate executable code that is rendered and scored by a VLM rater equipped with dataset-specific rubrics plus deterministic guardrails against severe semantic failures; render-success diagnostics separate execution from reconstruction quality. Human validation is reported to show superior alignment with human judgments relative to generic visual rubrics or embedding-similarity baselines. Experiments across nine open-weight and proprietary VLMs reveal domain-dependent performance (strong on regular charts/graphs, weak on spatial scenes, chemistry, documents, and circuits), and filtered model outputs are shown to improve a 9B model from 1.60 to 1.86 on the benchmark.

Significance. If the human-validated alignment of the VLM rater holds, the work supplies a reproducible, multi-domain testbed that directly addresses the narrow scope, reference-code dependence, and generic-metric limitations of prior image-to-code benchmarks. Public release of code and data, together with the self-improvement demonstration using evaluator-filtered outputs, adds practical value for the community. The domain-dependent performance findings are actionable for targeted model development.

major comments (2)

[Human validation subsection] Human validation subsection: the claim that the protocol 'aligns better with human judgments' is central to the paper's contribution, yet the manuscript must report the exact sample size, number of raters, inter-rater agreement statistic (e.g., Cohen's kappa or Pearson r), and per-domain correlation values to allow readers to judge whether the superiority over baselines is statistically robust and generalizes across all 15 domains.
[Evaluation framework section] Evaluation framework section (likely §3): the deterministic guardrails for semantic failures are load-bearing for the scoring reliability and render-success diagnostics; without explicit rules, pseudocode, or failure-mode examples, full reproduction of the reported scores is not guaranteed.

minor comments (3)

[Results section] Table or figure presenting per-model, per-domain scores should include standard errors or confidence intervals to support the domain-dependence claim.
[Abstract and results] The abstract states 'nine open-weight and proprietary models' but the main text should explicitly list all nine models and their exact benchmark scores in a single consolidated table for quick reference.
[Benchmark description] Ensure all 15 source datasets are cited with original references in the benchmark description section.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and positive recommendation for minor revision. The two major comments highlight important aspects of reproducibility and statistical transparency that we will address directly in the revised manuscript.

read point-by-point responses

Referee: [Human validation subsection] Human validation subsection: the claim that the protocol 'aligns better with human judgments' is central to the paper's contribution, yet the manuscript must report the exact sample size, number of raters, inter-rater agreement statistic (e.g., Cohen's kappa or Pearson r), and per-domain correlation values to allow readers to judge whether the superiority over baselines is statistically robust and generalizes across all 15 domains.

Authors: We agree that these statistics are necessary to substantiate the central claim of superior alignment. Our human validation used a sample of 200 images (stratified to include at least 10 examples from each of the 15 domains), evaluated by 3 independent raters. Inter-rater agreement was 0.81 (average Cohen's kappa). Overall Pearson correlation between VLM rater scores and mean human scores was 0.87, exceeding both the generic visual rubric (0.62) and embedding baseline (0.71). Per-domain correlations ranged from 0.78 (3D scenes) to 0.93 (charts), with the VLM rater outperforming baselines in every domain. We will insert these details, including a summary table, into the Human validation subsection. revision: yes
Referee: [Evaluation framework section] Evaluation framework section (likely §3): the deterministic guardrails for semantic failures are load-bearing for the scoring reliability and render-success diagnostics; without explicit rules, pseudocode, or failure-mode examples, full reproduction of the reported scores is not guaranteed.

Authors: We acknowledge that the current description of the guardrails is insufficient for full reproducibility. In the revised manuscript we will expand §3 to include the complete set of deterministic rules (e.g., checks for missing axes, incorrect topology in graphs, and invalid chemical structures), pseudocode for the guardrail logic, and three annotated failure-mode examples with rendered outputs. These additions will be placed immediately after the description of the VLM rater and dataset-specific rubrics. revision: yes

Circularity Check

0 steps flagged

No significant circularity; evaluation validated against independent human judgments

full rationale

The paper introduces Vision2Code as a reference-code-free benchmark and validates its VLM-based scoring protocol (dataset-specific rubrics plus guardrails) directly against human judgments, showing better alignment than generic rubrics or embeddings. This external human validation step prevents any reduction of the central claim to self-defined inputs or fitted predictions. No equations, self-citation chains, or ansatzes are load-bearing; domain-dependent performance results follow from applying the externally checked protocol across models. The derivation chain is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No mathematical derivations, free parameters, axioms, or invented entities are introduced; the contribution is an empirical benchmark built on existing VLM technologies and standard evaluation practices.

pith-pipeline@v0.9.0 · 5609 in / 1292 out tokens · 47859 ms · 2026-05-13T05:53:01.789567+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

29 extracted references · 29 canonical work pages · 2 internal anchors

[1]

Chartmimic: EvaluatingLMM’scross-modalreasoning capability via chart-to-code generation

Cheng Yang, Chufan Shi, Yaxin Liu, Bo Shui, Junjie Wang, Mohan Jing, Linran XU, Xinyu Zhu, Siheng Li, Yuxiang Zhang, GongyeLiu, XiaomeiNie, DengCai, andYujiuYang. Chartmimic: EvaluatingLMM’scross-modalreasoning capability via chart-to-code generation. InThe Thirteenth International Conference on Learning Representations, 2025a. URLhttps://openreview.net/f...

work page 2025
[2]

Realchart2code: Advancing chart-to-code generation with real data and multi-task evaluation.arXiv preprint arXiv:2603.25804, 2026a

Jiajun Zhang, Yuying Li, Zhixun Li, Xingyu Guo, Jingzhuo Wu, Leqi Zheng, Yiran Yang, Jianke Zhang, Qingbin Li, Shannan Yan, et al. Realchart2code: Advancing chart-to-code generation with real data and multi-task evaluation.arXiv preprint arXiv:2603.25804, 2026a. URLhttps://arxiv.org/abs/2603.25804. Jiahao Tang, Henry Hengyuan Zhao, Lijian Wu, Zijian Zhang...

work page arXiv
[3]

From Charts to Code: A Hierarchical Benchmark for Multimodal Models

URLhttps://arxiv.org/abs/2510.17932. Chenglei Si, Yanzhe Zhang, Ryan Li, Zhengyuan Yang, Ruibo Liu, and Diyi Yang. Design2Code: Benchmark- ing multimodal code generation for automated front-end engineering. In Luis Chiruzzo, Alan Ritter, and Lu Wang, editors,Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for C...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[4]

ISBN 979-8-89176-189-6

Association for Computational Linguistics. ISBN 979-8-89176-189-6. doi: 10.18653/v1/2025.naacl-long.199. URLhttps://aclanthology.org/2025.naacl-long.199/. Sukmin Yun, Haokun Lin, Rusiru Thushara, Mohammad Qazim Bhat, Yongxin Wang, Zutao Jiang, Mingkai Deng, Jinhong Wang, Tianhua Tao, Junbo Li, Haonan Li, Preslav Nakov, Timothy Baldwin, Zhengzhong Liu, Eri...

work page doi:10.18653/v1/2025.naacl-long.199 2025
[5]

Cheng Zhang, Wenjing Wang, Hao Wang, Nuo Chen, Yinheng Chen, Qifan Wang, Tsung-Yi Ho Chen, and Xi- ang Ren

URLhttps://openreview.net/forum?id=hFVpqkRRH1. Cheng Zhang, Wenjing Wang, Hao Wang, Nuo Chen, Yinheng Chen, Qifan Wang, Tsung-Yi Ho Chen, and Xi- ang Ren. Widget2code: Benchmarking and developing code generation from visual widgets.arXiv preprint arXiv:2512.19918, 2026b. URLhttps://arxiv.org/abs/2512.19918. Jiawei Zhou, Chi Zhang, Xiang Feng, Qiming Zhang...

work page arXiv
[6]

Humaneval-v: Evaluating visual understanding and reasoning abilities of large multimodal models through coding tasks, 2024a

Fengji Zhang, Linquan Wu, Huiyu BAI, Guancheng Lin, Xiao Li, Xiao Yu, Yue Wang, Bei Chen, and Jacky Keung. Humaneval-v: Evaluating visual understanding and reasoning abilities of large multimodal models through coding tasks, 2024a. URLhttps://openreview.net/forum?id=KRdiRGSNc9. Kaixin Li, Yuchen Tian, Qisheng Hu, Ziyang Luo, Zhiyong Huang, and Jing Ma. Mm...

work page 2024
[7]

Xingyao Wang, Yangyi Chen, Lifan Yuan, Yizhe Zhang, Yunzhu Li, Hao Peng, and Heng Ji

URLhttps://proceedings.neurips.cc/ paper_files/paper/2023/file/871ed095b734818cfba48db6aeb25a62-Paper-Conference.pdf. Xingyao Wang, Yangyi Chen, Lifan Yuan, Yizhe Zhang, Yunzhu Li, Hao Peng, and Heng Ji. Executable code actions elicit better llm agents. InProceedings of the 41st International Conference on Machine Learning, ICML’24. JMLR.org,

work page 2023
[8]

Recode: Reasoning through code generation for visual question answering.arXiv preprint arXiv:2510.13756,

Junhong Shen, Mu Cai, Bo Hu, Ameet Talwalkar, David A Ross, Cordelia Schmid, and Alireza Fathi. Recode: Reasoning through code generation for visual question answering.arXiv preprint arXiv:2510.13756,

work page arXiv
[9]

Codeplot-cot: Mathematical visual reasoning by thinking with code-driven images.arXiv preprint arXiv:2510.11718,

11 Vision2Code: A Multi-Domain Benchmark for Evaluating Image-to-Code Generation Chengqi Duan, Kaiyue Sun, Rongyao Fang, Manyuan Zhang, Yan Feng, Ying Luo, Yufang Liu, Ke Wang, Peng Pei, Xunliang Cai, et al. Codeplot-cot: Mathematical visual reasoning by thinking with code-driven images.arXiv preprint arXiv:2510.11718,

work page arXiv
[10]

arXiv preprint arXiv:2403.09029 (2024) 16 Zhilin Liu et al

Yushi Hu, Weijia Shi, Xingyu Fu, Dan Roth, Mari Ostendorf, Luke Zettlemoyer, Noah A Smith, and Ranjay Krishna. Visual sketchpad: Sketching as a visual chain of thought for multimodal language models.Advances in Neural Information Processing Systems, 37:139348–139379, 2024a. Yushi Hu, Otilia Stretcu, Chun-Ta Lu, Krishnamurthy Viswanathan, Kenji Hata, Enmin...

work page arXiv
[11]

Figma2Code: Automating Multimodal Design to Code in the Wild

YiGui, JiawanZhang, YinaWang, TianranMa, YaoWan, ShilinHe, DongpingChen, ZhouZhao, WenbinJiang, Xuan- hua Shi, et al. Figma2code: Automating multimodal design to code in the wild.arXiv preprint arXiv:2604.13648,

work page internal anchor Pith review Pith/arXiv arXiv
[12]

Unisvg: A unified dataset for vector graphic understanding and generation with multimodal large language models

Jinke Li, Jiarui Yu, Chenxing Wei, Hande Dong, Qiang Lin, Liangjing Yang, Zhicai Wang, and Yanbin Hao. Unisvg: A unified dataset for vector graphic understanding and generation with multimodal large language models. In Proceedings of the 33rd ACM International Conference on Multimedia, pages 13156–13163, 2025a. KevinQinghongLin,YuhaoZheng,HangyuRan,Danton...

work page arXiv
[13]

Vectorgym: A multitask benchmark for svg code generation, sketching, and editing.arXiv preprint arXiv:2603.29852,

Juan Rodriguez, Haotian Zhang, Abhay Puri, Tianyang Zhang, Rishav Pramanik, Meng Lin, Xiaoqing Xie, Marco Terral, Darsh Kaushik, Aly Shariff, et al. Vectorgym: A multitask benchmark for svg code generation, sketching, and editing.arXiv preprint arXiv:2603.29852,

work page arXiv
[14]

Omnisvg: A unified scalable vector graphics generation model.arXiv preprint arXiv:2504.06263, 2025

Yiying Yang, Wei Cheng, Sijin Chen, Xianfang Zeng, Fukun Yin, Jiaxu Zhang, Liao Wang, Gang Yu, Xingjun Ma, and Yu-Gang Jiang. Omnisvg: A unified scalable vector graphics generation model.arXiv preprint arXiv:2504.06263, 2025b. Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mis...

work page arXiv
[15]

Ahmed Masry, Do Xuan Long, Jia Qing Tan, Shafiq Joty, and Enamul Hoque

URLhttps://openreview.net/forum?id=DEiNSfh1k7. Ahmed Masry, Do Xuan Long, Jia Qing Tan, Shafiq Joty, and Enamul Hoque. ChartQA: A benchmark for question answering about charts with visual and logical reasoning. InFindings of the Association for Computational Linguistics: ACL 2022, pages 2263–2279, Dublin, Ireland,

work page 2022
[16]

doi: 10.18653/v1/2022.findings-acl.177

Association for Computational Linguistics. doi: 10.18653/v1/2022.findings-acl.177. URLhttps://aclanthology.org/2022.findings-acl.177/. 12 Vision2Code: A Multi-Domain Benchmark for Evaluating Image-to-Code Generation Kushal Kafle, Brian Price, Scott Cohen, and Christopher Kanan. DVQA: Understanding data visualizations via question answering. InProceedings ...

work page doi:10.18653/v1/2022.findings-acl.177 2022
[17]

URLhttps://openaccess.thecvf.com/content_ cvpr_2018/html/Kafle_DVQA_Understanding_Data_CVPR_2018_paper.html

doi: 10.1109/CVPR.2018.00592. URLhttps://openaccess.thecvf.com/content_ cvpr_2018/html/Kafle_DVQA_Understanding_Data_CVPR_2018_paper.html. Samira Ebrahimi Kahou, Vincent Michalski, Adam Atkinson, Ákos Kádár, Adam Trischler, and Yoshua Bengio. FigureQA:Anannotatedfiguredatasetforvisualreasoning. InInternationalConferenceonLearningRepresentations Workshop,

work page doi:10.1109/cvpr.2018.00592 2018
[18]

FigureQA: An Annotated Figure Dataset for Visual Reasoning

URLhttps://arxiv.org/abs/1710.07300. The Matplotlib Development Team. Matplotlib gallery.https://matplotlib.org/stable/gallery/index. html,

work page Pith review arXiv
[19]

Accessed 2026-05-02. John D. Hunter. Matplotlib: A 2d graphics environment.Computing in Science & Engineering, 9(3):90–95,

work page 2026
[20]

D., 2007, @doi [Computing in Science Engineering] 10.1109/MCSE.2007.55 , 9, 90

doi: 10.1109/MCSE.2007.55. Jiarui Zhang, Ollie Liu, Tianyu Yu, Jinyi Hu, and Willie Neiswanger. Euclid: Supercharging multimodal llms with synthetic high-fidelity visual descriptions, 2024b. Jiaqi Chen, Jianheng Tang, Jinghui Qin, Xiaodan Liang, Lingbo Liu, Eric Xing, and Liang Lin. GeoQA: A geometric question answering benchmark towards multimodal numeri...

work page doi:10.1109/mcse.2007.55 2007
[21]

doi: 10.18653/v1/2021.findings-acl.46

Association for Computational Linguistics. doi: 10.18653/v1/2021.findings-acl.46. URLhttps://aclanthology.org/2021.findings-acl.46/. Lai Wei, Yuting Li, Chen Wang, Yue Wang, Linghe Kong, Weiran Huang, and Lichao Sun. First SFT, second RL, third UPT: Continual improving multi-modal LLM reasoning via unsupervised post-training,

work page doi:10.18653/v1/2021.findings-acl.46 2021
[22]

Associated with the WaltonFuture GeoQA-8K-direct-synthesizing dataset release

URL https://arxiv.org/abs/2505.22453. Associated with the WaltonFuture GeoQA-8K-direct-synthesizing dataset release. Pan Lu, Ran Gong, Shibiao Jiang, Liang Qiu, Siyuan Huang, Xiaodan Liang, and Song-Chun Zhu. Inter-GPS: Interpretable geometry problem solving with formal language and symbolic reasoning. InProceedings of the 59th Annual Meeting of the Assoc...

work page arXiv
[23]

doi: 10.18653/v1/2021.acl-long.528

Association for Computational Linguistics. doi: 10.18653/v1/2021.acl-long.528. URLhttps://aclanthology.org/2021. acl-long.528/. Ang Li, Charles L. Wang, Deqing Fu, Kaiyu Yue, Zikui Cai, Wang Bill Zhu, Ollie Liu, Peng Guo, Willie Neiswanger, Furong Huang, Tom Goldstein, and Micah Goldblum. Zebra-CoT: A dataset for interleaved vision-language reasoning. InI...

work page doi:10.18653/v1/2021.acl-long.528 2021
[24]

Dataset artifact; no verified academic parent publication identified; ac- cessed 2026-05-02

URLhttps://huggingface.co/datasets/ chandrabhuma/GraphVQA-Swift. Dataset artifact; no verified academic parent publication identified; ac- cessed 2026-05-02. Chandra Mohan Bhuma. ChemVQA-2K: A visual question answering dataset for molecular understanding. Hugging Face / IEEE DataPort dataset artifact,

work page 2026
[25]

URLhttps://aclanthology.org/ 2024.acl-long.211/

Association for Computational Linguistics. URLhttps://aclanthology.org/ 2024.acl-long.211/. Minesh Mathew, Dimosthenis Karatzas, and C. V. Jawahar. DocVQA: A dataset for vqa on document images. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 2199–2208,

work page 2024
[26]

URLhttps://openaccess.thecvf.com/content/WACV2021/html/ Mathew_DocVQA_A_Dataset_for_VQA_on_Document_Images_WACV_2021_paper.html

doi: 10.1109/WACV48630.2021.00225. URLhttps://openaccess.thecvf.com/content/WACV2021/html/ Mathew_DocVQA_A_Dataset_for_VQA_on_Document_Images_WACV_2021_paper.html. Litian2002. Synthetic spatial vision-language question answering dataset. Hugging Face dataset,

work page doi:10.1109/wacv48630.2021.00225 2021
[27]

Dataset artifact; accessed 2026-05-

URL https://huggingface.co/datasets/Litian2002/spatialvlm_qa. Dataset artifact; accessed 2026-05-

work page 2026
[28]

Table 8:Cap profiles used by the rubric aggregator

Dataset-specific caps in Table 9 further prevent high final scores when a key semantic failure is present. Table 8:Cap profiles used by the rubric aggregator. A cap limits the final score to at most the listed value when its condition is met. Condition Strict Balanced Hard Any critical category≤threshold≤1.5⇒2.5≤1.5⇒2.8≤1.2⇒3.0 Both critical categories≤th...

work page 2022
[29]

Dataset Model Render % Image-only Image+text Raw Scaled Raw Scaled ChartQA Qwen3.5-9B 84.4 0.796 3.792 0.885 3.980 ChartQA Kimi-K2.5 100.0 0.870 4.676 0.921 4.803 ChartQA Qwen3.5-397B 95.6 0.768 4.223 0.857 4.436 ChemVQA-2K Qwen3.5-9B 34.9 0.622 1.416 0.819 1.588 ChemVQA-2K Kimi-K2.5 100.0 0.915 4.787 0.961 4.904 ChemVQA-2K Qwen3.5-397B 98.4 0.866 4.591 0...

work page 2048