
arxiv: 2510.17932 · v4 · submitted 2025-10-20 · 💻 cs.SE · cs.AI

From Charts to Code: A Hierarchical Benchmark for Multimodal Models

Pith reviewed 2026-05-18 06:04 UTC · model grok-4.3

classification 💻 cs.SE cs.AI
keywords: chart to code, multimodal models, benchmark, code generation, chart understanding, large language models, visual fidelity, hierarchical evaluation

The pith

Even the strongest multimodal model tested, GPT-5, averages only 0.57 on code-based evaluation and 0.22 on chart quality when editing charts through code.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Chart2Code, a three-level benchmark designed from a user-driven perspective to test how well large multimodal models (LMMs) understand charts and produce the matching code. Level 1 asks models to reproduce a chart from a reference image and a user query, Level 2 requires complex edits such as changing the chart type or adding elements, and Level 3 asks models to create charts from long, dense tables according to instructions. The benchmark supplies 2,023 tasks spanning 22 chart types, together with metrics that score both the generated code and the visual quality of the rendered result. Of the 25 leading models tested, even GPT-5 reached only 0.57 on code-based evaluation and 0.22 on chart quality for the editing tasks. Readers should care because these gaps directly affect everyday tools that rely on accurate chart generation from visual or tabular data.

Core claim

Chart2Code is the first hierarchical benchmark that reflects practical chart-to-code usage by scaling task complexity across three levels: chart reproduction from a reference figure, complex chart editing, and long-table to chart generation. It comprises 2,023 tasks across 22 chart types and supplies multi-level metrics that assess both code correctness and the visual fidelity of rendered charts. Benchmarking 25 state-of-the-art LMMs shows that even the strongest model, GPT-5, averages only 0.57 on code-based evaluation and 0.22 on chart-quality assessment across the editing tasks, which demonstrates how difficult these real-world scenarios remain.
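To make the three-level structure concrete, here is a minimal sketch of how a single task might be represented. The class and field names (`Level`, `ChartTask`, `reference_image`, `table_csv`) are hypothetical and chosen for illustration; the paper's actual data schema is not described in this summary.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Level(Enum):
    REPRODUCTION = 1  # Level 1: reference figure plus user query
    EDITING = 2       # Level 2: base chart plus edit instruction
    LONG_TABLE = 3    # Level 3: long table plus instruction


@dataclass
class ChartTask:
    # All field names are hypothetical, not taken from the benchmark.
    level: Level
    instruction: str
    reference_image: Optional[str] = None  # path to a chart image, if any
    table_csv: Optional[str] = None        # raw table text, used at Level 3

    def validate(self) -> None:
        """Check that the inputs each level depends on are present."""
        needs_image = self.level in (Level.REPRODUCTION, Level.EDITING)
        if needs_image and not self.reference_image:
            raise ValueError(f"{self.level.name} task needs a reference image")
        if self.level is Level.LONG_TABLE and not self.table_csv:
            raise ValueError("LONG_TABLE task needs table data")
```

Under these assumptions, a Level 2 task carries an image and an edit instruction, while a Level 3 task without table data is rejected by `validate()`.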

What carries the argument

The Chart2Code benchmark organized as a three-level hierarchy of tasks with multi-level metrics that separately score code correctness and the visual quality of rendered charts.
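Scoring code correctness and visual quality separately means the two numbers are never collapsed into one. The sketch below shows that idea under stated assumptions: the names `TaskScore` and `summarize` are hypothetical, both scores are assumed to lie in [0, 1], and plain averaging is an illustrative choice rather than the paper's documented aggregation.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Dict, Sequence


@dataclass(frozen=True)
class TaskScore:
    code_score: float   # code-based evaluation, assumed to lie in [0, 1]
    chart_score: float  # chart-quality evaluation, assumed to lie in [0, 1]


def summarize(scores: Sequence[TaskScore]) -> Dict[str, float]:
    """Average each metric independently so neither one hides the other."""
    if not scores:
        raise ValueError("need at least one task score")
    return {
        "code": mean(s.code_score for s in scores),
        "chart": mean(s.chart_score for s in scores),
    }
```

Keeping the two averages apart is what lets a result such as 0.57 on code and 0.22 on chart quality be reported at all; a single blended score would mask the gap.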

If this is right

  • Models must develop stronger multimodal reasoning to handle chart modifications without losing visual accuracy.
  • The progressive levels allow systematic tracking of progress on increasingly complex chart-to-code workflows.
  • Emphasis on visual fidelity metrics pushes development toward outputs that match user intent in data presentation.
  • The benchmark supports training and evaluation of more general-purpose multimodal models for visualization tasks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The low scores on table-to-chart generation point to a possible need for better integration of dense textual data with visual output generation.
  • Extending the benchmark with interactive follow-up edits could reveal whether models maintain consistency across sequential changes.
  • Specialized fine-tuning on paired chart-image and code examples might close the performance gap shown by current models.

Load-bearing premise

The chosen tasks and metrics accurately reflect the practical real-world scenarios in which users convert charts to code.

What would settle it

A model that consistently scores above 0.85 on both code correctness and chart-quality metrics across all three levels would indicate that the reported difficulty is no longer representative.
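The settling condition can be written as a simple check. This is a sketch only: the function name `settles_it` and the per-level dictionary layout are assumptions, and the only values taken from the source are the 0.85 threshold and GPT-5's 0.57 / 0.22 on the editing tasks.

```python
from typing import Dict, Tuple

THRESHOLD = 0.85  # threshold named in the text above


def settles_it(per_level: Dict[int, Tuple[float, float]],
               threshold: float = THRESHOLD) -> bool:
    """Return True only if both scores exceed the threshold at every level.

    per_level maps a level number (1, 2, 3) to (code_score, chart_score).
    """
    if set(per_level) != {1, 2, 3}:
        raise ValueError("scores are required for exactly levels 1, 2 and 3")
    return all(code > threshold and chart > threshold
               for code, chart in per_level.values())
```

For example, a model with strong Level 1 and Level 3 scores but GPT-5's reported editing scores of (0.57, 0.22) at Level 2 would not pass, because the check requires every level to clear the bar on both metrics.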

Figures

Figures reproduced from arXiv: 2510.17932 by Alex Jinpeng Wang, Dongxing Mao, Henry Hengyuan Zhao, Jiahao Tang, Jingru Tan, Lijian Wu, Min Li, Min Zeng, Yang Wan, Yifei Tao, Zijian Zhang.

Figure 1: Chart2Code covers three progressively challenging levels: reproduction, editing, and long-table to chart generation. It provides a user-driven and diverse benchmark that better reflects real-world chart2code demands.
Figure 2: Collected charts distribution.
  By level: Level 1 (Chart Reproduction) 44.56%, Level 2 (Chart Editing) 47.70%, Level 3 (LT2Chart) 7.74%.
  By chart type: Bar 13.5%, Combination 9.6%, 3d 8.6%, Radar 7.5%, Heatmap 6.7%, Line 6.7%, Scatter 6.3%, Violin 4.9%, Contour 4.6%, Graph 4.6%, Errorbar 4.0%, Pie 3.6%, Quiver 3.3%, Box 2.9%, Errorpoint 2.4%, Hist 2.4%, Area 2.1%, Hr 2.1%, Density 1.9%, Tree 1.0%, Multid…
Figure 3: Correlation of the model performance (i.e., LMM-score) on different manually annotated …
Figure 4: Left: Both proprietary and open-source models generalize well on Level 1 and Level 2 …
Figure 5: Timestamp distribution of chart sources from arXiv preprints.
Figure 6: Analysis of model performance on different task cases with LLM-score and LMM-score.
Figure 7: Human vs model performance: LLM-score and LMM-score across level 1 direct tasks.
Figure 8: Human vs model performance: LLM-score and LMM-score across level 1 customize tasks.
Figure 9: Human vs model performance: LLM-score and LMM-score across level 1 figure tasks.
Figure 10: Human vs model performance: LLM-score and LMM-score across level 2 tasks.
Figure 11: Human vs model performance: LLM-score and LMM-score across level 3 tasks.
Figure 12: Selected charts from Chart2Code.
Original abstract

We introduce Chart2Code, a new benchmark for evaluating the chart understanding and code generation capabilities of large multimodal models (LMMs). Chart2Code is explicitly designed from a user-driven perspective, capturing diverse real-world scenarios and progressively increasing task difficulty. It consists of three levels: Level 1 (Chart Reproduction) reproduces charts from a reference figure and user query; Level 2 (Chart Editing) involves complex modifications such as changing chart types or adding elements; and Level 3 (Long-Table to Chart Generation) requires models to transform long, information-dense tables into faithful charts following user instructions. To our knowledge, this is the first hierarchical benchmark that reflects practical chart2code usage while systematically scaling task complexity. In total, Chart2Code contains 2,023 tasks across 22 chart types, paired with multi-level evaluation metrics that assess both code correctness and the visual fidelity of rendered charts. We benchmark 25 state-of-the-art (SoTA) LMMs, including both proprietary and the latest open-source models such as GPT-5, Qwen2.5-VL, InternVL3/3.5, MiMo-VL, and Seed-1.6-VL. Experimental results demonstrate that even the SoTA model GPT-5 averages only 0.57 on code-based evaluation and 0.22 on chart-quality assessment across the editing tasks, underscoring the difficulty of Chart2Code. We anticipate this benchmark will drive advances in multimodal reasoning and foster the development of more robust and general-purpose LMMs. Our code and data are available on Chart2Code.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces Chart2Code, a hierarchical benchmark for multimodal models on chart-to-code tasks. It comprises three levels of increasing difficulty—Level 1 (Chart Reproduction), Level 2 (Chart Editing), and Level 3 (Long-Table to Chart Generation)—totaling 2,023 tasks across 22 chart types, constructed from a user-driven perspective. The authors evaluate 25 LMMs (including GPT-5, Qwen2.5-VL, and others) using multi-level metrics for code correctness and visual fidelity of rendered charts, reporting that even GPT-5 achieves only 0.57 on code-based evaluation and 0.22 on chart-quality assessment for editing tasks.

Significance. If the benchmark construction, task validity, and metric reliability hold, the work would be a useful contribution as the first explicitly hierarchical, user-driven benchmark for chart understanding and code generation. The public availability of code and data, the scale (2,023 tasks), and the concrete performance numbers across 25 models (both proprietary and open-source) provide a reproducible starting point for tracking progress in multimodal reasoning.

major comments (2)
  1. [Evaluation Metrics (likely §4)] The chart-quality metric procedure is not described. The abstract states that the multi-level metrics assess 'visual fidelity of rendered charts,' yet the manuscript provides no details on the evaluation method (human rating, automated perceptual similarity such as LPIPS/CLIP, or execution-based comparison after rendering). This directly affects interpretability of the headline result that GPT-5 averages 0.22 on chart-quality for editing tasks; without inter-rater agreement statistics (if human) or correlation with human judgments (if automated), it is unclear whether low scores reflect model limitations or metric noise.
  2. [Benchmark Construction (likely §3)] Details on task construction, validation, and potential biases are insufficient to support the claim that the benchmark 'accurately capture[s] practical real-world chart-to-code usage scenarios.' The abstract mentions a 'user-driven design perspective' and 'progressively increasing task difficulty,' but the manuscript does not report how tasks were sourced or validated (e.g., pilot studies, expert review, or checks for annotation artifacts in the 2,023 tasks). This is load-bearing for the central difficulty claim.
minor comments (1)
  1. [Abstract and §3] The breakdown of the 2,023 tasks across the three levels and 22 chart types is not provided in the abstract or early sections; adding a table or explicit counts would improve clarity.
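As a rough illustration of the breakdown the minor comment asks for, the level shares shown in Figure 2 can be applied to the 2,023-task total. This is an estimate under a labeled assumption: Figure 2 is captioned as a distribution of collected charts, and treating those shares as task shares is our assumption, not something the paper states here. The function name `estimate_counts` is hypothetical.

```python
from typing import Dict

TOTAL_TASKS = 2023  # total reported in the abstract

# Level shares in percent, as read from Figure 2.
# Assumption: these chart shares also describe the task split.
LEVEL_SHARES = {
    "Level 1 (Chart Reproduction)": 44.56,
    "Level 2 (Chart Editing)": 47.70,
    "Level 3 (LT2Chart)": 7.74,
}


def estimate_counts(total: int, shares_percent: Dict[str, float]) -> Dict[str, int]:
    """Convert percentage shares into rounded per-level counts."""
    return {name: round(total * pct / 100)
            for name, pct in shares_percent.items()}


counts = estimate_counts(TOTAL_TASKS, LEVEL_SHARES)
```

Under that assumption the estimate comes to about 901, 965, and 157 tasks for Levels 1, 2, and 3, which sums to 2,023. Exact counts would still need to come from the authors.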

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. We address each major comment point by point below. Where the comments identify areas needing greater clarity, we have revised the manuscript accordingly.

Point-by-point responses
  1. Referee: [Evaluation Metrics (likely §4)] The chart-quality metric procedure is not described. The abstract states that the multi-level metrics assess 'visual fidelity of rendered charts,' yet the manuscript provides no details on the evaluation method (human rating, automated perceptual similarity such as LPIPS/CLIP, or execution-based comparison after rendering). This directly affects interpretability of the headline result that GPT-5 averages 0.22 on chart-quality for editing tasks; without inter-rater agreement statistics (if human) or correlation with human judgments (if automated), it is unclear whether low scores reflect model limitations or metric noise.

    Authors: We thank the referee for highlighting this important point. We agree that the description of the chart-quality assessment procedure in the original manuscript was insufficiently detailed, which limits interpretability of results such as the 0.22 score. The evaluation combines automated perceptual similarity measures on rendered outputs with human ratings on a sampled subset. We have revised §4 to provide a complete description of the procedure, including the specific metrics employed, the human evaluation protocol, inter-rater agreement statistics, and correlation with human judgments. This change directly addresses the concern. revision: yes

  2. Referee: [Benchmark Construction (likely §3)] Details on task construction, validation, and potential biases are insufficient to support the claim that the benchmark 'accurately capture[s] practical real-world chart-to-code usage scenarios.' The abstract mentions a 'user-driven design perspective' and 'progressively increasing task difficulty,' but the manuscript does not report how tasks were sourced or validated (e.g., pilot studies, expert review, or checks for annotation artifacts in the 2,023 tasks). This is load-bearing for the central difficulty claim.

    Authors: We appreciate the referee's observation. While the manuscript emphasizes the user-driven perspective and hierarchical difficulty, we agree that explicit details on sourcing, validation procedures, and bias considerations were not sufficiently elaborated. We have expanded §3 with a dedicated subsection describing the task construction process, including sourcing from real-world chart examples, pilot studies, expert review steps, and measures taken to mitigate annotation artifacts and selection biases. These additions provide stronger support for the claim that the benchmark reflects practical usage scenarios. revision: yes
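The first response above promises inter-rater agreement statistics without naming one. As an illustration only, Cohen's kappa is a common choice for two raters assigning categorical labels; whether the authors use it is not stated, and the function name `cohens_kappa` and the toy labels in the usage note are assumptions.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(rater_a: Sequence[Hashable], rater_b: Sequence[Hashable]) -> float:
    """Cohen's kappa: agreement between two raters, corrected for chance."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(rater_a)
    # Fraction of items on which the two raters gave the same label.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Agreement expected by chance from each rater's label frequencies.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)
```

With toy labels, identical ratings such as `[1, 1, 0, 0]` from both raters give a kappa of 1.0, while `[1, 1, 0, 0]` against `[1, 0, 1, 0]` gives 0.0, meaning agreement no better than chance. A low kappa on the human-rated subset would support the referee's worry that the 0.22 chart-quality score partly reflects metric noise.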

Circularity Check

0 steps flagged

No significant circularity in benchmark construction or evaluation

full rationale

The paper introduces Chart2Code as a new hierarchical benchmark and reports direct empirical results from evaluating 25 external LMMs on its tasks and metrics. No derivations, equations, or first-principles predictions exist that reduce by construction to the paper's own inputs, fitted parameters, or self-citations. Central claims rest on measurements of model performance against the constructed dataset and multi-level metrics, which are externally falsifiable and independent of any internal loop. The work qualifies as self-contained against external benchmarks with no load-bearing self-referential steps.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper contributes a new evaluation dataset and metrics rather than new theory; it relies on the domain assumption that the tasks reflect user needs but introduces no free parameters or invented entities.

axioms (1)
  • domain assumption: The tasks in Chart2Code reflect diverse real-world chart-to-code usage scenarios.
    Stated directly in the abstract as 'explicitly designed from a user-driven perspective, capturing diverse real-world scenarios'.



Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Vision2Code: A Multi-Domain Benchmark for Evaluating Image-to-Code Generation

    cs.CV 2026-05 accept novelty 8.0

    Vision2Code is a multi-domain benchmark that evaluates image-to-code generation via rendered outputs scored by a VLM rater with dataset-specific rubrics, revealing domain-dependent model performance and enabling impro...

  2. Benchmarking and Evaluating VLMs for Software Architecture Diagram Understanding

    cs.SE 2026-04 accept novelty 7.0

    SADU benchmark shows top VLMs reach only 70% accuracy on software architecture diagram tasks, revealing gaps in visual reasoning for engineering artifacts.
