From Charts to Code: A Hierarchical Benchmark for Multimodal Models
Pith reviewed 2026-05-18 06:04 UTC · model grok-4.3
The pith
Current top multimodal models average only 0.57 on code correctness when turning charts into code and 0.22 on visual quality for edits.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Chart2Code is the first hierarchical benchmark that reflects practical chart-to-code usage by scaling task complexity across three levels: chart reproduction from a reference figure, complex chart editing, and long-table to chart generation. It comprises 2023 tasks across 22 chart types and supplies multi-level metrics that assess both code correctness and the visual fidelity of rendered charts. Benchmarking 25 state-of-the-art LMMs shows that even the strongest model, GPT-5, averages only 0.57 on code-based evaluation and 0.22 on chart-quality assessment across the editing tasks, thereby demonstrating the difficulty of these real-world scenarios.
What carries the argument
The Chart2Code benchmark organized as a three-level hierarchy of tasks with multi-level metrics that separately score code correctness and the visual quality of rendered charts.
If this is right
- Models must develop stronger multimodal reasoning to handle chart modifications without losing visual accuracy.
- The progressive levels allow systematic tracking of progress on increasingly complex chart-to-code workflows.
- Emphasis on visual fidelity metrics pushes development toward outputs that match user intent in data presentation.
- The benchmark supports training and evaluation of more general-purpose multimodal models for visualization tasks.
Where Pith is reading between the lines
- The low scores on table-to-chart generation point to a possible need for better integration of dense textual data with visual output generation.
- Extending the benchmark with interactive follow-up edits could reveal whether models maintain consistency across sequential changes.
- Specialized fine-tuning on paired chart-image and code examples might close the performance gap shown by current models.
Load-bearing premise
The chosen tasks and metrics accurately reflect the practical real-world scenarios in which users convert charts to code.
What would settle it
A model that consistently scores above 0.85 on both code correctness and chart-quality metrics across all three levels would indicate that the reported difficulty is no longer representative.
Figures
read the original abstract
We introduce Chart2Code, a new benchmark for evaluating the chart understanding and code generation capabilities of large multimodal models (LMMs). Chart2Code is explicitly designed from a user-driven perspective, capturing diverse real-world scenarios and progressively increasing task difficulty. It consists of three levels: Level 1 (Chart Reproduction) reproduces charts from a reference figure and user query; Level 2 (Chart Editing) involves complex modifications such as changing chart types or adding elements; and Level 3 (Long-Table to Chart Generation) requires models to transform long, information-dense tables into faithful charts following user instructions. To our knowledge, this is the first hierarchical benchmark that reflects practical chart2code usage while systematically scaling task complexity. In total, Chart2Code contains 2,023 tasks across 22 chart types, paired with multi-level evaluation metrics that assess both code correctness and the visual fidelity of rendered charts. We benchmark 25 state-of-the-art (SoTA) LMMs, including both proprietary and the latest open-source models such as GPT-5, Qwen2.5-VL, InternVL3/3.5, MiMo-VL, and Seed-1.6-VL. Experimental results demonstrate that even the SoTA model GPT-5 averages only 0.57 on code-based evaluation and 0.22 on chart-quality assessment across the editing tasks, underscoring the difficulty of Chart2Code. We anticipate this benchmark will drive advances in multimodal reasoning and foster the development of more robust and general-purpose LMMs. Our code and data are available on Chart2Code.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Chart2Code, a hierarchical benchmark for multimodal models on chart-to-code tasks. It comprises three levels of increasing difficulty—Level 1 (Chart Reproduction), Level 2 (Chart Editing), and Level 3 (Long-Table to Chart Generation)—totaling 2,023 tasks across 22 chart types, constructed from a user-driven perspective. The authors evaluate 25 LMMs (including GPT-5, Qwen2.5-VL, and others) using multi-level metrics for code correctness and visual fidelity of rendered charts, reporting that even GPT-5 achieves only 0.57 on code-based evaluation and 0.22 on chart-quality assessment for editing tasks.
Significance. If the benchmark construction, task validity, and metric reliability hold, the work would be a useful contribution as the first explicitly hierarchical, user-driven benchmark for chart understanding and code generation. The public availability of code and data, the scale (2,023 tasks), and the concrete performance numbers across 25 models (both proprietary and open-source) provide a reproducible starting point for tracking progress in multimodal reasoning.
major comments (2)
- [Evaluation Metrics (likely §4)] The chart-quality metric procedure is not described. The abstract states that the multi-level metrics assess 'visual fidelity of rendered charts,' yet the manuscript provides no details on the evaluation method (human rating, automated perceptual similarity such as LPIPS/CLIP, or execution-based comparison after rendering). This directly affects interpretability of the headline result that GPT-5 averages 0.22 on chart-quality for editing tasks; without inter-rater agreement statistics (if human) or correlation with human judgments (if automated), it is unclear whether low scores reflect model limitations or metric noise.
- [Benchmark Construction (likely §3)] Details on task construction, validation, and potential biases are insufficient to support the claim that the benchmark 'accurately capture[s] practical real-world chart-to-code usage scenarios.' The abstract mentions a 'user-driven design perspective' and 'progressively increasing task difficulty,' but the manuscript does not report how tasks were sourced or validated (e.g., pilot studies, expert review, or checks for annotation artifacts in the 2023 tasks). This is load-bearing for the central difficulty claim.
minor comments (1)
- [Abstract and §3] The breakdown of the 2,023 tasks across the three levels and 22 chart types is not provided in the abstract or early sections; adding a table or explicit counts would improve clarity.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback on our manuscript. We address each major comment point by point below. Where the comments identify areas needing greater clarity, we have revised the manuscript accordingly.
read point-by-point responses
-
Referee: [Evaluation Metrics (likely §4)] The chart-quality metric procedure is not described. The abstract states that the multi-level metrics assess 'visual fidelity of rendered charts,' yet the manuscript provides no details on the evaluation method (human rating, automated perceptual similarity such as LPIPS/CLIP, or execution-based comparison after rendering). This directly affects interpretability of the headline result that GPT-5 averages 0.22 on chart-quality for editing tasks; without inter-rater agreement statistics (if human) or correlation with human judgments (if automated), it is unclear whether low scores reflect model limitations or metric noise.
Authors: We thank the referee for highlighting this important point. We agree that the description of the chart-quality assessment procedure in the original manuscript was insufficiently detailed, which limits interpretability of results such as the 0.22 score. The evaluation combines automated perceptual similarity measures on rendered outputs with human ratings on a sampled subset. We have revised §4 to provide a complete description of the procedure, including the specific metrics employed, the human evaluation protocol, inter-rater agreement statistics, and correlation with human judgments. This change directly addresses the concern. revision: yes
-
Referee: [Benchmark Construction (likely §3)] Details on task construction, validation, and potential biases are insufficient to support the claim that the benchmark 'accurately capture[s] practical real-world chart-to-code usage scenarios.' The abstract mentions a 'user-driven design perspective' and 'progressively increasing task difficulty,' but the manuscript does not report how tasks were sourced or validated (e.g., pilot studies, expert review, or checks for annotation artifacts in the 2023 tasks). This is load-bearing for the central difficulty claim.
Authors: We appreciate the referee's observation. While the manuscript emphasizes the user-driven perspective and hierarchical difficulty, we agree that explicit details on sourcing, validation procedures, and bias considerations were not sufficiently elaborated. We have expanded §3 with a dedicated subsection describing the task construction process, including sourcing from real-world chart examples, pilot studies, expert review steps, and measures taken to mitigate annotation artifacts and selection biases. These additions provide stronger support for the claim that the benchmark reflects practical usage scenarios. revision: yes
Circularity Check
No significant circularity in benchmark construction or evaluation
full rationale
The paper introduces Chart2Code as a new hierarchical benchmark and reports direct empirical results from evaluating 25 external LMMs on its tasks and metrics. No derivations, equations, or first-principles predictions exist that reduce by construction to the paper's own inputs, fitted parameters, or self-citations. Central claims rest on measurements of model performance against the constructed dataset and multi-level metrics, which are externally falsifiable and independent of any internal loop. The work qualifies as self-contained against external benchmarks with no load-bearing self-referential steps.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption The tasks in Chart2Code reflect diverse real-world chart-to-code usage scenarios
Forward citations
Cited by 2 Pith papers
-
Vision2Code: A Multi-Domain Benchmark for Evaluating Image-to-Code Generation
Vision2Code is a multi-domain benchmark that evaluates image-to-code generation via rendered outputs scored by a VLM rater with dataset-specific rubrics, revealing domain-dependent model performance and enabling impro...
-
Benchmarking and Evaluating VLMs for Software Architecture Diagram Understanding
SADU benchmark shows top VLMs reach only 70% accuracy on software architecture diagram tasks, revealing gaps in visual reasoning for engineering artifacts.
Reference graph
Works this paper leans on
-
[1]
Association for Computational Linguistics. doi: 10.18653/v1/2022.findings-acl.177. URL https://aclanthology.org/2022.findings-acl.177. Deepak Nathani, Lovish Madaan, Nicholas Roberts, Nikolay Bashlykov, Ajay Menon, Vincent Moens, Mikhail Plekhanov, Amar Budhiraja, Despoina Magka, Vladislav V orotilov, Gaurav Chaurasia, Dieuwke Hupkes, Ricardo Silveira Cab...
work page internal anchor Pith review Pith/arXiv arXiv doi:10.18653/v1/2022.findings-acl.177 2022
-
[2]
Layout Modifications: Use ‘GridSpec‘ to create a complex 2x2 grid layout. The top-left main plot (spanning the 1st row and 1st column) is a composite chart (three CCA lines + CKA bar chart). The top-right subplot (spanning the 1st row and 2nd column) is a box plot, used to display the overall data distribution of four data series ( cca_top1, cca_top3, cca...
-
[3]
--", alpha=0.6) ax_main.set_title(
Chart Type Conversion and Combination: In the top-right subplot, create a box plot for each of the four datasets and set appropriate labels. In the bottom zoomed-in plot, only draw the three CCA line charts and omit the CKA bar chart to emphasize the localized CCA dynamics. ... Additional Requirements: – Set the canvas size to 15×10 inches. – Use a 2×2 ‘G...
-
[4]
- The right side contains two subplots, each occupying a 1x1 space
Use ‘GridSpec‘ to create a complex dashboard-style layout: - The left side contains a main plot occupying a 2x2 space. - The right side contains two subplots, each occupying a 1x1 space
-
[5]
- Display the absolute values and trends of the annual research count
**Main Plot (Left Side)**: - Retain the original bar chart and exponential trend line. - Display the absolute values and trends of the annual research count
-
[6]
- Show the cumulative total of research counts to analyze the expansion of overall scale
**Top-Right Subplot**: - Convert the original data into an area chart. - Show the cumulative total of research counts to analyze the expansion of overall scale
-
[7]
- Highlight the distribution of recent contributions
**Bottom-Right Subplot**: - Use a donut chart to display the proportion of research counts from the last three years (2022–2024) relative to their total. - Highlight the distribution of recent contributions
work page 2022
-
[8]
Add titles to all subplots and ensure a unified visual style for clear communication and coordinated layout. **Additional Modifications**: - Adjust the overall canvas size to 16 inches × 9 inches. - Configure the layout as ‘GridSpec(2,3)‘: - The main plot occupies the first and second columns of all rows. - The top-right subplot is placed in the first row...
work page 2017
-
[9]
**Top-left plot (Performance Trend Comparison):** Divide the models into two groups: ’FinTabNet’ and ’PT1M-based’
-
[10]
**Top-right plot (Final Performance Ranking):** Use a horizontal bar chart to show the final accuracy of all 9 models at the last epoch..y
-
[11]
**Bottom-left plot (Key Model Showdown):** Plot the performance curves of the best model ‘pt1m_av6‘ and the baseline model ‘pubtables‘ separately. Identify the epoch where ‘pt1m_av6‘ first surpasses ‘pubtables‘ by more than 0.05 in accuracy, and use ‘axvspan‘ to highlight the region from
-
[12]
**Bottom-right plot (Performance vs. Stability):** Create a scatter plot where the X-axis represents the average accuracy of each model (mean over 30 epochs), and the Y-axis represents the standard deviation of accuracy. This plot evaluates whether high performance is accompanied by high instability. Add text labels to the best-performing, most stable, an...
-
[13]
Based on the visual features of the plot, you must infer the data and recreate the plot
**Data Extraction **: Extract the actual data from the provided image. Based on the visual features of the plot, you must infer the data and recreate the plot
-
[14]
**Recreate the Image **: Generate the Matplotlib code that reproduces the image exactly as it appears, including all elements such as: - Plot type (scatter, line, bar, etc.) - Axis labels and titles - Colors, markers, line styles, and other visual styles - Any legends, annotations, or gridlines present in the image
-
[15]
It should not require any external data files or variables not already present in the code
**Self-contained Code **: The Python code should be complete, executable, and self-contained. It should not require any external data files or variables not already present in the code. Your objective is to extract the any necessary details from the image and generate a Python script that accurately reproduces the plot. Now, please generate the Python cod...
-
[16]
Do not infer data from the image
**Use Provided Data **: You must use the data provided below in the generated code. Do not infer data from the image
-
[17]
**Follow Instructions **: Adhere to the specific plotting instructions provided
-
[18]
**Match Reference Image Style **: Use the reference image to understand the required visual style (colors, markers, line styles, labels, titles, legends, etc.) and replicate it as closely as possible
-
[19]
It should not require any external data files
**Self-contained Code **: The Python code should be complete, executable, and self-contained. It should not require any external data files. All data must be included within the script. 61 **Instruction:** {instruction_text} **Data:** {data_text} Now, based on the instruction, the data, and the reference image below, please generate the Python code. The o...
-
[20]
**Data Extraction **: Extract the necessary data from the ’data source image’
-
[21]
**Style Replication **: Replicate the visual style ( colors, markers, layout, etc.) from the ’style reference image’
-
[23]
**Self-contained Code **: The Python code must be complete, executable, and self-contained, without needing external data files. --- **Specific Task Instructions: ** {task_instructions} --- Now, using the data from the data source image and applying the style from the reference image according to the instructions, please generate the Python code. The outp...
-
[24]
**Understand the Base Image **: Analyze the provided image to understand the original plot’s data and structure
-
[25]
**Apply Edits **: Carefully read the instructions provided below and apply them to the base plot
-
[26]
**Generate Modified Code **: Generate a single, self- contained, and executable Python script that produces the final, edited visualization. The code should not require any external data files. **Editing Instructions: ** --- {instructions} --- Your objective is to generate a Python script that accurately reproduces the plot *after* applying the given inst...
-
[27]
Each sheet from the original Excel file is clearly marked
**Use Provided Data **: The data you need to plot is provided below in CSV format. Each sheet from the original Excel file is clearly marked. You should use libraries like pandas and io.StringIO to parse this CSV data
-
[28]
**Style Replication **: Replicate the visual style ( colors, markers, layout, fonts, etc.) from the ’style reference image’
-
[29]
**Follow Instructions **: Adhere to the specific instructions provided for the task
-
[30]
**Self-contained Code **: The Python code must be complete, executable, and self-contained. The data should be defined directly within the code (e.g., in a 63 pandas DataFrame loaded from a string), without needing to read any external files. --- **Specific Task Instructions: ** {task_instructions} --- **Data from Excel File (in CSV format): ** {excel_dat...
-
[31]
**Strictness is Key: ** Start with a perfect score of 100 and deduct points for EVERY visual difference, including but not limited to: chart type, data points, colors, line styles, markers, labels (content, font, and position), titles, legends, axes (limits, ticks, scaling), layout, aspect ratio, and any other visual element
-
[32]
**Identical Means Identical: ** A score of 100 is ONLY for images that are pixel-perfect or visually indistinguishable. Even a tiny difference in line thickness or a single different pixel color must result in a lower score
-
[33]
score" ( an integer from 0 to 100) and
**Heavy Penalties: ** Apply significant penalties for noticeable differences. For example, a different color map or a missing legend should lead to a large deduction. 64 Return ONLY a single JSON object with two keys: "score" ( an integer from 0 to 100) and "reason" (a concise, expert analysis in English, detailing every detected difference that justifies...
work page 2023
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.