
arxiv: 2605.16274 · v1 · pith:K7DIYAYInew · submitted 2026-04-06 · 💻 cs.HC · cs.AI

ChartDesign: Towards LLM Designer of Data Visualization

Pith reviewed 2026-05-21 09:17 UTC · model grok-4.3

classification 💻 cs.HC cs.AI
keywords chart design · large language models · data visualization · fine-tuning · vision-language models · LoRA adapters · automatic visualization

The pith

Fine-tuned large language models generate chart design specifications from tabular data with up to 84 percent accuracy and produce human-preferred visualizations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether large language models can learn to design charts the way human experts do by training them on pairs of data tables and real chart designs. The authors build a training set by collecting charts from public surveys (PewResearch) and academic sources (CharXiV), then using vision-language models to extract attributes such as chart type, alignment, titles, and spacing as structured JSON. They fine-tune several base models with LoRA adapters so that the models map new data tables directly to design specifications. If the approach holds, automatic systems could replace handcrafted rules and make effective data visualization available without specialized design skills.

Core claim

ChartDesign post-trains LLMs on a corpus of data-design pairs extracted from PewResearch and CharXiV charts. Vision-language models first label each chart for type, subtype, alignment, titles, axis labels, and bar spacing, producing JSON targets. LoRA adapters are then trained on Phi3, Qwen3, and InternVL2.5 so the models output complete design specifications given only tabular input. On held-out tests the best model reaches 84 percent accuracy against a 53 percent baseline and generalizes to unseen domains; rendered charts are judged visually appealing and preferred by human raters.
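
To make the data-design pairing concrete, the following is a minimal sketch of how one supervised training example could be assembled from a CSV table and a design JSON. The top-level keys `chart_type`, `axes`, `text_elements`, and `legend` come from the schema named in the Figure 1 caption; the nested keys, the values, the prompt wording, and the helper `build_training_example` are assumptions made for illustration, not the authors' format.

```python
import csv
import io
import json


def build_training_example(csv_text: str, design: dict) -> dict:
    """Pair a CSV table (model input) with a design JSON (target output)."""
    # Parse and re-serialize the table so input formatting is normalized.
    rows = list(csv.reader(io.StringIO(csv_text.strip())))
    buf = io.StringIO()
    csv.writer(buf, lineterminator="\n").writerows(rows)
    # Hypothetical instruction text; the paper's prompt templates are in its appendix.
    prompt = (
        "Design a chart for the following table. "
        "Answer with a JSON design specification.\n\n" + buf.getvalue()
    )
    # sort_keys yields a deterministic target string for supervised training.
    return {"prompt": prompt, "completion": json.dumps(design, sort_keys=True)}


# Hypothetical table and design values, for illustration only.
table = "group,share\nAdults 18-29,62\nAdults 65+,41\n"
design = {
    "chart_type": "bar",
    "axes": {"orientation": "horizontal"},
    "text_elements": {"title": "Share by age group"},
    "legend": {"visible": False},
}

example = build_training_example(table, design)
print(example["prompt"])
print(example["completion"])
```

Serializing the target with sorted keys is one way to keep the output format stable across examples; whether the authors do this is not stated.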

What carries the argument

LoRA fine-tuning of LLMs on JSON design attributes extracted by vision-language models from existing charts, which learns a direct mapping from tabular data to renderable chart specifications.
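
For readers unfamiliar with LoRA, the sketch below shows the standard low-rank update, W' = W + (alpha / r) · B · A, and why it is cheap to train: only the two small factor matrices are learned. The layer size, rank, and scaling value are hypothetical; the source does not report the adapter hyperparameters used for Phi3, Qwen3, or InternVL2.5.

```python
def matmul(x, y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)] for row in x]


def lora_effective_weight(W, A, B, alpha):
    """Return W + (alpha / r) * B @ A, where r is the adapter rank.

    W: d_out x d_in frozen base weight
    A: r x d_in trainable factor
    B: d_out x r trainable factor (conventionally initialized to zero)
    """
    rank = len(A)
    scale = alpha / rank
    delta = matmul(B, A)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(W, delta)]


def lora_trainable_params(d_in: int, d_out: int, rank: int) -> int:
    """Number of trainable parameters in one adapted linear layer."""
    return rank * (d_in + d_out)


# Tiny worked example (hypothetical numbers).
W = [[1.0, 2.0], [3.0, 4.0]]
A = [[1.0, 0.0]]                      # rank 1
B_zero = [[0.0], [0.0]]               # zero init leaves the base model unchanged
B_trained = [[1.0], [2.0]]

unchanged = lora_effective_weight(W, A, B_zero, alpha=2.0)
adapted = lora_effective_weight(W, A, B_trained, alpha=2.0)

# Hypothetical 4096 x 4096 projection with rank 8.
full = 4096 * 4096
low_rank = lora_trainable_params(4096, 4096, 8)
print(unchanged, adapted, low_rank, full)
```

With these illustrative sizes the adapter trains 65,536 parameters for a layer that holds 16,777,216, which is why the choice of rank appears as a free parameter in the ledger further down the page.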

If this is right

  • Chart design accuracy rises from 53 percent to 84 percent on held-out data while generalizing beyond the training domains.
  • Charts rendered from the generated specifications are judged visually appealing and human-preferred in a study comparing them with the original charts.
  • Rule-based visualization tools can be replaced by learned models that require no handcrafted heuristics per domain.
  • The human-AI performance gap in chart creation narrows when models imitate expert designs at scale.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar extraction-plus-fine-tuning pipelines could be applied to other visual design tasks such as report layouts or dashboard templates.
  • Once trained, these models could be embedded in data-analysis software to suggest or auto-generate visualizations during exploratory work.
  • Large collections of public charts become reusable training resources for teaching design principles to any generative model.

Load-bearing premise

The vision-language models extract accurate and unbiased labels for chart types, alignments, titles, and spacing from the source images.

What would settle it

The performance and generalization claims would be falsified if the fine-tuned models, run on a fresh collection of charts from a domain absent from the training sources, scored below 60 percent accuracy or earned human preference ratings no better than the baseline.

Figures

Figures reproduced from arXiv: 2605.16274 by Aniruddh Bansal, Mohammed Afaan Ansari, Tianyi Zhou.

Figure 1: Design schema overview. A sample box plot (right) annotated with selected fields from our design JSON (left). Arrows indicate how visual components map to schema attributes such as chart_type, text_elements, axes, legend, bars_or_data_points and boxplot_style.
… design decisions. Additional statistics are provided in Appendix A.4. Data extraction. We extract the underlying numerical data for each chart by pr…
Figure 2: Dataset construction and annotation pipeline. Chart images are processed by a vision-language model to extract the underlying data as CSV tables and infer design attributes as a structured JSON. Prompt templates are provided in Appendix A.1.
Figure 3: Model fine-tuning pipeline. Starting from a curated dataset of CSV tables and design JSONs, a chart LLM is trained to map tabular data and instructions to structured design specifications using an attribute-aware loss with inverse-frequency weighting. The base model is adapted via supervised fine-tuning or LoRA adapters to obtain specialised chart LLMs.
… variation). This agreement confirms that the judge is…
Figure 4: LLM-based evaluation pipeline. Predictions are compared to ground truth by flattening the JSONs and using an LLM judge to determine semantic equivalence for each attribute. Matches are aggregated into accuracy metrics.
Figure 5: Attribute-wise accuracy on the PewResearch + CharXiV test set. Comparison of base models, LoRA fine-tuning, and full fine-tuning across different chart attributes on the mixed-domain evaluation set.
4.1 Experimental setup. We evaluate three training variants: Base (no fine-tuning), Finetuned (Pew) (1,101 PewResearch charts), and Finetuned (Pew+CharXiV) (2,118 charts), applied to Phi-3 (4B), Qwen-3 (8B), a…
Figure 6: Qualitative comparison across 5 charts. Each example shows the original PewResearch chart, the output from a base Phi-3 model, and our finetuned model (Qwen-3). We observe that the base model often misidentifies chart types and alignment, while the finetuned model closely matches the target layout and encoding.
4.4 Findings and discussion. The quantitative results above reveal several insights into chart de…
Figure 7: Attribute-wise accuracy on the PewResearch test set. Comparison of base models, LoRA fine-tuning, and full fine-tuning across different chart attributes.
Figure 8: Summary of human evaluation results. Left: distribution of responses for chart type correctness and orientation/grouping questions. Centre: average ratings for layout fidelity and visual plausibility. Right: annotators' overall preferences for the original vs. predicted charts.
Design Choice Prompt: You are a vision-language model that analyzes chart images and their associated CSV data. Given the chart ima…
Figure 9: Language-independent chart specification. A single structured design JSON enables faithful reproduction of the same visualization across Python (Matplotlib), Vega-Lite, Altair, and R (ggplot2). Each panel uses enlarged fonts for axes and legends to ensure readability after two-column scaling. See Appendix A.5 for a full description of the design schema.
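
The Figure 3 caption mentions an attribute-aware loss with inverse-frequency weighting, but the excerpt does not give the formula. The sketch below shows one common form of inverse-frequency weighting, w_c = N / (K · n_c) for N samples, K classes, and n_c samples in class c, applied to a weighted negative log-likelihood for a single attribute. Both the formula and the helper names are assumptions; the paper's loss may differ.

```python
import math
from collections import Counter


def inverse_frequency_weights(labels):
    """Weight each class by N / (K * count), so rare classes count more."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {label: n / (k * count) for label, count in counts.items()}


def weighted_nll(probabilities, labels, weights):
    """Weighted mean negative log-likelihood for one attribute.

    probabilities: one dict per sample mapping class -> predicted probability
    labels: the target class for each sample
    """
    total = sum(weights[y] * -math.log(p[y]) for p, y in zip(probabilities, labels))
    return total / sum(weights[y] for y in labels)


# Hypothetical chart_type distribution: bars dominate, box plots are rare.
labels = ["bar"] * 6 + ["line"] * 3 + ["box"] * 1
weights = inverse_frequency_weights(labels)
print(weights)

# A single rare-class sample predicted with probability 0.5.
loss = weighted_nll([{"bar": 0.5, "box": 0.5}], ["box"], weights)
print(loss)
```

Under this form the per-sample weights sum to N, so the overall loss scale is unchanged while errors on rare chart types such as box plots are penalized more heavily than errors on bars.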
Original abstract

Charts are the dominant medium for visualizing data, discovering patterns and trends, and communicating data-driven insights, yet designing them still requires expensive human effort and expertise, such as selecting appropriate chart types, axis orientations, font sizes, and layouts. Most automatic visualization systems rely on handcrafted heuristics or simple rule matching and therefore struggle to generalize across domains. This work explores the potential of large language models (LLMs) as chart designers. We propose ChartDesign, which post-trains LLMs to imitate human experts and generate chart design attributes given tabular data. To this end, we curate a diverse training corpus of data-design pairs from charts in public surveys (PewResearch) and academic repositories (CharXiV). Vision-language models are used to extract data and design attributes from these charts, including chart type, subtype, alignment, titles, axis labels, and bar spacing, formatted as JSON. We then fine-tune LoRA adapters on Phi3, Qwen3, and InternVL2.5 to learn a mapping from data to design specifications. ChartDesign significantly improves chart design performance over strong baselines, achieving up to 84% accuracy on a held-out test set (vs. 53% for the best baseline) and generalizing to unseen domains. We further show that charts rendered from ChartDesign-generated specifications are visually appealing and human-preferred, narrowing the human-AI gap in data visualization.
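
The last step in the abstract, rendering a chart from a generated specification, can be illustrated with a small translation from a design JSON to a Vega-Lite spec (Vega-Lite is one of the rendering targets listed in the Figure 9 caption). This is a sketch under assumptions: the nested design keys, the `MARKS` mapping, and the function `to_vega_lite` are hypothetical, and only a handful of Vega-Lite properties are populated.

```python
import json

# Assumed mapping from schema chart families to Vega-Lite mark names.
MARKS = {"bar": "bar", "line": "line", "area": "area",
         "scatter": "point", "box": "boxplot"}


def to_vega_lite(design: dict, rows: list, category: str, value: str) -> dict:
    """Translate a (hypothetical) design JSON plus data rows into a Vega-Lite spec."""
    horizontal = design.get("axes", {}).get("orientation") == "horizontal"
    cat = {"field": category, "type": "nominal"}
    val = {"field": value, "type": "quantitative"}
    return {
        "$schema": "https://vega.github.io/schema/vega-lite/v5.json",
        "title": design.get("text_elements", {}).get("title", ""),
        "data": {"values": rows},
        "mark": MARKS[design["chart_type"]],
        # Horizontal charts put the quantitative field on the x axis.
        "encoding": {"x": val, "y": cat} if horizontal else {"x": cat, "y": val},
    }


# Hypothetical design and data, for illustration only.
design = {
    "chart_type": "bar",
    "axes": {"orientation": "horizontal"},
    "text_elements": {"title": "Share by age group"},
}
rows = [{"group": "Adults 18-29", "share": 62}, {"group": "Adults 65+", "share": 41}]

spec = to_vega_lite(design, rows, category="group", value="share")
print(json.dumps(spec, indent=2))
```

Keeping the design JSON renderer-agnostic and translating it per target is the property the Figure 9 caption describes; a Matplotlib or ggplot2 backend would be a separate translation of the same dictionary.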

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces ChartDesign, which fine-tunes LLMs (Phi3, Qwen3, InternVL2.5) via LoRA adapters to map tabular data to chart design specifications (type, subtype, alignment, titles, axis labels, bar spacing) formatted as JSON. Training pairs are created by applying VLMs to extract attributes from charts in PewResearch and CharXiV repositories. The work reports up to 84% accuracy on a held-out test set (vs. 53% for the best baseline), generalization to unseen domains, and human preference for charts rendered from the generated specifications.

Significance. If the results hold after addressing evaluation concerns, the work offers a practical advance in automated visualization by showing LLMs can learn design mappings from existing chart corpora without handcrafted rules. The VLM-based data curation and multi-model fine-tuning approach, combined with both quantitative accuracy and human preference validation on rendered outputs, provides a replicable pipeline that could reduce expert effort in chart design.

major comments (2)
  1. [Abstract and §4 (Evaluation)] The 84% held-out accuracy is defined by agreement with VLM-extracted labels on the test split of the same PewResearch/CharXiV corpus. This metric risks quantifying reproduction of VLM labeling patterns (including any systematic errors in subtype classification or bar-spacing heuristics) rather than human-expert design quality. The manuscript must report the precise accuracy definition (exact match across all fields vs. per-attribute), any human validation or inter-rater agreement on a sample of VLM labels, and results of an ablation that measures performance against a small set of human-annotated ground truth.
  2. [§3.2 (Baselines)] The claim of improvement over 'strong baselines' (53% accuracy) lacks implementation details sufficient to rule out leakage or under-optimization. The paper should specify whether baselines are zero-shot versions of the same models, rule-based systems, or prior visualization tools, and confirm that test-set VLM labels were not used in any baseline training or prompting.
minor comments (2)
  1. [§3.1] Ensure consistent model naming (Phi3 vs. Phi-3) and clarify whether 'post-trains' refers to standard LoRA fine-tuning or an additional alignment stage.
  2. [§5] Human preference study results should include statistical significance tests and details on participant expertise and chart rendering pipeline.
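
The first major comment turns on the difference between exact-match and per-attribute accuracy. The sketch below, which follows the flatten-then-compare idea in the Figure 4 caption, shows how the two definitions can diverge on the same predictions. The `judge` function is a simple case-insensitive string comparison standing in for the paper's LLM judge, and the nested keys and sample values are hypothetical.

```python
def flatten(obj: dict, prefix: str = "") -> dict:
    """Flatten nested dicts into dotted paths, e.g. {'axes.orientation': ...}."""
    flat = {}
    for key, value in obj.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, path))
        else:
            flat[path] = value
    return flat


def judge(pred, gold) -> bool:
    """Stand-in for an LLM judge: case-insensitive string equality."""
    return str(pred).strip().lower() == str(gold).strip().lower()


def score(predictions, references):
    """Return (exact-match accuracy, per-attribute accuracy by path).

    Attributes present only in a prediction are ignored in this sketch.
    """
    hits, totals, exact = {}, {}, 0
    for pred, gold in zip(predictions, references):
        flat_pred, flat_gold = flatten(pred), flatten(gold)
        all_ok = True
        for path, gold_value in flat_gold.items():
            ok = path in flat_pred and judge(flat_pred[path], gold_value)
            hits[path] = hits.get(path, 0) + ok
            totals[path] = totals.get(path, 0) + 1
            all_ok = all_ok and ok
        exact += all_ok
    per_attribute = {path: hits[path] / totals[path] for path in totals}
    return exact / len(references), per_attribute


# Hypothetical labels and predictions.
gold = [
    {"chart_type": "bar", "axes": {"orientation": "horizontal"}},
    {"chart_type": "line", "axes": {"orientation": "vertical"}},
]
pred = [
    {"chart_type": "Bar", "axes": {"orientation": "horizontal"}},
    {"chart_type": "line", "axes": {"orientation": "horizontal"}},
]

exact_match, per_attribute = score(pred, gold)
print(exact_match, per_attribute)
```

Here one wrong orientation halves the exact-match score while three of the four individual attributes are still correct, which is why a headline figure such as 84% cannot be interpreted without knowing which definition it uses.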

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on evaluation rigor and baseline details. These comments highlight important aspects of validating our approach against human expertise and ensuring fair comparisons. We address each point below and will revise the manuscript accordingly.

Point-by-point responses
  1. Referee: [Abstract and §4 (Evaluation)] The 84% held-out accuracy is defined by agreement with VLM-extracted labels on the test split of the same PewResearch/CharXiV corpus. This metric risks quantifying reproduction of VLM labeling patterns (including any systematic errors in subtype classification or bar-spacing heuristics) rather than human-expert design quality. The manuscript must report the precise accuracy definition (exact match across all fields vs. per-attribute), any human validation or inter-rater agreement on a sample of VLM labels, and results of an ablation that measures performance against a small set of human-annotated ground truth.

    Authors: We acknowledge that the reported accuracy measures agreement with VLM-extracted labels on the held-out split, as the training corpus itself is constructed via VLM annotation of existing charts. This setup evaluates how effectively the fine-tuned models learn the data-to-design mapping present in the corpus. To strengthen validation against human-expert quality, we will revise §4 to explicitly define the metric as both exact JSON match across all fields and per-attribute accuracies (chart type, subtype, alignment, titles, axis labels, bar spacing). We will additionally annotate a random sample of 100 test instances with two human visualization experts, report inter-rater agreement (Cohen's kappa), and provide an ablation comparing model accuracy to these human ground-truth labels. This will be included in the revised evaluation section.

    revision: yes

  2. Referee: [§3.2 (Baselines)] The claim of improvement over 'strong baselines' (53% accuracy) lacks implementation details sufficient to rule out leakage or under-optimization. The paper should specify whether baselines are zero-shot versions of the same models, rule-based systems, or prior visualization tools, and confirm that test-set VLM labels were not used in any baseline training or prompting.

    Authors: We will expand §3.2 with full implementation details. The primary baselines are zero-shot prompting of the identical base models (Phi-3, Qwen3, InternVL2.5) using the same prompt template and output format but without any LoRA adaptation or training. We also include a rule-based heuristic baseline derived from standard visualization guidelines (e.g., chart-type selection rules from prior literature). We explicitly confirm that the held-out test-set VLM labels were never used in baseline prompting, training, or hyperparameter tuning; all baselines were evaluated solely on the same VLM-derived test labels for direct comparison, with no access to the training pairs. These clarifications will eliminate any ambiguity regarding leakage or optimization.

    revision: yes
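
The rebuttal proposes reporting Cohen's kappa for two human annotators. As a reference point, the sketch below computes kappa from two label sequences using the standard definition, kappa = (p_o − p_e) / (1 − p_e), where p_o is observed agreement and p_e is the agreement expected by chance from each rater's label frequencies. The annotations shown are invented for illustration and are not data from the paper.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b) -> float:
    """Cohen's kappa for two raters labeling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must label the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement from each rater's marginal label frequencies.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    labels = counts_a.keys() | counts_b.keys()
    expected = sum(counts_a[label] * counts_b[label] for label in labels) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical chart-type annotations from two experts.
expert_1 = ["bar", "bar", "line", "line", "bar", "box"]
expert_2 = ["bar", "bar", "line", "bar", "bar", "box"]

kappa = cohens_kappa(expert_1, expert_2)
print(round(kappa, 3))
```

In this toy case the raters agree on five of six items (p_o = 5/6) with chance agreement p_e = 15/36, giving kappa = 5/7; raw agreement alone would overstate reliability when one chart type dominates the sample.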

Circularity Check

0 steps flagged

No significant circularity detected

Full rationale

The paper constructs its training corpus by applying external vision-language models to extract design attributes from public repositories (PewResearch, CharXiV), then fine-tunes LLMs via LoRA and evaluates accuracy on a held-out test set using the same extraction process. This follows standard supervised learning without any self-definitional equations, fitted parameters renamed as predictions, or load-bearing self-citations. Human preference studies on rendered charts supply independent external validation. No derivation step reduces by construction to its own inputs; the reported gains (84% vs. 53%) are conventional benchmark comparisons rather than tautological quantities.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

The central claim rests on the assumption that VLM-extracted design attributes faithfully represent expert human choices and that fine-tuning on these pairs produces generalizable mappings; no new physical entities or mathematical axioms are introduced.

free parameters (1)
  • LoRA adapter rank and learning rate
    Hyperparameters chosen for fine-tuning the three base models; their specific values are not stated in the abstract but affect the reported accuracy.
axioms (1)
  • domain assumption: Vision-language models can reliably extract structured design attributes (chart type, alignment, labels, spacing) from rendered charts without systematic bias.
    Invoked when the training corpus is constructed from public charts; any extraction error directly affects the learned mapping.

pith-pipeline@v0.9.0 · 5784 in / 1546 out tokens · 27228 ms · 2026-05-21T09:17:31.938486+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?

  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
