Generating Statistical Charts with Validation-Driven LLM Workflows
Pith reviewed 2026-05-09 19:47 UTC · model grok-4.3
The pith
A validation-driven workflow turns tabular data into aligned statistical charts and exposes limits in multimodal chart reasoning.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By decomposing chart generation into dataset screening, plot proposal, code synthesis, rendering, validation-driven refinement, description generation, and question-answer generation, the workflow produces 1,500 charts spanning 24 families from 74 UCI datasets, each retained with its code, dataset context, description, and question-answer pairs. Evaluation on 16 MLLMs reveals that chart-syntax questions are nearly saturated while value extraction, comparison, and reasoning tasks remain challenging.
What carries the argument
The rendered-output validation step within the structured LLM workflow that detects and corrects visualization failures.
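One way to picture this step is as a loop in which a chart's code is re-rendered and re-checked until the image passes. The sketch below is illustrative only: the `render`, `validate`, and `revise` callables, the `ValidationResult` type, and the `max_rounds` budget are assumptions made for the example, not names or settings taken from the paper.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class ValidationResult:
    passed: bool
    issues: List[str] = field(default_factory=list)  # e.g. "overlapping tick labels"


def refine_until_valid(
    initial_code: str,
    render: Callable[[str], bytes],                 # hypothetical: plotting code -> image bytes
    validate: Callable[[bytes], ValidationResult],  # hypothetical: image -> verdict
    revise: Callable[[str, List[str]], str],        # hypothetical: code + issues -> new code
    max_rounds: int = 3,                            # assumed revision budget, not from the paper
) -> Optional[str]:
    """Return code whose rendering passes validation, or None if the budget runs out."""
    code = initial_code
    for attempt in range(max_rounds + 1):
        # The verdict is based on the rendered image, not on the code or the data.
        result = validate(render(code))
        if result.passed:
            return code
        if attempt < max_rounds:
            code = revise(code, result.issues)
    return None
```

Returning `None` when the budget is exhausted makes rejected charts explicit, so they can be counted rather than silently dropped.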
If this is right
- The workflow reliably produces diverse, readable charts with fully aligned artifacts (code, dataset context, description, QA pairs) for research use.
- Multimodal LLMs perform well on syntax but struggle with extracting values, comparisons, and reasoning from charts.
- The resulting dataset of 1,500 charts and 30,003 QA pairs supports diagnostic evaluation of chart-grounded reasoning.
- Chart generation can be treated as an inspectable, iterative process rather than a single prompt.
Where Pith is reading between the lines
- Such workflows could extend to generating charts from non-tabular data sources.
- The QA pairs could serve as training data to improve model performance on chart reasoning tasks.
- Applying the method to additional datasets might uncover patterns in which chart types are hardest for current models.
Load-bearing premise
The validation of rendered chart images can consistently identify and fix issues such as poor readability and semantic mismatch that are not apparent from the data or code.
What would settle it
Human inspection finding many generated charts with persistent readability problems despite the validation step, or MLLM evaluations not showing the expected gaps in reasoning capabilities.
Original abstract
Generating diverse, readable statistical charts from tabular data remains challenging for LLMs, as many failures become apparent after rendering and are not detectable from data or code alone. Existing chart datasets also rarely provide fully aligned artifacts, such as executable code, dataset context, and question-answer pairs. We present a structured LLM-based workflow that decomposes chart generation into dataset screening, plot proposal, code synthesis, rendering, validation-driven refinement, description generation, and question-answer generation. By incorporating rendered-output validation, the workflow addresses visualization-specific failure modes such as readability and semantic mismatch. It treats chart generation as an inspectable process rather than a one-shot prompt-to-code task, retaining each chart with its code, dataset context, description, and question-answer pairs. Applied to UCI datasets, the workflow produces 1,500 charts from 74 datasets, spanning 24 chart families and paired with 30,003 question-answer pairs. We evaluate 16 multimodal LLMs (MLLMs) on these chart-question pairs. The results show that chart-syntax questions are nearly saturated, while value extraction, comparison, and reasoning remain more challenging, illustrating the workflow's utility for diagnostic studies of chart-grounded multimodal reasoning.
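The "fully aligned artifacts" idea can be made concrete as a record type that keeps a chart together with everything generated alongside it. This is a minimal sketch under stated assumptions: the class and field names (`ChartRecord`, `QAPair`, `is_fully_aligned`) are hypothetical and do not reflect the paper's actual storage schema.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class QAPair:
    question: str
    answer: str
    category: str  # e.g. "syntax", "value_extraction", "comparison", "reasoning"


@dataclass
class ChartRecord:
    chart_id: str          # hypothetical identifier
    dataset_name: str      # source UCI dataset
    chart_family: str      # one of the 24 chart families
    code: str              # executable plotting code
    dataset_context: str
    description: str
    qa_pairs: List[QAPair] = field(default_factory=list)

    def is_fully_aligned(self) -> bool:
        """True when code, context, description, and at least one QA pair are present."""
        return all([
            self.code.strip(),
            self.dataset_context.strip(),
            self.description.strip(),
            self.qa_pairs,
        ])
```

A check like `is_fully_aligned` is one simple way to enforce that no chart is retained without its companion artifacts.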
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces a structured LLM-based workflow for generating statistical charts from tabular data that incorporates rendered-output validation to address issues like readability and semantic mismatch not detectable from code or data alone. The workflow is applied to 74 UCI datasets to create a dataset of 1,500 charts spanning 24 families, each accompanied by executable code, dataset context, descriptions, and 30,003 question-answer pairs. The authors evaluate 16 multimodal LLMs on these chart-QA pairs, finding that syntax-related questions are nearly solved while value extraction, comparison, and reasoning tasks remain challenging.
Significance. If the validation process is shown to be reliable, this work provides a valuable, large-scale, fully-aligned chart dataset that can serve as a benchmark for multimodal chart understanding and reasoning. It demonstrates the utility of iterative, inspectable generation processes over one-shot approaches and highlights specific limitations in current MLLMs for chart-grounded tasks. The scale (1,500 charts, 30k QA pairs) and diversity (24 chart families) are notable strengths.
major comments (2)
- [Workflow description (Section 3)] The central claim that rendered-output validation reliably catches visualization-specific failures (readability, semantic mismatch) invisible from code or data rests on an unverified assumption. No quantitative validation accuracy metrics, human agreement rates, inter-rater statistics, or rejection-rate breakdowns are reported for the validation step, making it impossible to determine whether the retained 1,500 charts are representative or the result of self-filtering bias.
- [Evaluation (Section 5)] The evaluation of 16 MLLMs reports that chart-syntax questions are nearly saturated while value extraction, comparison, and reasoning remain challenging. However, without baseline comparisons (e.g., to non-LLM chart QA systems) or details on how the 30,003 QA pairs were generated and balanced across chart families, the diagnostic utility of the dataset for these specific failure modes cannot be fully assessed.
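The agreement statistics requested in the first major comment could be computed along the following lines. This is a sketch of Cohen's kappa between two sets of accept/reject labels, for instance an LLM judge and a human annotator on the same sample of charts; the function name and label values are assumptions, and the paper reports no such study.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(rater_a: Sequence[Hashable], rater_b: Sequence[Hashable]) -> float:
    """Chance-corrected agreement between two raters labelling the same items."""
    if len(rater_a) != len(rater_b) or len(rater_a) == 0:
        raise ValueError("need two non-empty label sequences of equal length")
    n = len(rater_a)
    # Fraction of items on which the two raters gave the same label.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Agreement expected by chance, from each rater's marginal label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1.0 - expected)
```

Kappa is preferable to raw agreement here because a judge that accepts nearly every chart would agree with a lenient human most of the time by chance alone.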
minor comments (2)
- [Abstract] The abstract reports evaluation outcomes only at a high level (syntax questions nearly saturated, other categories harder); the manuscript should include concrete quantitative results (e.g., per-question-type accuracies) to support the saturation and challenge claims.
- [Dataset construction] Clarify the distribution of the 30,003 question-answer pairs across the 1,500 charts and 24 families to allow readers to evaluate balance and potential skew toward easier chart types.
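Both minor comments ask for simple tabulations. As a hedged illustration, the sketch below computes accuracy per question type and the share of QA pairs per chart family; the function names and the row layout are assumptions, and no numbers from the paper are implied.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Tuple


def accuracy_by_group(rows: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """rows are (group, is_correct) pairs; returns the accuracy within each group."""
    total: Dict[str, int] = defaultdict(int)
    correct: Dict[str, int] = defaultdict(int)
    for group, is_correct in rows:
        total[group] += 1
        correct[group] += int(is_correct)
    return {group: correct[group] / total[group] for group in total}


def share_by_group(groups: Iterable[str]) -> Dict[str, float]:
    """Fraction of items falling in each group, e.g. QA pairs per chart family."""
    counts = Counter(groups)
    n = sum(counts.values())
    return {group: count / n for group, count in counts.items()} if n else {}
```

Passing question types as the group key addresses the first comment; passing chart families as the group key addresses the second.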
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. The comments highlight important aspects of the validation process and evaluation that we will clarify and strengthen. We respond to each major comment below.
Point-by-point responses
- Referee: [Workflow description (Section 3)] The central claim that rendered-output validation reliably catches visualization-specific failures (readability, semantic mismatch) invisible from code or data rests on an unverified assumption. No quantitative validation accuracy metrics, human agreement rates, inter-rater statistics, or rejection-rate breakdowns are reported for the validation step, making it impossible to determine whether the retained 1,500 charts are representative or the result of self-filtering bias.
  Authors: We acknowledge that the current manuscript does not provide quantitative metrics such as validation accuracy, human agreement rates, or inter-rater statistics for the rendered-output validation step. The validation procedure in Section 3 employs an LLM judge guided by explicit criteria targeting readability issues and semantic mismatches that cannot be detected from code or data alone. To address the concern, we will add rejection-rate breakdowns by failure category and representative examples of rejected charts to the revised Section 3 and appendix. We maintain that the iterative, inspectable nature of the workflow improves quality over one-shot generation; the final set spans 24 families from 74 diverse UCI datasets, which supports broad coverage rather than narrow self-filtering. We will also add a brief discussion of potential limitations arising from the absence of human validation studies.
  Revision: partial
- Referee: [Evaluation (Section 5)] The evaluation of 16 MLLMs reports that chart-syntax questions are nearly saturated while value extraction, comparison, and reasoning remain challenging. However, without baseline comparisons (e.g., to non-LLM chart QA systems) or details on how the 30,003 QA pairs were generated and balanced across chart families, the diagnostic utility of the dataset for these specific failure modes cannot be fully assessed.
  Authors: Section 4 already outlines the QA generation process, which uses category-specific templates (syntax, value extraction, comparison, reasoning) applied uniformly across charts to produce the 30,003 pairs. We will expand this description with explicit balancing statistics per chart family and include the full generation prompts in the appendix. Regarding baselines, the paper focuses on diagnosing limitations in current multimodal LLMs rather than comparing against traditional non-LLM systems. We agree that a simple non-LLM baseline (e.g., heuristic or rule-based QA) would enhance the diagnostic framing, and we will add such a comparison to Section 5 in the revision.
  Revision: partial
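To make the promised non-LLM baseline concrete, one trivial option is a majority-answer baseline that ignores the chart image entirely and always predicts the most frequent answer for each question category. This is only a sketch of one possible heuristic, not the baseline the authors commit to; the function names and the (category, answer) data layout are assumptions.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Tuple


def fit_majority_baseline(train: Iterable[Tuple[str, str]]) -> Dict[str, str]:
    """train holds (category, answer) pairs; returns the most frequent answer per category."""
    by_category: Dict[str, Counter] = defaultdict(Counter)
    for category, answer in train:
        by_category[category][answer.strip().lower()] += 1
    return {c: counts.most_common(1)[0][0] for c, counts in by_category.items()}


def baseline_accuracy(model: Dict[str, str], test: Iterable[Tuple[str, str]]) -> float:
    """Exact-match accuracy (case-insensitive) of the majority answers on held-out pairs."""
    pairs = list(test)
    if not pairs:
        raise ValueError("test set is empty")
    # Categories unseen during fitting have no prediction and count as misses.
    hits = sum(model.get(category) == answer.strip().lower() for category, answer in pairs)
    return hits / len(pairs)
```

Because this baseline never looks at the chart, its accuracy gives a floor: any question category on which it scores well can be answered from answer priors alone, which weakens that category's diagnostic value.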
Circularity Check
No circularity: purely descriptive empirical workflow with no derivations or self-referential predictions
The paper presents an LLM-based workflow for generating charts from UCI datasets, including steps like dataset screening, code synthesis, rendering, validation, and QA pair creation. It produces 1,500 charts and evaluates MLLMs on them. There are no mathematical equations, fitted parameters, predictions derived from inputs, or self-citations that bear the central claim. The workflow is described as an inspectable process using rendered-output validation, but this is presented as an empirical method without reducing to self-definition or fitted inputs by construction. The result is self-contained as a descriptive contribution with independent content in the generated dataset and evaluation.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: LLMs can perform subtasks such as code synthesis and rendered-image validation when given appropriate prompts