pith. machine review for the scientific record.

arxiv: 2603.27064 · v2 · submitted 2026-03-28 · 💻 cs.CV · cs.AI · cs.CL

Recognition: 2 theorem links · Lean Theorem

ChartNet: A Million-Scale, High-Quality Multimodal Dataset for Robust Chart Understanding

Authors on Pith · no claims yet

Pith reviewed 2026-05-14 22:35 UTC · model grok-4.3

classification 💻 cs.CV · cs.AI · cs.CL
keywords chart understanding · multimodal dataset · synthetic data · vision language models · data visualization · chart reasoning · quality filtering

The pith

ChartNet supplies 1.5 million aligned multimodal chart samples to improve vision-language model performance on chart reasoning tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The authors create ChartNet as a large dataset of charts to address limitations in current models' ability to understand visual, numerical, and linguistic elements together. They use a code-guided synthesis method to generate diverse samples across many chart types, each with matching code, image, table, summary, and question-answer pairs. Quality filtering ensures the data is accurate and varied. Fine-tuning existing models on this dataset leads to better results on standard benchmarks, showing its value for training more capable chart interpretation systems.

Core claim

ChartNet is generated through a code-guided synthesis pipeline that produces 1.5 million samples across 24 chart types from 6 plotting libraries, with each sample containing plotting code, a rendered image, a data table, a natural-language summary, and QA with reasoning. The synthetic core is supplemented by human-annotated, real-world, safety, and grounding subsets, all processed through rigorous quality filtering to enable robust multimodal chart understanding.

What carries the argument

Code-guided synthesis pipeline that creates fine-grained cross-modal alignments between plotting code, chart images, data tables, summaries, and reasoning-based QA pairs.
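
A minimal sketch of what one synthesis step could look like, assuming a matplotlib-style renderer; the function name and the fixed bar-chart template are illustrative, not the authors' implementation. The point it shows: every modality is derived from the same generating code, so the cross-modal alignment holds by construction.

```python
# Minimal sketch of one code-guided synthesis step (illustrative, not the
# authors' pipeline): sample data, emit plotting code, render it, and keep
# the aligned (code, image, table, summary, QA) tuple.
import io
import random

import matplotlib
matplotlib.use("Agg")  # headless rendering
import matplotlib.pyplot as plt


def synthesize_sample(seed: int) -> dict:
    rng = random.Random(seed)
    categories = ["A", "B", "C", "D"]
    values = [round(rng.uniform(10, 100), 1) for _ in categories]

    # The plotting code is the ground truth; every other modality is
    # derived from it, which is what gives the cross-modal alignment.
    code = (
        "import matplotlib.pyplot as plt\n"
        f"plt.bar({categories!r}, {values!r})\n"
        "plt.title('Revenue by segment')\n"
    )

    fig, ax = plt.subplots()
    ax.bar(categories, values)
    ax.set_title("Revenue by segment")
    buf = io.BytesIO()
    fig.savefig(buf, format="png")
    plt.close(fig)

    top = categories[values.index(max(values))]
    return {
        "code": code,
        "image_png": buf.getvalue(),
        "table": dict(zip(categories, values)),
        "summary": f"Bar chart of revenue by segment; {top} is highest.",
        "qa": {
            "question": "Which segment has the highest value?",
            "answer": top,
            "reasoning": f"Comparing {values}, the maximum belongs to {top}.",
        },
    }


sample = synthesize_sample(seed=7)
print(sample["summary"])
```

Scaled across 24 chart types and 6 libraries, variations of this loop are what would produce the dataset's aligned quintuples.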

If this is right

  • Fine-tuning on ChartNet improves performance across multiple chart understanding benchmarks.
  • The dataset serves as large-scale supervision for developing foundation models with better chart comprehension.
  • Specialized subsets address safety, grounding, and real-world variations in chart data.
  • Diverse generation across plotting libraries increases representation variety in the training data.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Models trained this way may generalize better to unseen chart styles or complex multi-panel figures.
  • Similar synthesis methods could be applied to other structured visual domains like scientific diagrams or maps.
  • Open availability of such data lowers the barrier for research in specialized visual reasoning tasks.

Load-bearing premise

The synthetic data from the code-guided pipeline and quality filters matches the statistical properties of real-world charts closely enough that models trained on it perform well on actual charts.
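
One hedged way to probe this premise (not something the paper reports) is a two-sample test: if a simple probe can easily tell ChartNet renders from real charts on cheap image statistics, the distribution gap is likely material. `featurize` is a hypothetical function mapping a chart image to a fixed-length vector (e.g., color histogram plus edge density); it is not from the paper.

```python
# Illustrative two-sample check of the synthetic-vs-real premise,
# assuming scikit-learn and pre-computed feature vectors.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score


def distribution_gap_score(synthetic_feats: np.ndarray,
                           real_feats: np.ndarray) -> float:
    """Cross-validated accuracy of a linear probe separating the two
    sources; ~0.5 suggests a small gap, near 1.0 a large one."""
    X = np.vstack([synthetic_feats, real_feats])
    y = np.concatenate([np.zeros(len(synthetic_feats)),
                        np.ones(len(real_feats))])
    probe = LogisticRegression(max_iter=1000)
    return cross_val_score(probe, X, y, cv=5).mean()
```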

What would settle it

A model fine-tuned on ChartNet performing no better than, or worse than, one trained only on existing smaller real-world chart datasets when tested on a new collection of diverse real charts would falsify the utility claim.
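
A hedged sketch of that test as a decision procedure; `evaluate` is a hypothetical scoring harness returning accuracy in [0, 1], and the margin is a free choice, not something the paper specifies.

```python
# Sketch of the falsification test: fine-tune two models (one on ChartNet,
# one on a smaller real-chart corpus), score both on held-out real charts,
# and check the sign of the delta.

def utility_claim_holds(evaluate, chartnet_model, baseline_model,
                        real_chart_testset, margin: float = 0.0) -> bool:
    """Return True only if the ChartNet-tuned model beats the baseline
    by more than `margin` on unseen real charts."""
    acc_chartnet = evaluate(chartnet_model, real_chart_testset)
    acc_baseline = evaluate(baseline_model, real_chart_testset)
    return (acc_chartnet - acc_baseline) > margin
```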

Figures

Figures reproduced from arXiv: 2603.27064 by Amit Alfassy, Aude Oliva, Ben Wiesel, Daniel Caraballo, Daniel Karl I. Weidele, Dhiraj Joshi, Ekaterina Arutyunova, Eli Schwartz, Florian Scheidegger, Hang Hua, Isaac Sanchez, Jovana Kondic, Luis Lastras, Minghao Liu, Pengyuan Li, Peter Staar, Qunshu Lin, Roei Herzig, Rogerio Feris, Shafiq Abedin, Sicong Jiang, Steven I. Ross, Xinyue Yu, Yagmur Gizem Cinar, Yunfei Zhao, Zexue He, Zihan Wang.

Figure 1. Code-guided chart augmentation: first, a seed chart image is passed to a vision-language model for chart reconstruction …

Figure 2. An illustration of synthetic chart images generated from a single seed chart using the ChartNet pipeline. A seed chart is first …

Figure 3. Data attributes, chart types, and plotting packages included in ChartNet.

Figure 4. Distribution of chart types generated for ChartNet.

Figure 5. Distribution of plotting packages used in ChartNet.

Figure 6. Examples of QAs with reasoning traces (CoT) generated by our pipeline.

Figures 7–10. High-quality real-world charts with clear labels, readable annotations, sufficient quantitative structure, and non-trivial reasoning.

Figure 11. Grounding-based question-and-answer examples.

Figure 12. An example chart used with adversarial …

Figure 13. Human–GPT agreement on the chart data extraction …

Figure 14. Average scores assigned by human annotators and by …
read the original abstract

Understanding charts requires models to jointly reason over geometric visual patterns, structured numerical data, and natural language -- a capability where current vision-language models (VLMs) remain limited. We introduce ChartNet, a high-quality, million-scale multimodal dataset designed to advance chart interpretation and reasoning. ChartNet leverages a novel code-guided synthesis pipeline to generate 1.5 million diverse chart samples spanning 24 chart types and 6 plotting libraries. Each sample consists of five aligned components: plotting code, rendered chart image, data table, natural language summary, and question-answering with reasoning, providing fine-grained cross-modal alignment. To capture the full spectrum of chart comprehension, ChartNet additionally includes specialized subsets encompassing human annotated data, real-world data, safety, and grounding. Moreover, a rigorous quality-filtering pipeline ensures visual fidelity, semantic accuracy, and diversity across chart representations. Fine-tuning on ChartNet consistently improves results across benchmarks, demonstrating its utility as large-scale supervision for multimodal models. As the largest open-source dataset of its kind, ChartNet aims to support the development of foundation models with robust and generalizable capabilities for data visualization understanding. The dataset is publicly available at https://huggingface.co/datasets/ibm-granite/ChartNet

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces ChartNet, a 1.5-million-sample multimodal dataset for chart understanding constructed primarily via a code-guided synthesis pipeline across 24 chart types and 6 plotting libraries. Each sample provides five aligned modalities (plotting code, rendered image, data table, natural-language summary, and QA pairs with reasoning), augmented by human-annotated, real-world, safety, and grounding subsets and a quality-filtering pipeline. The central claim is that fine-tuning vision-language models on ChartNet produces consistent benchmark gains, establishing the dataset as the largest open-source resource for robust chart interpretation.

Significance. If the benchmark gains are shown to arise from genuine generalization rather than synthetic in-distribution performance, ChartNet would be a high-impact contribution as the largest publicly released chart dataset with explicit cross-modal alignments. The scale, diversity of generation libraries, and inclusion of specialized subsets could materially advance VLM training for visualization reasoning tasks. The open release itself is a clear strength that supports reproducibility.

major comments (2)
  1. [Abstract] The claim that fine-tuning on ChartNet 'consistently improves results across benchmarks' is stated without any numerical scores, error bars, per-benchmark breakdowns, or ablation tables isolating the effect of the synthetic pipeline versus the real-world/human-annotated subsets. This absence directly undermines evaluation of the headline generalization claim.
  2. [Abstract] No quantitative split is reported between the 1.5M synthetic samples and the real-world/human-annotated subsets, nor are any ablation results shown that test whether the quality-filtering pipeline closes the distribution gap to noisy real charts (scanning artifacts, compression, hand-drawn variation). This information is load-bearing for the robustness argument.
minor comments (1)
  1. The description of the five aligned components per sample would be clearer with an explicit diagram or table showing the exact format and alignment mechanism for each modality.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback, which helps clarify the presentation of our results and the robustness claims. We address each major comment point by point below, indicating planned revisions where appropriate.

read point-by-point responses
  1. Referee: [Abstract] The claim that fine-tuning on ChartNet 'consistently improves results across benchmarks' is stated without any numerical scores, error bars, per-benchmark breakdowns, or ablation tables isolating the effect of the synthetic pipeline versus the real-world/human-annotated subsets. This absence directly undermines evaluation of the headline generalization claim.

    Authors: We agree that the abstract would benefit from explicit quantitative support for the generalization claim. The full manuscript reports detailed benchmark results in the Experiments section, including per-benchmark scores, improvements over baselines, and comparisons involving the synthetic core versus supplementary subsets. To directly address the concern, we will revise the abstract to include key numerical highlights (e.g., average gains and representative per-benchmark deltas) along with pointers to the relevant tables and ablations. revision: yes

  2. Referee: [Abstract] No quantitative split is reported between the 1.5M synthetic samples and the real-world/human-annotated subsets, nor are any ablation results shown that test whether the quality-filtering pipeline closes the distribution gap to noisy real charts (scanning artifacts, compression, hand-drawn variation). This information is load-bearing for the robustness argument.

    Authors: We appreciate this observation. Section 3 of the manuscript already specifies the composition (core 1.5M synthetic samples generated across 24 chart types and 6 libraries, augmented by smaller human-annotated, real-world, safety, and grounding subsets). We will update the abstract to report the approximate quantitative split. For the quality-filtering pipeline, Section 4 describes the visual, semantic, and diversity checks; however, we acknowledge that dedicated ablations isolating performance on specific real-world artifacts (scanning, compression, hand-drawn) are not exhaustively presented. We will add a concise discussion of the pipeline's design rationale and any supporting quality metrics in the revision. revision: partial

Circularity Check

0 steps flagged

No circularity: empirical dataset construction with external benchmarks

full rationale

The paper describes a code-guided synthesis pipeline that creates 1.5M chart samples across 24 types and 6 libraries, augmented by real-world and human-annotated subsets plus a quality-filtering step, followed by empirical fine-tuning results on external benchmarks. There are no mathematical derivations, equations, fitted parameters renamed as predictions, or load-bearing self-citations. All claims rest on observable dataset construction and measured performance gains against independent test sets; nothing reduces outputs to inputs by construction, so the work is self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the assumption that code-based rendering faithfully produces diverse, high-fidelity charts and that post-generation filtering removes semantic errors without introducing post-hoc bias.

axioms (2)
  • domain assumption Code-guided rendering produces visually faithful and semantically accurate chart images across 24 types and 6 libraries
    Invoked in the synthesis pipeline description
  • domain assumption Quality-filtering pipeline preserves diversity while ensuring visual fidelity and semantic accuracy
    Stated as ensuring dataset quality; a minimal round-trip check of this assumption is sketched below
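
As a concrete reading of the second axiom, a semantic filter could be approximated by a round-trip check: re-derive the plotted values and compare them to the source table. This is an illustrative sketch, not the paper's filter, which also runs visual and diversity checks; all names are hypothetical.

```python
# Illustrative round-trip check in the spirit of the second axiom:
# accept a sample only if the values recovered from the rendered chart
# match the source table within tolerance (guards against rendering or
# labeling bugs introduced by the generation code).

def semantic_round_trip_ok(source_table: dict[str, float],
                           extracted_table: dict[str, float],
                           rel_tol: float = 0.01) -> bool:
    """True iff both tables share keys and every extracted value is
    within `rel_tol` relative error of the source value."""
    if source_table.keys() != extracted_table.keys():
        return False
    return all(
        abs(extracted_table[k] - v) <= rel_tol * max(abs(v), 1e-9)
        for k, v in source_table.items()
    )


# usage: drop the sample when the check fails
assert semantic_round_trip_ok({"A": 10.0, "B": 20.0}, {"A": 10.0, "B": 20.0})
```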

pith-pipeline@v0.9.0 · 5643 in / 1191 out tokens · 38077 ms · 2026-05-14T22:35:03.642887+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

131 extracted references · 131 canonical work pages · 7 internal anchors

  1. [1]

    gpt-oss-120b & gpt-oss-20b Model Card

    Sandhini Agarwal, Lama Ahmad, Jason Ai, Sam Altman, Andy Applebaum, Edwin Arbus, Rahul K Arora, Yu Bai, Bowen Baker, Haiming Bao, et al. gpt-oss-120b & gpt-oss-20b model card. arXiv preprint arXiv:2508.10925, 2025.

  2. [2]

    LLaVA-OneVision-1.5: Fully Open Framework for Democratized Multimodal Training

    Xiang An, Yin Xie, Kaicheng Yang, Wenkang Zhang, Xiuwei Zhao, Zheng Cheng, Yirui Wang, Songcen Xu, Changrui Chen, Chunsheng Wu, et al. LLaVA-OneVision-1.5: Fully open framework for democratized multimodal training. arXiv preprint arXiv:2509.23661, 2025.

  3. [3]

    VQA: Visual question answering

    Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C Lawrence Zitnick, and Devi Parikh. VQA: Visual question answering. In Proceedings of the IEEE International Conference on Computer Vision, 2015.

  4. [4]

    Qwen2.5-VL Technical Report

    Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2.5-VL technical report. arXiv preprint arXiv:2502.13923, 2025.

  5. [5]

    Bain & Company insights

    Bain. Bain & Company insights. https://www.bain.com/insights/, 2025.

  6. [6]

    Molmo and PixMo: Open weights and open data for state-of-the-art multimodal models

    Matt Deitke, Christopher Clark, Sangho Lee, Rohun Tripathi, Yue Yang, Jae Sung Park, Mohammadreza Salehi, Niklas Muennighoff, Kyle Lo, Luca Soldaini, et al. Molmo and PixMo: Open weights and open data for state-of-the-art multimodal models. arXiv e-prints, 2024.

  7. [7]

    MAmmoTH-VL: Eliciting multimodal reasoning with instruction tuning at scale

    Jiawei Guo, Tianyu Zheng, Yizhi Li, Yuelin Bai, Bo Li, Yubo Wang, King Zhu, Graham Neubig, Wenhu Chen, and Xiang Yue. MAmmoTH-VL: Eliciting multimodal reasoning with instruction tuning at scale. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics.

  8. [8]

    LLaVA-UHD: an LMM perceiving any aspect ratio and high-resolution images

    Zonghao Guo, Ruyi Xu, Yuan Yao, Junbo Cui, Zanlin Ni, Chunjiang Ge, Tat-Seng Chua, Zhiyuan Liu, and Gao Huang. LLaVA-UHD: an LMM perceiving any aspect ratio and high-resolution images. In European Conference on Computer Vision, 2024.

  9. [9]

    ChartLlama: A multimodal LLM for chart understanding and generation

    Yucheng Han, Chi Zhang, Xin Chen, Xu Yang, Zhibin Wang, Gang Yu, Bin Fu, and Hanwang Zhang. ChartLlama: A multimodal LLM for chart understanding and generation, 2023.

  10. [10]

    GLM-4.1V-Thinking: Towards versatile multimodal reasoning with scalable reinforcement learning

    Wenyi Hong, Wenmeng Yu, Xiaotao Gu, Guo Wang, Guobing Gan, Haomiao Tang, Jiale Cheng, Ji Qi, Junhui Ji, Lihang Pan, et al. GLM-4.1V-Thinking: Towards versatile multimodal reasoning with scalable reinforcement learning. arXiv e-prints, 2025.

  11. [11]

    FineMatch: Aspect-based fine-grained image and text mismatch detection and correction

    Hang Hua, Jing Shi, Kushal Kafle, Simon Jenni, Daoan Zhang, John Collomosse, Scott Cohen, and Jiebo Luo. FineMatch: Aspect-based fine-grained image and text mismatch detection and correction. In European Conference on Computer Vision, 2024.

  12. [12]

    MMComposition: Revisiting the compositionality of pre-trained vision-language models

    Hang Hua, Yunlong Tang, Ziyun Zeng, Liangliang Cao, Zhengyuan Yang, Hangfeng He, Chenliang Xu, and Jiebo Luo. MMComposition: Revisiting the compositionality of pre-trained vision-language models. arXiv preprint arXiv:2410.09733, 2024.

  13. [13]

    FineCaption: Compositional image captioning focusing on wherever you want at any granularity

    Hang Hua, Qing Liu, Lingzhi Zhang, Jing Shi, Soo Ye Kim, Zhifei Zhang, Yilin Wang, Jianming Zhang, Zhe Lin, and Jiebo Luo. FineCaption: Compositional image captioning focusing on wherever you want at any granularity. In Proceedings of the Computer Vision and Pattern Recognition Conference, 2025.

  14. [14]

    V2Xum-LLM: Cross-modal video summarization with temporal prompt instruction tuning

    Hang Hua, Yunlong Tang, Chenliang Xu, and Jiebo Luo. V2Xum-LLM: Cross-modal video summarization with temporal prompt instruction tuning. In Proceedings of the AAAI Conference on Artificial Intelligence, 2025.

  15. [15]

    MMIG-Bench: Towards comprehensive and explainable evaluation of multi-modal image generation models

    Hang Hua, Ziyun Zeng, Yizhi Song, Yunlong Tang, Liu He, Daniel Aliaga, Wei Xiong, and Jiebo Luo. MMIG-Bench: Towards comprehensive and explainable evaluation of multi-modal image generation models. arXiv preprint arXiv:2505.19415, 2025.

  16. [16]

    DAVE: A VLM vision encoder for document understanding and web agents

    Brandon Huang, Hang Hua, Zhuoran Yu, Trevor Darrell, Rogerio Feris, and Roei Herzig. DAVE: A VLM vision encoder for document understanding and web agents. arXiv preprint arXiv:2512.17221, 2025.

  17. [17]

    EvoChart: A benchmark and a self-training approach towards real-world chart understanding

    Muye Huang, Han Lai, Xinyu Zhang, Wenjun Wu, Jie Ma, Lingling Zhang, and Jun Liu. EvoChart: A benchmark and a self-training approach towards real-world chart understanding, 2025.

  18. [18]

    Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models

    Wenxuan Huang, Bohan Jia, Zijie Zhai, Shaosheng Cao, Zheyu Ye, Fei Zhao, Zhe Xu, Yao Hu, and Shaohui Lin. Vision-R1: Incentivizing reasoning capability in multimodal large language models. arXiv preprint arXiv:2503.06749.

  19. [19]

    GQA: A new dataset for real-world visual reasoning and compositional question answering

    Drew A Hudson and Christopher D Manning. GQA: A new dataset for real-world visual reasoning and compositional question answering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019.

  20. [20]

    CLEVR: A diagnostic dataset for compositional language and elementary visual reasoning

    Justin Johnson, Bharath Hariharan, Laurens Van Der Maaten, Li Fei-Fei, C Lawrence Zitnick, and Ross Girshick. CLEVR: A diagnostic dataset for compositional language and elementary visual reasoning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.

  21. [21]

    DVQA: Understanding data visualizations via question answering

    Kushal Kafle, Brian Price, Scott Cohen, and Christopher Kanan. DVQA: Understanding data visualizations via question answering, 2018.

  22. [22]

    FigureQA: An annotated figure dataset for visual reasoning

    Samira Ebrahimi Kahou, Vincent Michalski, Adam Atkinson, Akos Kadar, Adam Trischler, and Yoshua Bengio. FigureQA: An annotated figure dataset for visual reasoning.

  23. [23]

    OpenCQA: Open-ended question answering with charts

    Shankar Kantharaj, Xuan Long Do, Rixie Tiffany Ko Leong, Jia Qing Tan, Enamul Hoque, and Shafiq Joty. OpenCQA: Open-ended question answering with charts, 2022.

  24. [24]

    Chart-to-Text: A large-scale benchmark for chart summarization

    Shankar Kantharaj, Rixie Tiffany Ko Leong, Xiang Lin, Ahmed Masry, Megh Thakkar, Enamul Hoque, and Shafiq Joty. Chart-to-Text: A large-scale benchmark for chart summarization, 2022.

  25. [25]

    A diagram is worth a dozen images

    Aniruddha Kembhavi, Mike Salvato, Eric Kolve, Minjoon Seo, Hannaneh Hajishirzi, and Ali Farhadi. A diagram is worth a dozen images. In European Conference on Computer Vision, 2016.

  26. [26]

    ChartGen: Scaling chart understanding via code-guided synthetic chart generation

    Jovana Kondic, Pengyuan Li, Dhiraj Joshi, Zexue He, Shafiq Abedin, Jennifer Sun, Ben Wiesel, Eli Schwartz, Ahmed Nassar, Bo Wu, Assaf Arbelle, Aude Oliva, Dan Gutfreund, Leonid Karlinsky, and Rogerio Feris. ChartGen: Scaling chart understanding via code-guided synthetic chart generation, 2025.

  27. [27]

    Multimodal ArXiv: A dataset for improving scientific comprehension of large vision-language models

    Lei Li, Yuqi Wang, Runxin Xu, Peiyi Wang, Xiachong Feng, Lingpeng Kong, and Qi Liu. Multimodal ArXiv: A dataset for improving scientific comprehension of large vision-language models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Bangkok, Thailand, 2024. Association for Computational Linguistics.

  28. [28]

    ChartCap: Mitigating hallucination of dense chart captioning

    Junyoung Lim, Jaewoo Ahn, and Gunhee Kim. ChartCap: Mitigating hallucination of dense chart captioning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 2025.

  29. [29]

    MMC: Advancing multimodal chart understanding with large-scale instruction tuning

    Fuxiao Liu, Xiaoyang Wang, Wenlin Yao, Jianshu Chen, Kaiqiang Song, Sangwoo Cho, Yaser Yacoob, and Dong Yu. MMC: Advancing multimodal chart understanding with large-scale instruction tuning, 2024.

  30. [30]

    Improved baselines with visual instruction tuning

    Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning, 2023.

  31. [31]

    SynChart: Synthesizing charts from language models

    Mengchen Liu, Qixiu Li, Dongdong Chen, Dong Chen, Jianmin Bao, and Yunsheng Li. SynChart: Synthesizing charts from language models, 2024.

  32. [33]

    Docling: An efficient open-source toolkit for AI-driven document conversion

    Nikolaos Livathinos, Christoph Auer, Maksym Lysak, Ahmed Nassar, Michele Dolfi, Panos Vagenas, Cesar Berrospi Ramis, Matteo Omenetti, Kasper Dinkla, Yusik Kim, et al. Docling: An efficient open-source toolkit for AI-driven document conversion. arXiv preprint arXiv:2501.17887, 2025.

  33. [34]

    SmolVLM: Redefining small and efficient multimodal models

    Andrés Marafioti, Orr Zohar, Miquel Farré, Merve Noyan, Elie Bakouch, Pedro Cuenca, Cyril Zakka, Loubna Ben Allal, Anton Lozhkov, Nouamane Tazi, et al. SmolVLM: Redefining small and efficient multimodal models. arXiv preprint arXiv:2504.05299, 2025.

  34. [35]

    OK-VQA: A visual question answering benchmark requiring external knowledge

    Kenneth Marino, Mohammad Rastegari, Ali Farhadi, and Roozbeh Mottaghi. OK-VQA: A visual question answering benchmark requiring external knowledge. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019.

  35. [36]

    ChartQA: A benchmark for question answering about charts with visual and logical reasoning

    Ahmed Masry, Do Xuan Long, Jia Qing Tan, Shafiq Joty, and Enamul Hoque. ChartQA: A benchmark for question answering about charts with visual and logical reasoning, 2022.

  36. [37]

    UniChart: A universal vision-language pretrained model for chart comprehension and reasoning

    Ahmed Masry, Parsa Kavehzadeh, Xuan Long Do, Enamul Hoque, and Shafiq Joty. UniChart: A universal vision-language pretrained model for chart comprehension and reasoning, 2023.

  37. [38]

    ChartGemma: Visual instruction-tuning for chart reasoning in the wild

    Ahmed Masry, Megh Thakkar, Aayush Bajaj, Aaryaman Kartha, Enamul Hoque, and Shafiq Joty. ChartGemma: Visual instruction-tuning for chart reasoning in the wild, 2024.

  38. [39]

    ChartQAPro: A more diverse and challenging benchmark for chart question answering

    Ahmed Masry, Mohammed Saidul Islam, Mahir Ahmed, Aayush Bajaj, Firoz Kabir, Aaryaman Kartha, Md Tahmid Rahman Laskar, Mizanur Rahman, Shadikur Rahman, Mehrad Shahmohammadi, et al. ChartQAPro: A more diverse and challenging benchmark for chart question answering. arXiv preprint arXiv:2504.05506, 2025.

  39. [40]

    ChartAssistant: A universal chart multimodal language model via chart-to-table pre-training and multitask instruction tuning

    Fanqing Meng, Wenqi Shao, Quanfeng Lu, Peng Gao, Kaipeng Zhang, Yu Qiao, and Ping Luo. ChartAssistant: A universal chart multimodal language model via chart-to-table pre-training and multitask instruction tuning, 2024.

  40. [41]

    The Llama 4 herd: The beginning of a new era of natively multimodal AI innovation

    AI Meta. The Llama 4 herd: The beginning of a new era of natively multimodal AI innovation. https://ai.meta.com/blog/llama-4-multimodal-intelligence/, 2025.

  41. [42]

    PlotQA: Reasoning over scientific plots

    Nitesh Methani et al. PlotQA: Reasoning over scientific plots.

  42. [43]

    Scientific chart QA: A perspective from scientific literature

    Authors omitted here for brevity. Scientific chart QA: A perspective from scientific literature, 2024.

  43. [44]

    ChartGalaxy: A dataset for infographic chart understanding and generation

    Authors omitted here for brevity. ChartGalaxy: A dataset for infographic chart understanding and generation, 2025.

  44. [45]

    ChartReasoner: Code-driven modality bridging for long-context chart reasoning

    Authors omitted here for brevity. ChartReasoner: Code-driven modality bridging for long-context chart reasoning.

  45. [46]

    Our World in Data

    OWID. Our World in Data. https://ourworldindata.org/, 2025.

  46. [47]

    Pew Research Center

    Pew. Pew Research Center. https://www.pewresearch.org/, 2025.

  47. [48]

    Latent chain-of-thought for visual reasoning

    Guohao Sun, Hang Hua, Jian Wang, Jiebo Luo, Sohail Dianat, Majid Rabbani, Raghuveer Rao, and Zhiqiang Tao. Latent chain-of-thought for visual reasoning. arXiv preprint arXiv:2510.23925, 2025.

  48. [49]

    VisText: A benchmark for semantically rich chart captioning

    Benny J. Tang, Angie Boggust, and Arvind Satyanarayan. VisText: A benchmark for semantically rich chart captioning.

  49. [50]

    VidComposition: Can MLLMs analyze compositions in compiled videos?

    Yunlong Tang, Junjia Guo, Hang Hua, Susan Liang, Mingqian Feng, Xinyang Li, Rui Mao, Chao Huang, Jing Bi, Zeliang Zhang, et al. VidComposition: Can MLLMs analyze compositions in compiled videos? In Proceedings of the Computer Vision and Pattern Recognition Conference, 2025.

  50. [51]

    Granite Vision: a lightweight, open-source multimodal model for enterprise intelligence

    Granite Vision Team, Leonid Karlinsky, Assaf Arbelle, Abraham Daniels, Ahmed Nassar, Amit Alfassi, Bo Wu, Eli Schwartz, Dhiraj Joshi, Jovana Kondic, et al. Granite Vision: a lightweight, open-source multimodal model for enterprise intelligence. arXiv preprint arXiv:2502.09927, 2025.

  51. [52]

    Kimi-VL Technical Report

    Kimi Team, Angang Du, Bohong Yin, Bowei Xing, Bowen Qu, Bowen Wang, Cheng Chen, Chenlin Zhang, Chenzhuang Du, Chu Wei, et al. Kimi-VL technical report. arXiv preprint arXiv:2504.07491, 2025.

  52. [53]

    Qwen3 technical report

    Qwen Team. Qwen3 technical report, 2025.

  53. [54]

    TRL: Transformer Reinforcement Learning

    Leandro von Werra, Younes Belkada, Lewis Tunstall, Edward Beeching, Tristan Thrush, Nathan Lambert, Shengyi Huang, Kashif Rasul, and Quentin Gallouédec. TRL: Transformer reinforcement learning. https://github.com/huggingface/trl, 2020.

  54. [55]

    CharXiv: Charting gaps in realistic chart understanding in multimodal LLMs

    Zirui Wang, Mengzhou Xia, Luxi He, Howard Chen, Yitao Liu, Richard Zhu, Kaiqu Liang, Xindi Wu, Haotian Liu, Sadhika Malladi, Alexis Chevalier, Sanjeev Arora, and Danqi Chen. CharXiv: Charting gaps in realistic chart understanding in multimodal LLMs, 2024.

  55. [56]

    World Bank Open Data

    World Bank. World Bank open data. https://www.worldbank.org/, 2025.

  56. [57]

    Plot2Code: A comprehensive benchmark for evaluating multi-modal large language models in code generation from scientific plots

    Chengyue Wu, Yixiao Ge, Qiushan Guo, Jiahao Wang, Zhixuan Liang, Zeyu Lu, Ying Shan, and Ping Luo. Plot2Code: A comprehensive benchmark for evaluating multi-modal large language models in code generation from scientific plots.

  58. [59]

    Chartx & chartvlm: A versatile benchmark and foundation model for complicated chart rea- soning, 2025

    Renqiu Xia, Bo Zhang, Hancheng Ye, Xiangchao Yan, Qi Liu, Hongbin Zhou, Zijun Chen, Min Dou, Botian Shi, Junchi Yan, and Yu Qiao. Chartx & chartvlm: A versatile benchmark and foundation model for complicated chart rea- soning, 2025. 2

  59. [60]

    Llava-cot: Let vision language models reason step-by-step

    Guowei Xu, Peng Jin, Ziang Wu, Hao Li, Yibing Song, Lichao Sun, and Li Yuan. Llava-cot: Let vision language models reason step-by-step. InProceedings of the IEEE/CVF International Conference on Computer Vision, 2025. 4, I

  60. [61]

    Chartbench: A benchmark for complex visual reasoning in charts, 2024

    Zhengzhuo Xu, Sinan Du, Yiyan Qi, Chengjin Xu, Chun Yuan, and Jian Guo. Chartbench: A benchmark for complex visual reasoning in charts, 2024. 2

  61. [62]

    Chartmimic: Evaluating lmm’s cross-modal reason- ing capability via chart-to-code generation, 2025

    Cheng Yang, Chufan Shi, Yaxin Liu, Bo Shui, Junjie Wang, Mohan Jing, Linran Xu, Xinyu Zhu, Siheng Li, Yuxiang Zhang, Gongye Liu, Xiaomei Nie, Deng Cai, and Yujiu Yang. Chartmimic: Evaluating lmm’s cross-modal reason- ing capability via chart-to-code generation, 2025. 2, 7, XI

  62. [63]

    Scaling text-rich image understanding via code- guided synthetic multimodal data generation, 2025

    Yue Yang, Ajay Patel, Matt Deitke, Tanmay Gupta, Luca Weihs, Andrew Head, Mark Yatskar, Chris Callison-Burch, Ranjay Krishna, Aniruddha Kembhavi, and Christopher Clark. Scaling text-rich image understanding via code- guided synthetic multimodal data generation, 2025. 2, 3

  63. [64]

    Tinychart: Efficient chart understanding with visual token merging and program- of-thoughts learning, 2024

    Liang Zhang, Anwen Hu, Haiyang Xu, Ming Yan, Yichen Xu, Qin Jin, Ji Zhang, and Fei Huang. Tinychart: Efficient chart understanding with visual token merging and program- of-thoughts learning, 2024. 2, 3

  64. [65]

    Gpt-4v(ision) as a generalist evalu- ator for vision-language tasks, 2023

    Xinlu Zhang, Yujie Lu, Weizhi Wang, An Yan, Jun Yan, Lianke Qin, Heng Wang, Xifeng Yan, William Yang Wang, and Linda Ruth Petzold. Gpt-4v(ision) as a generalist evalu- ator for vision-language tasks, 2023. XI

  65. [66]

    Chartcoder: Advancing multimodal large language model for chart-to- code generation, 2025

    Xuanle Zhao, Xianzhen Luo, Qi Shi, Chi Chen, Shuo Wang, Wanxiang Che, Zhiyuan Liu, and Maosong Sun. Chartcoder: Advancing multimodal large language model for chart-to- code generation, 2025. 2

  66. [67]

    InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models

    Jinguo Zhu, Weiyun Wang, Zhe Chen, Zhaoyang Liu, Shen- glong Ye, Lixin Gu, Hao Tian, Yuchen Duan, Weijie Su, Jie Shao, et al. Internvl3: Exploring advanced training and test-time recipes for open-source multimodal models.arXiv preprint arXiv:2504.10479, 2025. 2

  67. [68]

    Multimodal c4: An open, billion-scale corpus of images interleaved with text

    Wanrong Zhu, Jack Hessel, Anas Awadalla, Samir Yitzhak Gadre, Jesse Dodge, Alex Fang, Youngjae Yu, Ludwig Schmidt, William Yang Wang, and Yejin Choi. Multimodal C4: An open, billion-scale corpus of images interleaved with text.arXiv preprint arXiv:2304.06939, 2023. 3 11 ChartNet: A Million-Scale, High-Quality Multimodal Dataset for Robust Chart Understand...

Showing the first 67 of 131 extracted references.