pith. machine review for the scientific record.

arxiv: 2603.27064 · v2 · submitted 2026-03-28 · 💻 cs.CV · cs.AI · cs.CL

Recognition: 2 theorem links · Lean Theorem

ChartNet: A Million-Scale, High-Quality Multimodal Dataset for Robust Chart Understanding

Authors on Pith · no claims yet

Pith reviewed 2026-05-14 22:35 UTC · model grok-4.3

classification 💻 cs.CV · cs.AI · cs.CL
keywords chart understanding · multimodal dataset · synthetic data · vision language models · data visualization · chart reasoning · quality filtering

The pith

ChartNet supplies 1.5 million aligned multimodal chart samples to improve vision-language model performance on chart reasoning tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The authors create ChartNet as a large dataset of charts to address limitations in current models' ability to understand visual, numerical, and linguistic elements together. They use a code-guided synthesis method to generate diverse samples across many chart types, each with matching code, image, table, summary, and question-answer pairs. Quality filtering ensures the data is accurate and varied. Fine-tuning existing models on this dataset leads to better results on standard benchmarks, showing its value for training more capable chart interpretation systems.

Core claim

ChartNet is generated through a code-guided synthesis pipeline that produces 1.5 million samples across 24 chart types from 6 plotting libraries, with each sample containing plotting code, a rendered image, a data table, a natural-language summary, and QA with reasoning. The synthetic core is supplemented by human-annotated, real-world, safety, and grounding subsets, all processed through rigorous quality filtering to enable robust multimodal chart understanding.

What carries the argument

Code-guided synthesis pipeline that creates fine-grained cross-modal alignments between plotting code, chart images, data tables, summaries, and reasoning-based QA pairs.
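
A minimal sketch of what one synthesis step could look like, assuming a matplotlib-style renderer; the function name and the fixed bar-chart template are illustrative, not the authors' implementation. The point it shows: every modality is derived from the same generating code, so the cross-modal alignment holds by construction.

```python
# Minimal sketch of one code-guided synthesis step (illustrative, not the
# authors' pipeline): sample data, emit plotting code, render it, and keep
# the aligned (code, image, table, summary, QA) tuple.
import io
import random

import matplotlib
matplotlib.use("Agg")  # headless rendering
import matplotlib.pyplot as plt


def synthesize_sample(seed: int) -> dict:
    rng = random.Random(seed)
    categories = ["A", "B", "C", "D"]
    values = [round(rng.uniform(10, 100), 1) for _ in categories]

    # The plotting code is the ground truth; every other modality is
    # derived from it, which is what gives the cross-modal alignment.
    code = (
        "import matplotlib.pyplot as plt\n"
        f"plt.bar({categories!r}, {values!r})\n"
        "plt.title('Revenue by segment')\n"
    )

    fig, ax = plt.subplots()
    ax.bar(categories, values)
    ax.set_title("Revenue by segment")
    buf = io.BytesIO()
    fig.savefig(buf, format="png")
    plt.close(fig)

    top = categories[values.index(max(values))]
    return {
        "code": code,
        "image_png": buf.getvalue(),
        "table": dict(zip(categories, values)),
        "summary": f"Bar chart of revenue by segment; {top} is highest.",
        "qa": {
            "question": "Which segment has the highest value?",
            "answer": top,
            "reasoning": f"Comparing {values}, the maximum belongs to {top}.",
        },
    }


sample = synthesize_sample(seed=7)
print(sample["summary"])
```

Scaled across 24 chart types and 6 libraries, variations of this loop are what would produce the dataset's aligned quintuples.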

If this is right

  • Fine-tuning on ChartNet improves performance across multiple chart understanding benchmarks.
  • The dataset serves as large-scale supervision for developing foundation models with better chart comprehension.
  • Specialized subsets address safety, grounding, and real-world variations in chart data.
  • Diverse generation across plotting libraries increases representation variety in the training data.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Models trained this way may generalize better to unseen chart styles or complex multi-panel figures.
  • Similar synthesis methods could be applied to other structured visual domains like scientific diagrams or maps.
  • Open availability of such data lowers the barrier for research in specialized visual reasoning tasks.

Load-bearing premise

The synthetic data from the code-guided pipeline and quality filters matches the statistical properties of real-world charts closely enough that models trained on it perform well on actual charts.
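
One hedged way to probe this premise (not something the paper reports) is a two-sample test: if a simple probe can easily tell ChartNet renders from real charts on cheap image statistics, the distribution gap is likely material. `featurize` is a hypothetical function mapping a chart image to a fixed-length vector (e.g., color histogram plus edge density); it is not from the paper.

```python
# Illustrative two-sample check of the synthetic-vs-real premise,
# assuming scikit-learn and pre-computed feature vectors.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score


def distribution_gap_score(synthetic_feats: np.ndarray,
                           real_feats: np.ndarray) -> float:
    """Cross-validated accuracy of a linear probe separating the two
    sources; ~0.5 suggests a small gap, near 1.0 a large one."""
    X = np.vstack([synthetic_feats, real_feats])
    y = np.concatenate([np.zeros(len(synthetic_feats)),
                        np.ones(len(real_feats))])
    probe = LogisticRegression(max_iter=1000)
    return cross_val_score(probe, X, y, cv=5).mean()
```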

What would settle it

A model fine-tuned on ChartNet performing no better than, or worse than, one trained only on existing smaller real-world chart datasets when tested on a new collection of diverse real charts would falsify the utility claim.
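
A hedged sketch of that test as a decision procedure; `evaluate` is a hypothetical scoring harness returning accuracy in [0, 1], and the margin is a free choice, not something the paper specifies.

```python
# Sketch of the falsification test: fine-tune two models (one on ChartNet,
# one on a smaller real-chart corpus), score both on held-out real charts,
# and check the sign of the delta.

def utility_claim_holds(evaluate, chartnet_model, baseline_model,
                        real_chart_testset, margin: float = 0.0) -> bool:
    """Return True only if the ChartNet-tuned model beats the baseline
    by more than `margin` on unseen real charts."""
    acc_chartnet = evaluate(chartnet_model, real_chart_testset)
    acc_baseline = evaluate(baseline_model, real_chart_testset)
    return (acc_chartnet - acc_baseline) > margin
```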

Figures

Figures reproduced from arXiv: 2603.27064 by Amit Alfassy, Aude Oliva, Ben Wiesel, Daniel Caraballo, Daniel Karl I. Weidele, Dhiraj Joshi, Ekaterina Arutyunova, Eli Schwartz, Florian Scheidegger, Hang Hua, Isaac Sanchez, Jovana Kondic, Luis Lastras, Minghao Liu, Pengyuan Li, Peter Staar, Qunshu Lin, Roei Herzig, Rogerio Feris, Shafiq Abedin, Sicong Jiang, Steven I. Ross, Xinyue Yu, Yagmur Gizem Cinar, Yunfei Zhao, Zexue He, Zihan Wang.

Figure 1. Code-guided chart augmentation: first, a seed chart image is passed to a vision-language model for chart reconstruction …

Figure 2. An illustration of synthetic chart images generated from a single seed chart using the ChartNet pipeline. A seed chart is first …

Figure 3. Data attributes, chart types, and plotting packages included in ChartNet.

Figure 4. Distribution of chart types generated for ChartNet.

Figure 5. Distribution of plotting packages used in ChartNet.

Figure 6. Examples of QAs with reasoning traces (CoT) generated by our pipeline.

Figures 7–10. High-quality real-world charts with clear labels, readable annotations, sufficient quantitative structure, and non-trivial reasoning.

Figure 11. Grounding-based question-and-answer examples.

Figure 12. An example chart used with adversarial …

Figure 13. Human–GPT agreement on the chart data extraction …

Figure 14. Average scores assigned by human annotators and by …
read the original abstract

Understanding charts requires models to jointly reason over geometric visual patterns, structured numerical data, and natural language -- a capability where current vision-language models (VLMs) remain limited. We introduce ChartNet, a high-quality, million-scale multimodal dataset designed to advance chart interpretation and reasoning. ChartNet leverages a novel code-guided synthesis pipeline to generate 1.5 million diverse chart samples spanning 24 chart types and 6 plotting libraries. Each sample consists of five aligned components: plotting code, rendered chart image, data table, natural language summary, and question-answering with reasoning, providing fine-grained cross-modal alignment. To capture the full spectrum of chart comprehension, ChartNet additionally includes specialized subsets encompassing human annotated data, real-world data, safety, and grounding. Moreover, a rigorous quality-filtering pipeline ensures visual fidelity, semantic accuracy, and diversity across chart representations. Fine-tuning on ChartNet consistently improves results across benchmarks, demonstrating its utility as large-scale supervision for multimodal models. As the largest open-source dataset of its kind, ChartNet aims to support the development of foundation models with robust and generalizable capabilities for data visualization understanding. The dataset is publicly available at https://huggingface.co/datasets/ibm-granite/ChartNet

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces ChartNet, a 1.5-million-sample multimodal dataset for chart understanding constructed primarily via a code-guided synthesis pipeline across 24 chart types and 6 plotting libraries. Each sample provides five aligned modalities (plotting code, rendered image, data table, natural-language summary, and QA pairs with reasoning), augmented by human-annotated, real-world, safety, and grounding subsets and a quality-filtering pipeline. The central claim is that fine-tuning vision-language models on ChartNet produces consistent benchmark gains, establishing the dataset as the largest open-source resource for robust chart interpretation.

Significance. If the benchmark gains are shown to arise from genuine generalization rather than synthetic in-distribution performance, ChartNet would be a high-impact contribution as the largest publicly released chart dataset with explicit cross-modal alignments. The scale, diversity of generation libraries, and inclusion of specialized subsets could materially advance VLM training for visualization reasoning tasks. The open release itself is a clear strength that supports reproducibility.

major comments (2)
  1. [Abstract] The claim that fine-tuning on ChartNet 'consistently improves results across benchmarks' is stated without any numerical scores, error bars, per-benchmark breakdowns, or ablation tables isolating the effect of the synthetic pipeline versus the real-world/human-annotated subsets. This absence directly undermines evaluation of the headline generalization claim.
  2. [Abstract] No quantitative split is reported between the 1.5M synthetic samples and the real-world/human-annotated subsets, nor are any ablation results shown that test whether the quality-filtering pipeline closes the distribution gap to noisy real charts (scanning artifacts, compression, hand-drawn variation). This information is load-bearing for the robustness argument.
minor comments (1)
  1. The description of the five aligned components per sample would be clearer with an explicit diagram or table showing the exact format and alignment mechanism for each modality.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback, which helps clarify the presentation of our results and the robustness claims. We address each major comment point by point below, indicating planned revisions where appropriate.

read point-by-point responses
  1. Referee: [Abstract] The claim that fine-tuning on ChartNet 'consistently improves results across benchmarks' is stated without any numerical scores, error bars, per-benchmark breakdowns, or ablation tables isolating the effect of the synthetic pipeline versus the real-world/human-annotated subsets. This absence directly undermines evaluation of the headline generalization claim.

    Authors: We agree that the abstract would benefit from explicit quantitative support for the generalization claim. The full manuscript reports detailed benchmark results in the Experiments section, including per-benchmark scores, improvements over baselines, and comparisons involving the synthetic core versus supplementary subsets. To directly address the concern, we will revise the abstract to include key numerical highlights (e.g., average gains and representative per-benchmark deltas) along with pointers to the relevant tables and ablations. revision: yes

  2. Referee: [Abstract] No quantitative split is reported between the 1.5M synthetic samples and the real-world/human-annotated subsets, nor are any ablation results shown that test whether the quality-filtering pipeline closes the distribution gap to noisy real charts (scanning artifacts, compression, hand-drawn variation). This information is load-bearing for the robustness argument.

    Authors: We appreciate this observation. Section 3 of the manuscript already specifies the composition (core 1.5M synthetic samples generated across 24 chart types and 6 libraries, augmented by smaller human-annotated, real-world, safety, and grounding subsets). We will update the abstract to report the approximate quantitative split. For the quality-filtering pipeline, Section 4 describes the visual, semantic, and diversity checks; however, we acknowledge that dedicated ablations isolating performance on specific real-world artifacts (scanning, compression, hand-drawn) are not exhaustively presented. We will add a concise discussion of the pipeline's design rationale and any supporting quality metrics in the revision. revision: partial

Circularity Check

0 steps flagged

No circularity: empirical dataset construction with external benchmarks

full rationale

The paper describes a code-guided synthesis pipeline that creates 1.5M chart samples across 24 types and 6 libraries, augmented by real-world and human-annotated subsets plus a quality-filtering step, followed by empirical fine-tuning results on external benchmarks. There are no mathematical derivations, equations, fitted parameters renamed as predictions, or load-bearing self-citations. All claims rest on observable dataset construction and measured performance gains against independent test sets; nothing reduces outputs to inputs by construction, so the work is self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the assumption that code-based rendering faithfully produces diverse, high-fidelity charts and that post-generation filtering removes semantic errors without introducing post-hoc bias.

axioms (2)
  • domain assumption Code-guided rendering produces visually faithful and semantically accurate chart images across 24 types and 6 libraries
    Invoked in the synthesis pipeline description
  • domain assumption Quality-filtering pipeline preserves diversity while ensuring visual fidelity and semantic accuracy
    Stated as ensuring dataset quality; a minimal round-trip check of this assumption is sketched below
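
As a concrete reading of the second axiom, a semantic filter could be approximated by a round-trip check: re-derive the plotted values and compare them to the source table. This is an illustrative sketch, not the paper's filter, which also runs visual and diversity checks; all names are hypothetical.

```python
# Illustrative round-trip check in the spirit of the second axiom:
# accept a sample only if the values recovered from the rendered chart
# match the source table within tolerance (guards against rendering or
# labeling bugs introduced by the generation code).

def semantic_round_trip_ok(source_table: dict[str, float],
                           extracted_table: dict[str, float],
                           rel_tol: float = 0.01) -> bool:
    """True iff both tables share keys and every extracted value is
    within `rel_tol` relative error of the source value."""
    if source_table.keys() != extracted_table.keys():
        return False
    return all(
        abs(extracted_table[k] - v) <= rel_tol * max(abs(v), 1e-9)
        for k, v in source_table.items()
    )


# usage: drop the sample when the check fails
assert semantic_round_trip_ok({"A": 10.0, "B": 20.0}, {"A": 10.0, "B": 20.0})
```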

pith-pipeline@v0.9.0 · 5643 in / 1191 out tokens · 38077 ms · 2026-05-14T22:35:03.642887+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

131 extracted references · 131 canonical work pages · 7 internal anchors

  1. [1]

    gpt-oss-120b & gpt-oss-20b Model Card

    Sandhini Agarwal, Lama Ahmad, Jason Ai, Sam Altman, Andy Applebaum, Edwin Arbus, Rahul K Arora, Yu Bai, Bowen Baker, Haiming Bao, et al. gpt-oss-120b & gpt-oss-20b model card. arXiv preprint arXiv:2508.10925, 2025.

  2. [2]

    LLaVA-OneVision-1.5: Fully Open Framework for Democratized Multimodal Training

    Xiang An, Yin Xie, Kaicheng Yang, Wenkang Zhang, Xiuwei Zhao, Zheng Cheng, Yirui Wang, Songcen Xu, Changrui Chen, Chunsheng Wu, et al. LLaVA-OneVision-1.5: Fully open framework for democratized multimodal training. arXiv preprint arXiv:2509.23661, 2025.

  3. [3]

    VQA: Visual question answering

    Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C Lawrence Zitnick, and Devi Parikh. VQA: Visual question answering. In Proceedings of the IEEE International Conference on Computer Vision, 2015.

  4. [4]

    Qwen2.5-VL Technical Report

    Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2.5-VL technical report. arXiv preprint arXiv:2502.13923, 2025.

  5. [5]

    Bain & Company insights

    Bain. Bain & Company insights. https://www.bain.com/insights/, 2025.

  6. [6]

    Molmo and PixMo: Open weights and open data for state-of-the-art multimodal models

    Matt Deitke, Christopher Clark, Sangho Lee, Rohun Tripathi, Yue Yang, Jae Sung Park, Mohammadreza Salehi, Niklas Muennighoff, Kyle Lo, Luca Soldaini, et al. Molmo and PixMo: Open weights and open data for state-of-the-art multimodal models. arXiv e-prints, 2024.

  7. [7]

    MAmmoTH-VL: Eliciting multimodal reasoning with instruction tuning at scale

    Jiawei Guo, Tianyu Zheng, Yizhi Li, Yuelin Bai, Bo Li, Yubo Wang, King Zhu, Graham Neubig, Wenhu Chen, and Xiang Yue. MAmmoTH-VL: Eliciting multimodal reasoning with instruction tuning at scale. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics.

  8. [8]

    LLaVA-UHD: an LMM perceiving any aspect ratio and high-resolution images

    Zonghao Guo, Ruyi Xu, Yuan Yao, Junbo Cui, Zanlin Ni, Chunjiang Ge, Tat-Seng Chua, Zhiyuan Liu, and Gao Huang. LLaVA-UHD: an LMM perceiving any aspect ratio and high-resolution images. In European Conference on Computer Vision, 2024.

  9. [9]

    ChartLlama: A multimodal LLM for chart understanding and generation

    Yucheng Han, Chi Zhang, Xin Chen, Xu Yang, Zhibin Wang, Gang Yu, Bin Fu, and Hanwang Zhang. ChartLlama: A multimodal LLM for chart understanding and generation, 2023.

  10. [10]

    GLM-4.1V-Thinking: Towards versatile multimodal reasoning with scalable reinforcement learning

    Wenyi Hong, Wenmeng Yu, Xiaotao Gu, Guo Wang, Guobing Gan, Haomiao Tang, Jiale Cheng, Ji Qi, Junhui Ji, Lihang Pan, et al. GLM-4.1V-Thinking: Towards versatile multimodal reasoning with scalable reinforcement learning. arXiv e-prints, 2025.

  11. [11]

    FineMatch: Aspect-based fine-grained image and text mismatch detection and correction

    Hang Hua, Jing Shi, Kushal Kafle, Simon Jenni, Daoan Zhang, John Collomosse, Scott Cohen, and Jiebo Luo. FineMatch: Aspect-based fine-grained image and text mismatch detection and correction. In European Conference on Computer Vision, 2024.

  12. [12]

    MMComposition: Revisiting the compositionality of pre-trained vision-language models

    Hang Hua, Yunlong Tang, Ziyun Zeng, Liangliang Cao, Zhengyuan Yang, Hangfeng He, Chenliang Xu, and Jiebo Luo. MMComposition: Revisiting the compositionality of pre-trained vision-language models. arXiv preprint arXiv:2410.09733, 2024.

  13. [13]

    FineCaption: Compositional image captioning focusing on wherever you want at any granularity

    Hang Hua, Qing Liu, Lingzhi Zhang, Jing Shi, Soo Ye Kim, Zhifei Zhang, Yilin Wang, Jianming Zhang, Zhe Lin, and Jiebo Luo. FineCaption: Compositional image captioning focusing on wherever you want at any granularity. In Proceedings of the Computer Vision and Pattern Recognition Conference, 2025.

  14. [14]

    V2Xum-LLM: Cross-modal video summarization with temporal prompt instruction tuning

    Hang Hua, Yunlong Tang, Chenliang Xu, and Jiebo Luo. V2Xum-LLM: Cross-modal video summarization with temporal prompt instruction tuning. In Proceedings of the AAAI Conference on Artificial Intelligence, 2025.

  15. [15]

    MMIG-Bench: Towards comprehensive and explainable evaluation of multi-modal image generation models

    Hang Hua, Ziyun Zeng, Yizhi Song, Yunlong Tang, Liu He, Daniel Aliaga, Wei Xiong, and Jiebo Luo. MMIG-Bench: Towards comprehensive and explainable evaluation of multi-modal image generation models. arXiv preprint arXiv:2505.19415, 2025.

  16. [16]

    DAVE: A VLM vision encoder for document understanding and web agents

    Brandon Huang, Hang Hua, Zhuoran Yu, Trevor Darrell, Rogerio Feris, and Roei Herzig. DAVE: A VLM vision encoder for document understanding and web agents. arXiv preprint arXiv:2512.17221, 2025.

  17. [17]

    EvoChart: A benchmark and a self-training approach towards real-world chart understanding

    Muye Huang, Han Lai, Xinyu Zhang, Wenjun Wu, Jie Ma, Lingling Zhang, and Jun Liu. EvoChart: A benchmark and a self-training approach towards real-world chart understanding, 2025.

  18. [18]

    Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models

    Wenxuan Huang, Bohan Jia, Zijie Zhai, Shaosheng Cao, Zheyu Ye, Fei Zhao, Zhe Xu, Yao Hu, and Shaohui Lin. Vision-R1: Incentivizing reasoning capability in multimodal large language models. arXiv preprint arXiv:2503.06749.

  19. [19]

    GQA: A new dataset for real-world visual reasoning and compositional question answering

    Drew A Hudson and Christopher D Manning. GQA: A new dataset for real-world visual reasoning and compositional question answering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019.

  20. [20]

    CLEVR: A diagnostic dataset for compositional language and elementary visual reasoning

    Justin Johnson, Bharath Hariharan, Laurens Van Der Maaten, Li Fei-Fei, C Lawrence Zitnick, and Ross Girshick. CLEVR: A diagnostic dataset for compositional language and elementary visual reasoning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.

  21. [21]

    DVQA: Understanding data visualizations via question answering

    Kushal Kafle, Brian Price, Scott Cohen, and Christopher Kanan. DVQA: Understanding data visualizations via question answering, 2018.

  22. [22]

    FigureQA: An annotated figure dataset for visual reasoning

    Samira Ebrahimi Kahou, Vincent Michalski, Adam Atkinson, Akos Kadar, Adam Trischler, and Yoshua Bengio. FigureQA: An annotated figure dataset for visual reasoning.

  23. [23]

    OpenCQA: Open-ended question answering with charts

    Shankar Kantharaj, Xuan Long Do, Rixie Tiffany Ko Leong, Jia Qing Tan, Enamul Hoque, and Shafiq Joty. OpenCQA: Open-ended question answering with charts, 2022.

  24. [24]

    Chart-to-Text: A large-scale benchmark for chart summarization

    Shankar Kantharaj, Rixie Tiffany Ko Leong, Xiang Lin, Ahmed Masry, Megh Thakkar, Enamul Hoque, and Shafiq Joty. Chart-to-Text: A large-scale benchmark for chart summarization, 2022.

  25. [25]

    A diagram is worth a dozen images

    Aniruddha Kembhavi, Mike Salvato, Eric Kolve, Minjoon Seo, Hannaneh Hajishirzi, and Ali Farhadi. A diagram is worth a dozen images. In European Conference on Computer Vision, 2016.

  26. [26]

    ChartGen: Scaling chart understanding via code-guided synthetic chart generation

    Jovana Kondic, Pengyuan Li, Dhiraj Joshi, Zexue He, Shafiq Abedin, Jennifer Sun, Ben Wiesel, Eli Schwartz, Ahmed Nassar, Bo Wu, Assaf Arbelle, Aude Oliva, Dan Gutfreund, Leonid Karlinsky, and Rogerio Feris. ChartGen: Scaling chart understanding via code-guided synthetic chart generation, 2025.

  27. [27]

    Multimodal ArXiv: A dataset for improving scientific comprehension of large vision-language models

    Lei Li, Yuqi Wang, Runxin Xu, Peiyi Wang, Xiachong Feng, Lingpeng Kong, and Qi Liu. Multimodal ArXiv: A dataset for improving scientific comprehension of large vision-language models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Bangkok, Thailand, 2024. Association for Computational Linguistics.

  28. [28]

    ChartCap: Mitigating hallucination of dense chart captioning

    Junyoung Lim, Jaewoo Ahn, and Gunhee Kim. ChartCap: Mitigating hallucination of dense chart captioning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 2025.

  29. [29]

    MMC: Advancing multimodal chart understanding with large-scale instruction tuning

    Fuxiao Liu, Xiaoyang Wang, Wenlin Yao, Jianshu Chen, Kaiqiang Song, Sangwoo Cho, Yaser Yacoob, and Dong Yu. MMC: Advancing multimodal chart understanding with large-scale instruction tuning, 2024.

  30. [30]

    Improved baselines with visual instruction tuning

    Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning, 2023.

  31. [31]

    SynChart: Synthesizing charts from language models

    Mengchen Liu, Qixiu Li, Dongdong Chen, Dong Chen, Jianmin Bao, and Yunsheng Li. SynChart: Synthesizing charts from language models, 2024.

  32. [33]

    Docling: An efficient open-source toolkit for AI-driven document conversion

    Nikolaos Livathinos, Christoph Auer, Maksym Lysak, Ahmed Nassar, Michele Dolfi, Panos Vagenas, Cesar Berrospi Ramis, Matteo Omenetti, Kasper Dinkla, Yusik Kim, et al. Docling: An efficient open-source toolkit for AI-driven document conversion. arXiv preprint arXiv:2501.17887, 2025.

  33. [34]

    SmolVLM: Redefining small and efficient multimodal models

    Andrés Marafioti, Orr Zohar, Miquel Farré, Merve Noyan, Elie Bakouch, Pedro Cuenca, Cyril Zakka, Loubna Ben Allal, Anton Lozhkov, Nouamane Tazi, et al. SmolVLM: Redefining small and efficient multimodal models. arXiv preprint arXiv:2504.05299, 2025.

  34. [35]

    OK-VQA: A visual question answering benchmark requiring external knowledge

    Kenneth Marino, Mohammad Rastegari, Ali Farhadi, and Roozbeh Mottaghi. OK-VQA: A visual question answering benchmark requiring external knowledge. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019.

  35. [36]

    ChartQA: A benchmark for question answering about charts with visual and logical reasoning

    Ahmed Masry, Do Xuan Long, Jia Qing Tan, Shafiq Joty, and Enamul Hoque. ChartQA: A benchmark for question answering about charts with visual and logical reasoning, 2022.

  36. [37]

    UniChart: A universal vision-language pretrained model for chart comprehension and reasoning

    Ahmed Masry, Parsa Kavehzadeh, Xuan Long Do, Enamul Hoque, and Shafiq Joty. UniChart: A universal vision-language pretrained model for chart comprehension and reasoning, 2023.

  37. [38]

    ChartGemma: Visual instruction-tuning for chart reasoning in the wild

    Ahmed Masry, Megh Thakkar, Aayush Bajaj, Aaryaman Kartha, Enamul Hoque, and Shafiq Joty. ChartGemma: Visual instruction-tuning for chart reasoning in the wild, 2024.

  38. [39]

    ChartQAPro: A more diverse and challenging benchmark for chart question answering

    Ahmed Masry, Mohammed Saidul Islam, Mahir Ahmed, Aayush Bajaj, Firoz Kabir, Aaryaman Kartha, Md Tahmid Rahman Laskar, Mizanur Rahman, Shadikur Rahman, Mehrad Shahmohammadi, et al. ChartQAPro: A more diverse and challenging benchmark for chart question answering. arXiv preprint arXiv:2504.05506, 2025.

  39. [40]

    ChartAssistant: A universal chart multimodal language model via chart-to-table pre-training and multitask instruction tuning

    Fanqing Meng, Wenqi Shao, Quanfeng Lu, Peng Gao, Kaipeng Zhang, Yu Qiao, and Ping Luo. ChartAssistant: A universal chart multimodal language model via chart-to-table pre-training and multitask instruction tuning, 2024.

  40. [41]

    The Llama 4 herd: The beginning of a new era of natively multimodal AI innovation

    AI Meta. The Llama 4 herd: The beginning of a new era of natively multimodal AI innovation. https://ai.meta.com/blog/llama-4-multimodal-intelligence/, 2025.

  41. [42]

    PlotQA: Reasoning over scientific plots

    Nitesh Methani et al. PlotQA: Reasoning over scientific plots.

  42. [43]

    Scientific chart QA: A perspective from scientific literature

    Authors omitted here for brevity. Scientific chart QA: A perspective from scientific literature, 2024.

  43. [44]

    ChartGalaxy: A dataset for infographic chart understanding and generation

    Authors omitted here for brevity. ChartGalaxy: A dataset for infographic chart understanding and generation, 2025.

  44. [45]

    ChartReasoner: Code-driven modality bridging for long-context chart reasoning

    Authors omitted here for brevity. ChartReasoner: Code-driven modality bridging for long-context chart reasoning.

  45. [46]

    Our World in Data

    OWID. Our World in Data. https://ourworldindata.org/, 2025.

  46. [47]

    Pew Research Center

    Pew. Pew Research Center. https://www.pewresearch.org/, 2025.

  47. [48]

    Latent chain-of-thought for visual reasoning

    Guohao Sun, Hang Hua, Jian Wang, Jiebo Luo, Sohail Dianat, Majid Rabbani, Raghuveer Rao, and Zhiqiang Tao. Latent chain-of-thought for visual reasoning. arXiv preprint arXiv:2510.23925, 2025.

  48. [49]

    VisText: A benchmark for semantically rich chart captioning

    Benny J. Tang, Angie Boggust, and Arvind Satyanarayan. VisText: A benchmark for semantically rich chart captioning.

  49. [50]

    VidComposition: Can MLLMs analyze compositions in compiled videos?

    Yunlong Tang, Junjia Guo, Hang Hua, Susan Liang, Mingqian Feng, Xinyang Li, Rui Mao, Chao Huang, Jing Bi, Zeliang Zhang, et al. VidComposition: Can MLLMs analyze compositions in compiled videos? In Proceedings of the Computer Vision and Pattern Recognition Conference, 2025.

  50. [51]

    Granite Vision: a lightweight, open-source multimodal model for enterprise intelligence

    Granite Vision Team, Leonid Karlinsky, Assaf Arbelle, Abraham Daniels, Ahmed Nassar, Amit Alfassi, Bo Wu, Eli Schwartz, Dhiraj Joshi, Jovana Kondic, et al. Granite Vision: a lightweight, open-source multimodal model for enterprise intelligence. arXiv preprint arXiv:2502.09927, 2025.

  51. [52]

    Kimi-VL Technical Report

    Kimi Team, Angang Du, Bohong Yin, Bowei Xing, Bowen Qu, Bowen Wang, Cheng Chen, Chenlin Zhang, Chenzhuang Du, Chu Wei, et al. Kimi-VL technical report. arXiv preprint arXiv:2504.07491, 2025.

  52. [53]

    Qwen3 technical report

    Qwen Team. Qwen3 technical report, 2025.

  53. [54]

    TRL: Transformer Reinforcement Learning

    Leandro von Werra, Younes Belkada, Lewis Tunstall, Edward Beeching, Tristan Thrush, Nathan Lambert, Shengyi Huang, Kashif Rasul, and Quentin Gallouédec. TRL: Transformer reinforcement learning. https://github.com/huggingface/trl, 2020.

  54. [55]

    CharXiv: Charting gaps in realistic chart understanding in multimodal LLMs

    Zirui Wang, Mengzhou Xia, Luxi He, Howard Chen, Yitao Liu, Richard Zhu, Kaiqu Liang, Xindi Wu, Haotian Liu, Sadhika Malladi, Alexis Chevalier, Sanjeev Arora, and Danqi Chen. CharXiv: Charting gaps in realistic chart understanding in multimodal LLMs, 2024.

  55. [56]

    World Bank Open Data

    World Bank. World Bank open data. https://www.worldbank.org/, 2025.

  56. [57]

    Plot2Code: A comprehensive benchmark for evaluating multi-modal large language models in code generation from scientific plots

    Chengyue Wu, Yixiao Ge, Qiushan Guo, Jiahao Wang, Zhixuan Liang, Zeyu Lu, Ying Shan, and Ping Luo. Plot2Code: A comprehensive benchmark for evaluating multi-modal large language models in code generation from scientific plots.

  58. [59]

    Chartx & chartvlm: A versatile benchmark and foundation model for complicated chart rea- soning, 2025

    Renqiu Xia, Bo Zhang, Hancheng Ye, Xiangchao Yan, Qi Liu, Hongbin Zhou, Zijun Chen, Min Dou, Botian Shi, Junchi Yan, and Yu Qiao. Chartx & chartvlm: A versatile benchmark and foundation model for complicated chart rea- soning, 2025. 2

  59. [60]

    Llava-cot: Let vision language models reason step-by-step

    Guowei Xu, Peng Jin, Ziang Wu, Hao Li, Yibing Song, Lichao Sun, and Li Yuan. Llava-cot: Let vision language models reason step-by-step. InProceedings of the IEEE/CVF International Conference on Computer Vision, 2025. 4, I

  60. [61]

    Chartbench: A benchmark for complex visual reasoning in charts, 2024

    Zhengzhuo Xu, Sinan Du, Yiyan Qi, Chengjin Xu, Chun Yuan, and Jian Guo. Chartbench: A benchmark for complex visual reasoning in charts, 2024. 2

  61. [62]

    Chartmimic: Evaluating lmm’s cross-modal reason- ing capability via chart-to-code generation, 2025

    Cheng Yang, Chufan Shi, Yaxin Liu, Bo Shui, Junjie Wang, Mohan Jing, Linran Xu, Xinyu Zhu, Siheng Li, Yuxiang Zhang, Gongye Liu, Xiaomei Nie, Deng Cai, and Yujiu Yang. Chartmimic: Evaluating lmm’s cross-modal reason- ing capability via chart-to-code generation, 2025. 2, 7, XI

  62. [63]

    Scaling text-rich image understanding via code- guided synthetic multimodal data generation, 2025

    Yue Yang, Ajay Patel, Matt Deitke, Tanmay Gupta, Luca Weihs, Andrew Head, Mark Yatskar, Chris Callison-Burch, Ranjay Krishna, Aniruddha Kembhavi, and Christopher Clark. Scaling text-rich image understanding via code- guided synthetic multimodal data generation, 2025. 2, 3

  63. [64]

    Tinychart: Efficient chart understanding with visual token merging and program- of-thoughts learning, 2024

    Liang Zhang, Anwen Hu, Haiyang Xu, Ming Yan, Yichen Xu, Qin Jin, Ji Zhang, and Fei Huang. Tinychart: Efficient chart understanding with visual token merging and program- of-thoughts learning, 2024. 2, 3

  64. [65]

    Gpt-4v(ision) as a generalist evalu- ator for vision-language tasks, 2023

    Xinlu Zhang, Yujie Lu, Weizhi Wang, An Yan, Jun Yan, Lianke Qin, Heng Wang, Xifeng Yan, William Yang Wang, and Linda Ruth Petzold. Gpt-4v(ision) as a generalist evalu- ator for vision-language tasks, 2023. XI

  65. [66]

    Chartcoder: Advancing multimodal large language model for chart-to- code generation, 2025

    Xuanle Zhao, Xianzhen Luo, Qi Shi, Chi Chen, Shuo Wang, Wanxiang Che, Zhiyuan Liu, and Maosong Sun. Chartcoder: Advancing multimodal large language model for chart-to- code generation, 2025. 2

  66. [67]

    InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models

    Jinguo Zhu, Weiyun Wang, Zhe Chen, Zhaoyang Liu, Shen- glong Ye, Lixin Gu, Hao Tian, Yuchen Duan, Weijie Su, Jie Shao, et al. Internvl3: Exploring advanced training and test-time recipes for open-source multimodal models.arXiv preprint arXiv:2504.10479, 2025. 2

  67. [68]

    Multimodal c4: An open, billion-scale corpus of images interleaved with text

    Wanrong Zhu, Jack Hessel, Anas Awadalla, Samir Yitzhak Gadre, Jesse Dodge, Alex Fang, Youngjae Yu, Ludwig Schmidt, William Yang Wang, and Yejin Choi. Multimodal C4: An open, billion-scale corpus of images interleaved with text.arXiv preprint arXiv:2304.06939, 2023. 3 11 ChartNet: A Million-Scale, High-Quality Multimodal Dataset for Robust Chart Understand...

Showing the first 67 of 131 extracted references.