arxiv: 2605.23694 · v1 · pith:INKBZ3E5new · submitted 2026-05-22 · 💻 cs.CL

ChartFI: Benchmarking Faithfulness and Insightfulness of Chart Descriptions from Multimodal Large Language Models

Pith reviewed 2026-05-25 04:11 UTC · model grok-4.3

classification 💻 cs.CL
keywords: chart descriptions, multimodal large language models, benchmark, faithfulness evaluation, insightfulness, evaluation metrics, MLLM assessment

The pith

ChartFI-Bench introduces four dimensions and aligned metrics to evaluate how faithfully and insightfully MLLMs describe charts.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper constructs ChartFI-Bench to overcome limitations in prior datasets and metrics for assessing chart descriptions produced by multimodal large language models. It identifies four dimensions of quality—factual accuracy, salient feature emphasis, domain-informed guidance, and chart-text complementarity—and builds a set of 896 pairs featuring complex charts paired with rich descriptions. Four corresponding metrics then measure performance along these lines. Experiments apply the benchmark to mainstream MLLMs, confirming the framework works while exposing recurring shortcomings in the models. Readers should care because chart descriptions support accessibility and insight extraction, yet current evaluation methods do not reliably track those capabilities.

Core claim

Existing benchmarks use simple charts and shallow fact-listing descriptions, so they cannot adequately test MLLM output; ChartFI-Bench supplies 896 visually complex chart-description pairs built around the four quality dimensions, supplies four aligned metrics (Faithfulness, Coverage, Informativeness, Acuity), and shows through experiments that the new framework detects common model weaknesses.

What carries the argument

ChartFI-Bench dataset of 896 pairs together with the four metrics (Faithfulness, Coverage, Informativeness, Acuity) that operationalize the four quality dimensions.

If this is right

  • Existing MLLMs exhibit measurable shortfalls in factual accuracy, salient feature emphasis, domain-informed guidance, and chart-text complementarity when describing charts.
  • The four metrics enable systematic comparison of future models against the benchmark.
  • Improved chart descriptions can directly aid accessibility tools and cross-modal retrieval systems.
  • The benchmark construction process itself supplies a template for creating richer evaluation sets in related multimodal tasks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The metrics could be used as reward signals during fine-tuning to push models toward higher-quality descriptions.
  • The same dimension-based approach might transfer to evaluating descriptions of other data visualizations such as graphs or diagrams.
  • Deployment of the benchmark in accessibility pipelines would let developers track progress on real-world usefulness rather than proxy tasks.

Load-bearing premise

The four dimensions fully characterize what makes a chart description high-quality.

What would settle it

Human experts rating the same descriptions on the same charts produce rankings that systematically disagree with scores from the four proposed metrics.
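
One way to run such a check is a rank correlation between human ratings and metric scores for the same descriptions. The sketch below computes Spearman's rho with the standard library only. The rating and score values are invented for illustration and do not come from the paper, and the helper names are hypothetical.

```python
from statistics import mean


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        tied_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = tied_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman's rho: the Pearson correlation of the two rank vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("rho is undefined when one input is constant")
    return cov / (var_x * var_y) ** 0.5


# Invented data: expert ratings and one metric's scores for five descriptions.
human_ratings = [4.5, 3.0, 2.0, 5.0, 1.5]
metric_scores = [0.81, 0.60, 0.55, 0.90, 0.40]
rho = spearman(human_ratings, metric_scores)
```

A rho near 1 would mean the metric orders descriptions as the experts do; values near zero or below would be the systematic disagreement described above.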

Figures

Figures reproduced from arXiv: 2605.23694 by Chao Liu, Chunran Hu, Fen Wang, Lexu Xie, Qiman Kang, Siming Chen, Zekai Shao, Zhixuan Zhang.

Figure 1: Overview of the benchmark construction pipeline, consisting of three stages: dataset collection from academic papers, chart filtering with …
Figure 2: Statistics of the ChartFI-Bench: the left shows the number of …
Figure 3: Performance comparison and error analysis across methods on …

Original abstract

Chart descriptions are essential for accessibility, cross-modal retrieval, and assisting readers in extracting insights from complex visualizations. As multimodal large language models (MLLMs) are increasingly adopted for automated chart description generation, a critical question arises: how faithfully and insightfully do these models actually describe charts? Current benchmarks fall short on two fronts: existing datasets consist of simple, homogeneous charts paired with shallow, fact-enumerating descriptions; and prevailing metrics fail to capture the multi-faceted nature of description quality. To address these gaps, we present the Chart Faithfulness and Insightfulness Benchmark (ChartFI-Bench). We first summarize four dimensions that characterize high-quality chart descriptions: factual accuracy, salient feature emphasis, domain-informed guidance, and chart-text complementarity. Guided by these dimensions, we construct a high-quality benchmark comprising 896 chart-description pairs, which feature visually complex charts and semantically rich descriptions. Furthermore, we design four aligned evaluation metrics -- Faithfulness, Coverage, Informativeness, and Acuity -- to systematically assess the quality of descriptions across these dimensions. Experiments conducted on mainstream MLLMs demonstrate the effectiveness of the proposed framework and reveal common weaknesses among existing models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The paper introduces ChartFI-Bench, a benchmark of 896 chart-description pairs featuring visually complex charts and semantically rich descriptions. It summarizes four dimensions claimed to characterize high-quality chart descriptions (factual accuracy, salient feature emphasis, domain-informed guidance, chart-text complementarity) that guide benchmark construction and the design of four aligned metrics (Faithfulness, Coverage, Informativeness, Acuity). Experiments on mainstream MLLMs are reported to demonstrate the framework's effectiveness and reveal common model weaknesses.

Significance. If the dimensions are justified and the metrics validated against human judgments, the benchmark could improve evaluation of MLLM chart descriptions beyond existing simple datasets and shallow metrics, supporting accessibility and insight extraction tasks. The scale and complexity of the pairs represent a concrete advance, but significance is reduced by the absence of validation for the guiding dimensions.

major comments (1)
  1. [Abstract] The four dimensions are asserted to characterize high-quality chart descriptions and explicitly guide both benchmark construction and the four metrics, yet no derivation from prior visualization or accessibility literature, coverage argument, or validation (e.g., correlation with overall human quality ratings or inter-rater agreement) is supplied. This is load-bearing for the central claim that experiments demonstrate framework effectiveness.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the careful review and constructive comment. We address the concern regarding justification of the four dimensions below and commit to revisions that strengthen the manuscript.

point-by-point responses
  1. Referee: [Abstract] The four dimensions are asserted to characterize high-quality chart descriptions and explicitly guide both benchmark construction and the four metrics, yet no derivation from prior visualization or accessibility literature, coverage argument, or validation (e.g., correlation with overall human quality ratings or inter-rater agreement) is supplied. This is load-bearing for the central claim that experiments demonstrate framework effectiveness.

    Authors: We agree that the current presentation does not sufficiently derive or validate the four dimensions (factual accuracy, salient feature emphasis, domain-informed guidance, and chart-text complementarity). In the revised version we will add a new subsection (likely in Section 2 or 3) that explicitly grounds each dimension in prior visualization literature (e.g., Bertin’s visual variables, Tufte’s data-ink ratio and graphical excellence, and Cleveland & McGill’s perceptual rankings) as well as accessibility guidelines (W3C WCAG and chart-specific recommendations from the visualization accessibility community). We will also include a brief coverage argument showing how these dimensions collectively address gaps in existing chart-description evaluation. To address validation, we will conduct and report a small human study (n=30–50 raters) measuring correlation between the four dimension scores and overall quality ratings, plus inter-rater agreement (Cohen’s/Fleiss’ kappa). These additions will directly support the claim that the framework and experiments are effective.

    Revision: yes
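
The inter-rater agreement statistic named in the response can be sketched for the two-rater case as follows. This is an illustrative implementation of Cohen's kappa using only the standard library. The labels and ratings are invented, and the simulated rebuttal does not specify how the proposed study would actually be scored.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters assigning categorical labels to the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: fraction of items given the same label by both raters.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    expected = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    if expected == 1.0:
        # Both raters used one identical label throughout; treat as full agreement.
        return 1.0
    return (observed - expected) / (1 - expected)


# Invented ratings of four chart descriptions by two hypothetical raters.
ratings_a = ["good", "good", "poor", "poor"]
ratings_b = ["good", "poor", "poor", "poor"]
kappa = cohens_kappa(ratings_a, ratings_b)
```

With more than two raters, as in the 30 to 50 rater study the response proposes, Fleiss' kappa would be the relevant variant; it is not shown here.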

Circularity Check

0 steps flagged

No significant circularity; benchmark and metrics constructed independently

The paper summarizes four dimensions and uses them to guide benchmark construction and metric design, but this does not constitute circularity under the defined patterns. There are no equations, fitted parameters renamed as predictions, self-citations that are load-bearing for the central claim, uniqueness theorems imported from prior author work, or ansatzes smuggled via citation. The derivation chain consists of an asserted starting point (the dimensions) followed by independent construction of the dataset and metrics; no step reduces a result to its own inputs by construction. This is a standard benchmark paper whose claims rest on the new artifacts rather than self-referential reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Only the abstract is available; the central claim rests on the domain assumption that the four listed dimensions fully capture description quality and that the constructed pairs are representative.

axioms (1)
  • Domain assumption: The four dimensions (factual accuracy, salient feature emphasis, domain-informed guidance, chart-text complementarity) characterize high-quality chart descriptions.
    Basis: the abstract states that these dimensions guided benchmark construction.

pith-pipeline@v0.9.0 · 5758 in / 1179 out tokens · 29902 ms · 2026-05-25T04:11:23.892290+00:00 · methodology
