PlotChain: Deterministic Checkpointed Evaluation of Multimodal LLMs on Engineering Plot Reading
Pith reviewed 2026-05-16 10:22 UTC · model grok-4.3
The pith
PlotChain introduces a deterministic generator-based benchmark that scores multimodal LLMs on recovering exact quantitative values from 15 families of engineering plots with intermediate checkpoints for sub-skill diagnosis.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
PlotChain supplies 15 plot families with 450 rendered plots produced from known parameters, each paired with exact ground truth computed directly from the generator and augmented by intermediate checkpoint fields that isolate sub-skills such as reading cutoff frequency or peak magnitude. Under standardized zero-temperature JSON-only evaluation and per-field tolerances that approximate human plot-reading precision, Gemini 2.5 Pro, GPT-4.1, and Claude Sonnet 4.5 reach overall field-level pass rates of 80.42 percent, 79.84 percent, and 78.21 percent respectively, while GPT-4o reaches 61.59 percent; frequency-domain tasks remain weak, with bandpass response at or below 23 percent.
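The generator-based design described above can be sketched as follows. The plot family, parameter ranges, and field names here are illustrative assumptions, not the paper's actual schema; the point is that ground truth, including the intermediate 'cp_' checkpoint fields, follows analytically from the seeded parameters rather than from reading the rendered image back.

```python
import json
import math
import random

def make_lowpass_item(seed: int) -> dict:
    """Hypothetical sketch of one PlotChain-style item: parameters are
    drawn deterministically from a seed, and exact ground truth
    (including 'cp_' checkpoint fields) is computed directly from
    those parameters."""
    rng = random.Random(seed)            # seeded -> fully reproducible
    fc = rng.uniform(10.0, 10_000.0)     # cutoff frequency in Hz
    gain_db = rng.uniform(-10.0, 10.0)   # passband gain in dB

    # First-order lowpass: |H(f)| = gain / sqrt(1 + (f/fc)^2),
    # so the value at 10*fc is gain_db - 20*log10(sqrt(101)).
    truth = {
        "cp_cutoff_hz": fc,                 # checkpoint sub-skill
        "cp_passband_gain_db": gain_db,     # checkpoint sub-skill
        "gain_at_10fc_db": gain_db - 20 * math.log10(math.hypot(1.0, 10.0)),
    }
    return {"seed": seed,
            "params": {"fc": fc, "gain_db": gain_db},
            "truth": truth}

item = make_lowpass_item(seed=7)
print(json.dumps(item["truth"], indent=2))
```

Because every field of `truth` is a closed-form function of the seeded parameters, rescoring or regenerating the dataset is exactly reproducible, which is what makes retrospective rescoring under new tolerance policies possible.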
What carries the argument
The checkpoint-based diagnostic evaluation: intermediate 'cp_' fields isolate sub-skills within each plot family and enable precise localization of model failures.
If this is right
- Models can be diagnosed for specific weaknesses such as frequency-domain extraction rather than receiving only an aggregate score.
- Frequency-domain plots like bandpass response and FFT spectra remain substantially harder than time-domain or static curves for current multimodal models.
- The released generator, dataset, raw outputs, and scoring code enable fully reproducible runs and allow rescoring under any alternative tolerance policy.
- Checkpoint fields support targeted improvement of sub-skills without requiring new full-plot evaluations.
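The sub-skill diagnosis these bullets describe amounts to aggregating pass/fail outcomes per checkpoint field rather than per plot. A minimal sketch, with the record format and field names assumed for illustration:

```python
from collections import defaultdict

def localize_failures(results):
    """Aggregate per-field pass/fail records into per-checkpoint pass
    rates, so a weak sub-skill (e.g. reading a cutoff frequency)
    stands out even when the overall score looks healthy.
    'results' is a list of {field_name: passed} dicts, one per plot."""
    passes, totals = defaultdict(int), defaultdict(int)
    for record in results:
        for field, ok in record.items():
            totals[field] += 1
            passes[field] += ok
    return {field: passes[field] / totals[field] for field in totals}

rates = localize_failures([
    {"cp_cutoff_hz": True,  "cp_peak_db": True},
    {"cp_cutoff_hz": False, "cp_peak_db": True},
    {"cp_cutoff_hz": False, "cp_peak_db": True},
    {"cp_cutoff_hz": True,  "cp_peak_db": False},
])
print(rates)  # cutoff reading passes 50%, peak reading 75%
```

The same aggregation over a whole family is what lets a low bandpass-response score be traced to a specific checkpoint rather than reported only as an aggregate number.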
Where Pith is reading between the lines
- Future model training could incorporate synthetic plots generated in the same style to close the performance gap on frequency tasks.
- The deterministic checkpoint structure could be applied to other visual reasoning domains such as circuit diagrams or mechanical drawings.
- Releasing the exact generation parameters allows independent verification that no test leakage occurred during model training.
Load-bearing premise
The 15 selected plot families and the chosen per-field tolerances accurately capture the precision required for real engineering plot-reading tasks.
What would settle it
An independent rerun of the released generator and scoring code on the same 450 plots, under the identical zero-temperature JSON-only protocol: the reported figures would be undermined if any of the four models' field-level pass rates differed from them by more than a few percentage points, and confirmed otherwise.
Figures
Original abstract
We present PlotChain, a deterministic, generator-based benchmark for evaluating multimodal large language models (MLLMs) on engineering plot reading: recovering quantitative values from classic plots (e.g., Bode/FFT, step response, stress-strain, pump curves) rather than OCR-only extraction or free-form captioning. PlotChain contains 15 plot families with 450 rendered plots (30 per family), where every item is produced from known parameters and paired with exact ground truth computed directly from the generating process. A central contribution is checkpoint-based diagnostic evaluation: in addition to final targets, each item includes intermediate 'cp_' fields that isolate sub-skills (e.g., reading cutoff frequency or peak magnitude) and enable failure localization within a plot family. We evaluate four state-of-the-art MLLMs under a standardized, deterministic protocol (temperature = 0 and a strict JSON-only numeric output schema) and score predictions using per-field tolerances designed to reflect human plot-reading precision. Under the 'plotread' tolerance policy, the top models achieve 80.42% (Gemini 2.5 Pro), 79.84% (GPT-4.1), and 78.21% (Claude Sonnet 4.5) overall field-level pass rates, while GPT-4o trails at 61.59%. Despite strong performance on many families, frequency-domain tasks remain brittle: bandpass response stays low (<= 23%), and FFT spectrum remains challenging. We release the generator, dataset, raw model outputs, scoring code, and manifests with checksums to support fully reproducible runs and retrospective rescoring under alternative tolerance policies.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces PlotChain, a deterministic generator-based benchmark for evaluating multimodal LLMs on quantitative engineering plot reading. It comprises 15 plot families with 450 rendered plots, each paired with exact ground truth computed from the generating process and checkpointed intermediate fields for sub-skill diagnosis. Four state-of-the-art MLLMs are tested under a fixed temperature-0, JSON-only protocol using per-field 'plotread' tolerances, yielding overall field-level pass rates of 80.42% (Gemini 2.5 Pro), 79.84% (GPT-4.1), 78.21% (Claude Sonnet 4.5), and 61.59% (GPT-4o), with brittleness noted in frequency-domain tasks such as bandpass response (<=23%) and FFT spectrum.
Significance. If the tolerances are validated as matching human precision, the benchmark would provide a useful, fully reproducible framework for assessing MLLM performance on a practical engineering skill, with checkpointing enabling precise failure localization. The explicit release of the generator, dataset, raw outputs, scoring code, and checksummed manifests is a clear strength that supports retrospective rescoring and community extensions.
Major comments (1)
- [Evaluation Protocol] The 'plotread' tolerance policy is presented as reflecting human plot-reading precision, yet the manuscript provides no human expert annotations, inter-rater agreement metrics, or calibration study on the 450 plots. This is load-bearing for the central claims because the reported pass rates (e.g., 80.42% for Gemini 2.5 Pro) and the identification of frequency-domain brittleness (bandpass <=23%, FFT challenges) depend directly on the chosen numeric windows; altering any tolerance band would change rankings and conclusions.
Minor comments (1)
- [Abstract] Exact prompt templates and full exclusion rules are not visible in the abstract and should be included in the main text or appendix to support full reproducibility.
Simulated Author's Rebuttal
We thank the referee for highlighting the importance of validating the 'plotread' tolerances. We address this major comment below and outline planned revisions.
Point-by-point responses
- Referee: The 'plotread' tolerance policy is presented as reflecting human plot-reading precision, yet the manuscript provides no human expert annotations, inter-rater agreement metrics, or calibration study on the 450 plots. This is load-bearing for the central claims because the reported pass rates (e.g., 80.42% for Gemini 2.5 Pro) and the identification of frequency-domain brittleness (bandpass <=23%, FFT challenges) depend directly on the chosen numeric windows; altering any tolerance band would change rankings and conclusions.
  Authors: We agree that the absence of a human calibration study or inter-rater metrics is a genuine limitation. The tolerances were set heuristically based on standard engineering plot-reading practices (e.g., 5% relative error for most continuous fields such as frequencies and gains, with absolute tolerances for near-zero values), but they were not empirically validated against human experts. In the revised manuscript we will: (1) explicitly describe the tolerances as heuristic rather than human-validated; (2) add an appendix table listing the exact tolerance formula and parameters for every field; (3) highlight that the released scoring code and manifests allow any user to recompute all scores under alternative tolerance policies; and (4) include a new sensitivity analysis showing how pass rates and the frequency-domain brittleness conclusion change under relaxed or tightened bands. We cannot, however, add human annotations or a calibration study without conducting new experiments that lie outside the scope of the current work.
  Revision: partial
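The heuristic tolerance rule the authors describe (roughly 5% relative error for continuous fields, with an absolute window for near-zero truths) could be sketched like this; the exact values are assumptions, not the paper's published parameters:

```python
def field_passes(pred: float, truth: float,
                 rel_tol: float = 0.05, abs_tol: float = 1e-3) -> bool:
    """Sketch of a 'plotread'-style per-field check: ~5% relative
    error for most continuous fields, falling back to an absolute
    window when the true value is near zero. Tolerance values here
    are illustrative assumptions."""
    if abs(truth) < abs_tol:                     # near-zero: absolute window
        return abs(pred - truth) <= abs_tol
    return abs(pred - truth) <= rel_tol * abs(truth)

print(field_passes(1040.0, 1000.0))  # 4% off a 1 kHz cutoff -> True
print(field_passes(1060.0, 1000.0))  # 6% off -> False
print(field_passes(0.0004, 0.0))     # near zero, inside abs window -> True
```

Because the rule is a pure function of `(pred, truth, rel_tol, abs_tol)`, the sensitivity analysis the authors promise reduces to re-running the scorer over a grid of `rel_tol` values, which the released scoring code and manifests would make mechanical.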
Circularity Check
No significant circularity; benchmark uses external generator ground truth
Full rationale
The paper presents an empirical evaluation benchmark. Plots are generated from explicit parameters, ground truth is computed directly from the generating process, and model outputs are compared to this independent GT under a fixed tolerance policy. No equations, derivations, or predictions reduce the reported pass rates to fitted inputs or self-referential definitions. The protocol is deterministic, fully specified, and released with code and data for external reproduction. No load-bearing self-citations or ansatzes are invoked for the core claims.
Axiom & Free-Parameter Ledger
Free parameters (1)
- 'plotread' per-field tolerance values
Axioms (1)
- Domain assumption: temperature = 0 and a strict JSON-only schema produce deterministic, parseable model outputs
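The parseability axiom is checkable mechanically before scoring: a reply either parses as JSON with the required numeric fields or is counted as a failure. A minimal validator sketch, with hypothetical required field names:

```python
import json

REQUIRED = ("cp_cutoff_hz", "cp_peak_db")   # hypothetical schema fields

def parse_numeric_json(raw: str, required=REQUIRED):
    """Validate a model reply against a strict JSON-only numeric
    schema. Returns the parsed dict, or None when the reply is not
    parseable JSON, not an object, or a required field is non-numeric."""
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(obj, dict):
        return None
    for key in required:
        value = obj.get(key)
        # bool is an int subclass in Python, so exclude it explicitly
        if not isinstance(value, (int, float)) or isinstance(value, bool):
            return None
    return obj

print(parse_numeric_json('{"cp_cutoff_hz": 1000, "cp_peak_db": 3.2}'))
print(parse_numeric_json('The cutoff is about 1 kHz.'))  # None: not JSON
```

Counting `None` results separately from tolerance failures would also let a rerun test the axiom itself, i.e., how often temperature-0 decoding actually yields schema-conformant output.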