Note on the sampling error of the difference between correlated proportions or percentages
2 Pith papers cite this work.
Years: 2026 (2)
citing papers explorer
- PlotChain: Deterministic Checkpointed Evaluation of Multimodal LLMs on Engineering Plot Reading
PlotChain benchmark reports top MLLMs reaching ~80% field-level accuracy on engineering plot reading under human-like tolerances, but with persistent failures on frequency-domain tasks like bandpass and FFT spectra.
- Will It Break in Production? Metric-Driven Prediction of Residual Defects in Python Systems
Supervised models using 83 metrics achieve 0.85-0.9 recall for post-release Python faults, outperforming LLMs, with process metrics and code size most predictive and metrics plus embeddings capturing complementary information.