pith. machine review for the scientific record.

arxiv: 2604.28039 · v1 · submitted 2026-04-30 · 💻 cs.AI

Recognition: unknown

SpecVQA: A Benchmark for Spectral Understanding and Visual Question Answering in Scientific Images

Authors on Pith: no claims yet

Pith reviewed 2026-05-07 07:30 UTC · model grok-4.3

classification 💻 cs.AI
keywords SpecVQA · spectral images · visual question answering · multimodal models · scientific benchmarks · curve data processing · domain reasoning

The pith

SpecVQA supplies 3100 expert-annotated questions on 620 spectral figures to test multimodal models on scientific spectra.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to build and validate SpecVQA as a benchmark that measures how well multimodal large language models can read and reason about scientific spectra, a common but dense form of experimental imagery. It assembles 620 figures drawn from peer-reviewed papers across seven spectrum types together with 3100 expert-annotated question-answer pairs; the pairs require both direct extraction of values from the curves and application of field-specific knowledge. A sampling-plus-interpolation method is introduced to shrink the token count of each spectrum while keeping the essential shape intact, and ablation results indicate that this step lifts model performance on the benchmark. The authors then run leading multimodal models through the full set of questions and publish a leaderboard that exposes current gaps. If the benchmark is sound, it supplies a concrete yardstick for progress toward AI systems that can assist with real laboratory data analysis.
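To make the compression step concrete, here is a minimal sketch of RDP-style sampling followed by spline reconstruction, in the spirit of the method the review describes (the paper's Figure 2 text names the Ramer-Douglas-Peucker algorithm). The function names, the epsilon value, and the synthetic spectrum are illustrative stand-ins, not the authors' implementation.

    import numpy as np
    from scipy.interpolate import CubicSpline

    def rdp(points, epsilon):
        # Ramer-Douglas-Peucker: recursively keep only the points needed to
        # approximate the curve within perpendicular-distance epsilon.
        start, end = points[0], points[-1]
        chord = end - start
        norm = np.hypot(chord[0], chord[1])
        if norm == 0.0:
            dists = np.hypot(points[:, 0] - start[0], points[:, 1] - start[1])
        else:
            # Perpendicular distance of each point from the start-end chord.
            dists = np.abs(chord[0] * (points[:, 1] - start[1])
                           - chord[1] * (points[:, 0] - start[0])) / norm
        idx = int(np.argmax(dists))
        if dists[idx] > epsilon:
            left = rdp(points[: idx + 1], epsilon)
            right = rdp(points[idx:], epsilon)
            return np.vstack([left[:-1], right])  # drop the duplicated split point
        return np.vstack([start, end])

    def compress_spectrum(x, y, epsilon):
        # Sample key points with RDP, then reconstruct a smooth curve by cubic
        # interpolation so downstream tokenization sees far fewer points.
        pts = rdp(np.column_stack([x, y]), epsilon)
        return pts, CubicSpline(pts[:, 0], pts[:, 1])

    # Illustrative input: a noisy absorption-like peak on 2000 raw points.
    x = np.linspace(300.0, 800.0, 2000)
    y = np.exp(-((x - 450.0) / 20.0) ** 2)
    y += 0.01 * np.random.default_rng(0).normal(size=x.size)
    pts, recon = compress_spectrum(x, y, epsilon=0.05)
    print(f"{x.size} raw points -> {pts.shape[0]} sampled points")

The epsilon threshold trades fidelity for token count: raising it drops more points but risks erasing narrow peaks, which is exactly the failure mode the referee report below probes.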

Core claim

SpecVQA is a benchmark of 620 figures and 3100 expert-annotated QA pairs spanning seven representative spectrum types, constructed to evaluate multimodal models on both direct information extraction from spectral curves and domain-specific scientific reasoning. The work also presents a spectral data sampling and interpolation reconstruction technique that shortens input sequences while retaining curve characteristics, with ablations confirming measurable accuracy gains when the method is applied.

What carries the argument

The SpecVQA benchmark dataset together with the spectral data sampling and interpolation reconstruction method that compresses curve information for multimodal model input.

If this is right

  • Models that improve on SpecVQA can be expected to handle a wider range of experimental plots with fewer errors in value extraction and interpretation.
  • The sampling and interpolation technique can be reused on other high-resolution curve data to lower token budgets without retraining the underlying model.
  • The published leaderboard offers an immediate ranking that future multimodal models can be compared against on the same spectral tasks.
  • Development of domain-adapted models can use the 3100 QA pairs as supervised training data for spectral reasoning.
  • The benchmark structure, with separate direct-extraction and reasoning questions, makes it possible to diagnose whether a model fails at reading the image or at applying the science; a minimal scoring sketch follows this list.
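
A minimal sketch of that diagnostic split, assuming a simple record schema ('qtype', 'pred', 'gold') that is hypothetical rather than SpecVQA's actual release format:

    from collections import defaultdict

    def accuracy_by_question_type(records):
        # Score extraction and reasoning questions separately so a failure can
        # be localized to image reading or to domain knowledge.
        # records: iterable of dicts with keys 'qtype', 'pred', 'gold'.
        hits, totals = defaultdict(int), defaultdict(int)
        for r in records:
            totals[r["qtype"]] += 1
            hits[r["qtype"]] += int(r["pred"] == r["gold"])
        return {t: hits[t] / totals[t] for t in totals}

    # Illustrative usage with made-up model outputs.
    demo = [
        {"qtype": "extraction", "pred": "380 nm", "gold": "380 nm"},
        {"qtype": "extraction", "pred": "450 nm", "gold": "380 nm"},
        {"qtype": "reasoning", "pred": "bandgap narrows", "gold": "bandgap narrows"},
    ]
    print(accuracy_by_question_type(demo))  # {'extraction': 0.5, 'reasoning': 1.0}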

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • A natural next step is to enlarge the benchmark with additional spectrum types or with time-series and multi-panel figures that appear in the same papers.
  • The same sampling method could be tested on non-spectral scientific images such as diffraction patterns or chromatograms to check whether the compression benefit generalizes.
  • If models trained or evaluated on SpecVQA are later deployed on live instrument data, their error patterns on the benchmark may predict the kinds of mistakes they make in a laboratory setting.
  • The expert annotation process itself could be turned into a protocol that other scientific communities use to create similar narrow-domain visual-question benchmarks.

Load-bearing premise

The expert-written questions and answers correctly capture both the visual facts in each spectrum and the scientific reasoning steps needed to answer them, and the sampling step does not alter the curves in ways that change the correct answers.

What would settle it

A controlled test in which new spectra and questions are drawn from recent papers not used in the benchmark, with the interpolation method held fixed: if a model that scores high on SpecVQA also answers the fresh set accurately, the benchmark is measuring a transferable capability; if it produces systematically wrong answers, high scores reflect benchmark-specific artifacts.

Figures

Figures reproduced from arXiv: 2604.28039 by Changhong Chen, Han Lyu, Hanzheng Li, Haoyi Tao, Jialu Shen, Nan Wang, Suyang Zhong, Xi Fang.

Figure 1
Figure 1: Overview of data curation and the proposed SpecVQA benchmark. Stage 1 automatically harvests high-quality figure–caption pairs from peer-reviewed journals. Stage 2 performs multi-stage filtering and spectrum-type classification, yielding a curated set of 20k spectrum–caption pairs under expert supervision. Stage 3 leverages MLLMs to distill QA pairs from the curated figures; domain experts then re-annotat… view at source ↗
Figure 2
Figure 2: Comparison of the noisy real data curve, sampled points, and the reconstructed smooth curve. Four examples from the experiments are presented. RDP is a line simplification algorithm that recursively identifies and keeps only the most crucial points required to approximate the curve within a specified maximum distance threshold ϵ. view at source ↗
Figure 3
Figure 3: Two examples from the Scientific QA part of the SpecVQA benchmark. Each example contains one spectral figure and five QA pairs in different languages. Here we show the English QA pairs. view at source ↗
Figure 4
Figure 4: Comparison between real spectral images and generated spectral images. The first row shows real images, while the second row shows generated spectral images of the same categories. view at source ↗
Figure 1 (appendix)
Figure 1: Case 1. Q: In panel a, how does the bandwidth of the absorption peak at approximately 350 nm change over time (from day 0 to day 28)? view at source ↗
Figure 2 (appendix)
Figure 2: Case 2. Q: In Figure a, at approximately which wavelength is the absorption edge of the Co-CN sample (green curve) located? A: Approximately 380 nm. GPT-5: 450 nm. Gemini-2.5-pro: 500 nm. Qwen3-VL-32B-Instruct: Not visible. view at source ↗
Figure 3 (appendix)
Figure 3: Case 3. view at source ↗
Figure 4 (appendix)
Figure 4: Case 4. view at source ↗
read the original abstract

Spectra are a prevalent yet highly information-dense form of scientific imagery, presenting substantial challenges to multimodal large language models (MLLMs) due to their unstructured and domain-specific characteristics. Here we introduce SpecVQA, a professional scientific-image benchmark for evaluating multimodal models on scientific spectral understanding, covering 7 representative spectrum types with expert-annotated question-answer pairs. The aim comprises two aspects: spectra scientific QA evaluation and corresponding underlying task evaluation. SpecVQA contains 620 figures and 3100 QA pairs curated from peer-reviewed literature, targeting both direct information extraction and domain-specific reasoning. To effectively reduce token length while preserving essential curve characteristics, we propose a spectral data sampling and interpolation reconstruction approach. Ablation studies further confirm that the approach achieves substantial performance improvements on the proposed benchmark. We test the capability of prominent MLLMs in scientific spectral understanding on our benchmark and present a leaderboard. This work represents an essential step toward enhancing spectral understanding in multimodal large models and suggests promising directions for extending visual-language models to broader scientific research and data analysis.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 3 minor

Summary. The manuscript introduces SpecVQA, a benchmark of 620 scientific spectral figures and 3100 expert-annotated QA pairs drawn from peer-reviewed literature across 7 spectrum types. The benchmark targets both direct information extraction and domain-specific reasoning in spectra. The authors also propose a spectral sampling-plus-interpolation reconstruction method intended to reduce token length for MLLMs while preserving essential curve characteristics; ablation studies are reported to demonstrate substantial performance gains from this preprocessing, and a leaderboard is presented for several prominent MLLMs.

Significance. If the benchmark curation is reliable and the sampling/interpolation step demonstrably preserves the fine spectral features required by the QA pairs, the work would provide a useful new evaluation resource for multimodal models on information-dense scientific imagery. It could help identify and drive improvements in MLLM handling of domain-specific visual reasoning, an area where current models are known to struggle.

major comments (3)
  1. [§3] §3 (Spectral Sampling and Interpolation Reconstruction): The manuscript asserts that the proposed sampling and interpolation preserves 'essential curve characteristics,' yet provides no quantitative verification (e.g., pre/post comparison of peak locations, FWHM, integrated areas, or shoulder positions) that narrow features survive the process. Because many of the 3100 QA pairs require precise extraction or domain-specific reasoning, any systematic distortion would invalidate the claim that ablation gains reflect genuine spectral understanding rather than artifacts of the modified input. [A sketch of such feature checks follows the minor comments.]
  2. [§5] §5 (Experiments and Ablation Studies): The ablation results are described only at a high level; the text does not report data splits, whether the same token budget is enforced for all baselines, statistical significance of the reported gains, or error bars. Without these details it is impossible to determine whether the 'substantial performance improvements' are robust or sensitive to post-hoc choices in sampling parameters or model prompting.
  3. [§2.2] §2.2 (Dataset Curation): The description of how the 3100 expert-annotated QA pairs were generated does not include inter-annotator agreement statistics or a breakdown of question types (direct extraction vs. reasoning). This information is load-bearing for assessing whether the benchmark faithfully captures the intended capabilities.
minor comments (3)
  1. [Abstract] The abstract states that the benchmark covers '7 representative spectrum types' but the main text does not provide an explicit enumerated list or figure showing the distribution across types.
  2. [Figures] Figure captions for the example spectra should include the original resolution and the sampling parameters used in the reconstruction pipeline to allow readers to assess potential information loss.
  3. [§5] The leaderboard table would benefit from an additional column reporting the average token length after preprocessing for each model.
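
The feature checks requested in major comment 1 are straightforward to operationalize. Below is a minimal sketch using standard scipy peak routines; the nearest-peak matching, the prominence threshold, and the assumption that at least one peak survives reconstruction are illustrative choices, not the manuscript's protocol.

    import numpy as np
    from scipy.integrate import trapezoid
    from scipy.signal import find_peaks, peak_widths

    def feature_preservation(x, y_orig, y_recon, prominence=0.05):
        # Compare peak locations, FWHM, and integrated area before and after
        # the sampling/interpolation step.
        dx = float(np.mean(np.diff(x)))
        peaks_o, _ = find_peaks(y_orig, prominence=prominence)
        peaks_r, _ = find_peaks(y_recon, prominence=prominence)
        # Match each original peak to the nearest reconstructed peak (assumption).
        pairs = [(i, peaks_r[np.argmin(np.abs(x[peaks_r] - x[i]))]) for i in peaks_o]
        loc_mae = float(np.mean([abs(x[i] - x[j]) for i, j in pairs]))
        fwhm_o = peak_widths(y_orig, [i for i, _ in pairs], rel_height=0.5)[0] * dx
        fwhm_r = peak_widths(y_recon, [j for _, j in pairs], rel_height=0.5)[0] * dx
        area_o, area_r = trapezoid(y_orig, x), trapezoid(y_recon, x)
        return {
            "peak_location_mae": loc_mae,                   # in x-axis units
            "fwhm_ratio": float(np.mean(fwhm_r / fwhm_o)),  # ~1.0 means preserved
            "area_pct_diff": 100.0 * abs(area_r - area_o) / abs(area_o),
        }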

Simulated Author's Rebuttal

3 responses · 0 unresolved

We are grateful to the referee for the thorough review and valuable suggestions that will help strengthen our paper. Below we respond to each major comment in turn, indicating the changes we will make to the manuscript.

read point-by-point responses
  1. Referee: [§3] §3 (Spectral Sampling and Interpolation Reconstruction): The manuscript asserts that the proposed sampling and interpolation preserves 'essential curve characteristics,' yet provides no quantitative verification (e.g., pre/post comparison of peak locations, FWHM, integrated areas, or shoulder positions) that narrow features survive the process. Because many of the 3100 QA pairs require precise extraction or domain-specific reasoning, any systematic distortion would invalidate the claim that ablation gains reflect genuine spectral understanding rather than artifacts of the modified input.

    Authors: We acknowledge the referee's concern regarding the lack of direct quantitative validation for feature preservation. Although the ablation studies show consistent improvements across different MLLMs and question types, suggesting the method aids in retaining useful information, we agree that explicit metrics are necessary to rule out distortions. In the revised manuscript, we will add quantitative comparisons using peak detection to measure preservation of peak locations (mean absolute error), FWHM (ratio), integrated areas (percentage difference), and shoulder positions on a sampled set of spectra. This will be presented in a new table or figure in §3. revision: yes

  2. Referee: [§5] §5 (Experiments and Ablation Studies): The ablation results are described only at a high level; the text does not report data splits, whether the same token budget is enforced for all baselines, statistical significance of the reported gains, or error bars. Without these details it is impossible to determine whether the 'substantial performance improvements' are robust or sensitive to post-hoc choices in sampling parameters or model prompting.

    Authors: We thank the referee for pointing out these omissions in the experimental details. As SpecVQA is a benchmark for evaluation rather than training, there are no traditional data splits; however, we will report results broken down by spectrum type and question category. We will clarify that all compared methods operate under the same token budget constraint by using equivalent numbers of sampled points. In the revision, we will include error bars based on standard deviation over multiple prompt templates and report p-values from statistical tests (e.g., Wilcoxon signed-rank test) for the performance differences. Sampling parameters will be detailed with sensitivity analysis. [A sketch of this significance test follows these responses.] revision: yes

  3. Referee: [§2.2] §2.2 (Dataset Curation): The description of how the 3100 expert-annotated QA pairs were generated does not include inter-annotator agreement statistics or a breakdown of question types (direct extraction vs. reasoning). This information is load-bearing for assessing whether the benchmark faithfully captures the intended capabilities.

    Authors: We agree that these details are crucial for establishing benchmark quality. In the revised §2.2, we will report inter-annotator agreement statistics computed on a double-annotated subset of the QA pairs. We will also provide a breakdown of question types, distinguishing between direct information extraction and domain-specific reasoning questions, including the respective proportions and illustrative examples. This will allow readers to better assess the benchmark's focus on both capabilities. revision: yes
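
Response 2 commits to error bars over prompt templates and a Wilcoxon signed-rank test. A minimal sketch of that comparison, with synthetic accuracies standing in for real per-template results:

    import numpy as np
    from scipy.stats import wilcoxon

    rng = np.random.default_rng(0)
    n_templates = 10  # hypothetical number of prompt templates

    # Paired per-template accuracies: raw spectra vs. sampled-and-reconstructed input.
    acc_raw = rng.normal(0.52, 0.03, n_templates)
    acc_sampled = acc_raw + rng.normal(0.06, 0.02, n_templates)

    stat, p_value = wilcoxon(acc_sampled, acc_raw)  # paired, non-parametric
    gain = acc_sampled - acc_raw
    print(f"mean gain = {gain.mean():.3f} ± {gain.std(ddof=1):.3f}, p = {p_value:.4f}")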

Circularity Check

0 steps flagged

No circularity: empirical benchmark with heuristic preprocessing.

full rationale

The paper introduces a new dataset (620 figures, 3100 expert-annotated QA pairs) and a practical preprocessing heuristic (spectral sampling plus interpolation) whose only claim is token reduction while preserving curve features. Ablation studies report empirical performance gains on this same benchmark; these are direct measurements, not predictions derived from fitted parameters or self-referential definitions. No equations, uniqueness theorems, ansatzes, or self-citations are used as load-bearing steps in any derivation chain. The work is therefore self-contained as an empirical contribution and does not reduce any result to its inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No mathematical derivations, fitted constants, or new physical entities are introduced; the contribution rests on dataset curation and a practical heuristic for curve representation.

pith-pipeline@v0.9.0 · 5503 in / 1150 out tokens · 43549 ms · 2026-05-07T07:30:16.822864+00:00 · methodology

