Towards Characterizing Scientific Image Utility and Upgradability

Chunyi Li; Farong Wen; Guangtao Zhai; Junying Wang; Liang Chen; Qihang Yan; Wenzhe Li; Yijin Guo; Zicheng Zhang

REVIEW 2 major objections 1 minor 1 cited by

Current multimodal systems cannot reliably detect scientific errors in images or generate faithful corrections, revealing a gap between visual perception and scientific validity.

Reviewed by Pith at T0; open to challenge. T0 means a machine referee read the full paper against a public rubric. the ladder, T0–T4 →

Challenge this review Re-run · record.json Download PDF Read on arXiv ↗

T0 review · grok-4.3

2026-06-28 11:08 UTC pith:HN2OWY7W

load-bearing objection The paper sketches a SIU²A framework and benchmark for AI errors in scientific images but the abstract supplies no numbers or validation to support the gap claim. the 2 major comments →

arxiv 2606.03401 v1 pith:HN2OWY7W submitted 2026-06-02 cs.CV

Towards Characterizing Scientific Image Utility and Upgradability

WenZhe Li , Qihang Yan , Liang Chen , Junying Wang , Farong Wen , Yijin Guo , Chunyi Li , Zicheng Zhang

show 1 more author

Guangtao Zhai

This is my paper

classification cs.CV

keywords scientific image evaluationmultimodal modelserror detectionimage corruption taxonomycorrection feasibilityAI-generated contentscientific validitybenchmark dataset

verification ladder T0 review T1 audit T2 compute T3 formal T4 reserved

The pith

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes the SIU²A framework to evaluate scientific images on two axes: utility, which measures the ability to detect errors and judge whether they can be repaired, and upgradability, which measures whether a correction actually restores scientific accuracy without damaging correct parts. It defines four corruption types—Detail Distortion, Incompleteness, False Content, and Entity Confusion—and releases an expert-annotated benchmark that tests both detection and repair. Experiments on this benchmark show that existing multimodal models perform poorly on both error identification and faithful correction. This matters because scientific images function as primary evidence in research, and undetected or poorly fixed errors can propagate false findings. The work therefore supplies a concrete way to quantify how far current AI systems remain from handling scientific visual data in a trustworthy manner.

Core claim

The central claim is that the SIU²A framework, built on a four-category taxonomy of scientific image corruptions and an expert-annotated benchmark, exposes clear limitations in current multimodal systems: they fail to detect scientific inaccuracies and fail to produce corrections that preserve scientific validity, demonstrating a separation between general visual perception and domain-specific scientific usability.

What carries the argument

The SIU²A framework, which splits evaluation into a Utility stage (error detection plus repair-instruction generation) and an Upgradability stage (whether the resulting correction restores validity without altering accurate information), applied to the four corruption categories on the expert-annotated SIU²A-Benchmark.

Load-bearing premise

The four corruption categories form a complete taxonomy of scientific image issues, and expert annotations on the benchmark reliably capture scientific validity.

What would settle it

A multimodal model that scores near ceiling on the SIU²A-Benchmark error-detection and repair tasks yet still produces scientifically invalid corrected images when applied to real research figures would falsify the claim that the benchmark measures scientific usability.

Watch this falsifier — get emailed when new claim-graph text bears on it.

If this is right

Perceptual quality metrics do not track scientific validity, so new evaluation methods are required.
Multimodal models require domain-specific verification capabilities to handle scientific images.
Faithful correction must preserve accurate information while repairing errors.
The benchmark provides a standardized test for measuring progress on scientific image tasks.
Current systems exhibit a measurable gap between visual perception and scientific usability.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The framework could be applied to evaluate AI assistance in figure preparation for scientific papers.
Training data that includes explicit scientific validity labels might close the observed gap.
Similar taxonomies and benchmarks could be developed for other scientific modalities such as diagrams or plots.
Automated tools built on this approach might eventually flag questionable figures during peer review.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit.

Desk Editor's Note

The paper sketches a SIU²A framework and benchmark for AI errors in scientific images but the abstract supplies no numbers or validation to support the gap claim.

read the letter

The core point is that this work defines utility (error detection plus correction feasibility) and upgradability (quality of the fix) as two axes for judging scientific images, then builds a benchmark around four corruption types. That framing is new relative to standard perceptual metrics.

The authors correctly note that existing quality scores do not track scientific validity and that general multimodal models lack domain-specific checks. The taxonomy (Detail Distortion, Incompleteness, False Content, Entity Confusion) and the two-stage protocol give a concrete structure that could guide later benchmarks.

The soft spot is the complete absence of results. The abstract states that experiments reveal significant limitations and a fundamental gap, yet it reports no accuracy numbers, no dataset size, no inter-rater agreement on the expert annotations, and no test of whether the four categories are exhaustive or non-overlapping. Without those details the central claim rests on an unshown benchmark.

The stress-test concern holds: if the taxonomy misses common issues such as calibration artifacts or if annotations are noisy, any observed performance gap could be an artifact of the test set rather than a real system limitation. The paper does not address that risk.

This is for computer-vision groups building evaluation tools for scientific content or for publishers tracking image integrity. A reader who wants structured benchmarks in this area could extract useful categories and protocol ideas, but would still need the missing experimental section to judge the claims.

Send it to peer review only if the full manuscript adds quantitative results, taxonomy validation, and annotation reliability metrics; otherwise it stays a proposal.

Referee Report

2 major / 1 minor

Summary. The paper proposes the SIU²A framework to evaluate scientific images along utility (error detection and correction feasibility) and upgradability (correction quality) dimensions. It introduces a four-category taxonomy of corruptions (Detail Distortion, Incompleteness, False Content, Entity Confusion), constructs the SIU²A-Benchmark with expert annotations, and describes a two-stage evaluation protocol. Experiments are claimed to show that current multimodal systems have significant limitations in scientific error assessment and faithful correction, revealing a fundamental gap between visual perception and scientific usability.

Significance. If the taxonomy, benchmark, and experimental results hold after validation, the work could establish a domain-specific evaluation paradigm for AI handling of scientific imagery that goes beyond perceptual metrics, potentially guiding development of more reliable multimodal systems for research communication.

major comments (2)

[Abstract] Abstract: The central claim that 'experiments reveal that current multimodal systems exhibit significant limitations... exposing a fundamental gap' is unsupported because the abstract (and by extension the manuscript) supplies no quantitative results, error metrics, dataset statistics, validation procedures, or inter-rater reliability scores for the expert annotations. This directly undermines the load-bearing experimental evidence for the claimed gap.
[Abstract] Abstract (taxonomy and benchmark construction): The four corruption categories are asserted to be 'fundamental' and used to build SIU²A-Benchmark with expert annotations for error identification and repair, yet no coverage analysis, overlap assessment, or validation that the taxonomy is complete (e.g., versus calibration artifacts or modality-specific noise) is provided. Without this, performance gaps on the benchmark may reflect construction artifacts rather than inherent system limitations.

minor comments (1)

[Abstract] The abstract introduces the acronym SIU²A but does not expand it on first use in a manner consistent with standard academic style.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and outline revisions to strengthen the presentation of quantitative evidence and taxonomy validation.

read point-by-point responses

Referee: [Abstract] Abstract: The central claim that 'experiments reveal that current multimodal systems exhibit significant limitations... exposing a fundamental gap' is unsupported because the abstract (and by extension the manuscript) supplies no quantitative results, error metrics, dataset statistics, validation procedures, or inter-rater reliability scores for the expert annotations. This directly undermines the load-bearing experimental evidence for the claimed gap.

Authors: We agree the abstract is high-level and omits specific metrics. The full manuscript's Experiments section reports quantitative results (system error rates on detection and correction tasks, SIU²A-Benchmark statistics, and inter-rater reliability scores such as Cohen's kappa for annotations). To address the concern directly, we will revise the abstract to incorporate key quantitative highlights supporting the claimed gap. revision: yes
Referee: [Abstract] Abstract (taxonomy and benchmark construction): The four corruption categories are asserted to be 'fundamental' and used to build SIU²A-Benchmark with expert annotations for error identification and repair, yet no coverage analysis, overlap assessment, or validation that the taxonomy is complete (e.g., versus calibration artifacts or modality-specific noise) is provided. Without this, performance gaps on the benchmark may reflect construction artifacts rather than inherent system limitations.

Authors: The taxonomy was developed via expert consultation on prevalent scientific image issues. We acknowledge the absence of explicit coverage analysis and completeness validation in the current draft. We will add a dedicated subsection describing taxonomy construction, domain coverage, category overlap assessment, and checks against additional corruption types (e.g., calibration artifacts) to confirm the benchmark reflects genuine system limitations rather than construction artifacts. revision: yes

Circularity Check

0 steps flagged

No circularity: framework and benchmark proposal is self-contained

full rationale

The paper introduces the SIU²A framework and SIU²A-Benchmark as a new taxonomy-based evaluation approach for scientific images. No equations, fitted parameters, or predictions appear that reduce to inputs by construction. The four corruption categories are presented as a proposed taxonomy rather than derived from prior results or self-citations. Experiments evaluate multimodal systems on the new benchmark without any self-referential fitting or renaming of known results. This matches the default expectation of a non-circular proposal paper.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 3 invented entities

The paper introduces a new evaluation framework and associated concepts; the ledger captures the domain assumption motivating the work and the newly postulated framework elements.

axioms (1)

domain assumption Existing evaluation paradigms prove inadequate: perceptual quality metrics poorly correlate with scientific validity, while language models lack domain-specific verification capabilities.
Directly stated in the abstract as the basis for proposing the new framework.

invented entities (3)

SIU²A framework no independent evidence
purpose: Evaluate scientific image utility (error detection and correction feasibility) and upgradability (correction quality)
Newly proposed in the paper.
Four corruption types (Detail Distortion, Incompleteness, False Content, Entity Confusion) no independent evidence
purpose: Categorize scientific image corruptions for systematic evaluation
Introduced as fundamental types in the paper.
SIU²A-Benchmark no independent evidence
purpose: Dataset with expert annotations for testing error identification and repair
Constructed based on the new taxonomy in this work.

pith-pipeline@v0.9.1-grok · 5812 in / 1456 out tokens · 48406 ms · 2026-06-28T11:08:26.927679+00:00 · methodology

0 comments

read the original abstract

Scientific images function as critical evidence in research communication, yet their integrity faces unprecedented threats from AI-generated content that introduces subtle but consequential errors. Existing evaluation paradigms prove inadequate: perceptual quality metrics poorly correlate with scientific validity, while language models lack domain-specific verification capabilities. To address this gap, we propose the \textbf{S}cientific \textbf{I}mage \textbf{U}tility and \textbf{U}pgradability \textbf{A}ssessment (\textbf{SIU$^2$A}) framework, which introduces two complementary dimensions for scientific image evaluation. \textbf{Utility} encompasses \textit{error detection} (identifying scientific inaccuracies) and \textit{correction feasibility} (assessing whether errors can be reliably repaired). \textbf{Upgradability} measures the quality of correction. We categorize scientific image corruption into four fundamental types: Detail Distortion, Incompleteness, False Content, and Entity Confusion. Based on this taxonomy, we construct SIU$^2$A-Benchmark, a dataset with expert annotations for error identification and repair. The framework implements a two-stage evaluation protocol: the \textit{Utility} stage evaluates error detection capability and repair instruction generation, while the \textit{Upgradability} stage assesses whether corrections faithfully restore scientific validity without compromising existing accurate information. Experiments reveal that current multimodal systems exhibit significant limitations in both scientific error assessment and faithful correction, exposing a fundamental gap between visual perception and scientific usability.

Figures

Figures reproduced from arXiv: 2606.03401 by Chunyi Li, Farong Wen, Guangtao Zhai, Junying Wang, Liang Chen, Qihang Yan, Wenzhe Li, Yijin Guo, Zicheng Zhang.

**Figure 2.** Figure 2: The SIU2A framework for scientific image assessment: (a) utility via error detection and correction feasibility, (b) upgradability through a diagnosis-to-correction pipeline, and (c) comparative model performance. 3.1 SIU2A Definition Scientific Images Failure Summary We summarize the common failure modes in scientific images into four structurally distinct categories: (i) Detail Distortion, where low-leve… view at source ↗

**Figure 3.** Figure 3: Pipeline for constructing the SIU2A-Benchmark dataset, including high-quality scientific image filtering, controlled degradation generation, and expert annotation. 3.2 SIU2A-Benchmark Construction. Base Image Collection To support the above formulation, we construct SIU2A-Benchmark, a dataset that jointly evaluates diagnosis, instruction generation, and editing under controlled scientific corruptions as sh… view at source ↗

**Figure 4.** Figure 4: Overview of the SIU2A data structure. Each instance contains a ground-truth image, a corrupted image, error detection and correction feasibility labels, structured error descriptions, correction instructions, and a corresponding scientific QA pair. disentangles functional correctness (task completion) from semantic faithfulness (scientific validity preservation), enabling a comprehensive evaluation of both… view at source ↗

**Figure 5.** Figure 5: Ablation study on the impact of error description quality on correction performance. We [PITH_FULL_IMAGE:figures/full_fig_p008_5.png] view at source ↗

**Figure 6.** Figure 6: Ablation study on upgradability: comparing ground-truth versus predicted correction instructions to assess their impact on editing performance. Upgradability Dependence on Correction Instruction Quality for Advanced Model The ablation results in [PITH_FULL_IMAGE:figures/full_fig_p009_6.png] view at source ↗

**Figure 7.** Figure 7: Our custom annotation interface. The tool presents the scientific figure to the expert [PITH_FULL_IMAGE:figures/full_fig_p017_7.png] view at source ↗

**Figure 8.** Figure 8: Utility 1 — Error Detection. The figure displays the input images alongside the groundtruth expert annotations and the outputs from eleven diagnostic VLMs. Each model’s prediction is encoded with a color bar (green for Detect: YES, red for Detect: NO), allowing for immediate assessment of detection accuracy against the known ground truth. 18 [PITH_FULL_IMAGE:figures/full_fig_p018_8.png] view at source ↗

**Figure 9.** Figure 9: Utility 2 — Correction Feasiblity (page 1 of 2). Results for the first five diagnostic models. The full-width gold EXPERT row reproduces the human GT-Instruction as a reference. A predicted no-error chip replaces a missing instruction when a model declined to flag an error. 19 [PITH_FULL_IMAGE:figures/full_fig_p019_9.png] view at source ↗

**Figure 10.** Figure 10: Utility 2 — Correction Feasiblity (page 2 of 2). Results for the remaining six diagnostic models, with identical layout and semantics as [PITH_FULL_IMAGE:figures/full_fig_p020_10.png] view at source ↗

**Figure 11.** Figure 11: Upgradability — Image Restoration (page 1 of 2). Results for the first five image editing models. The PROMPT band surfaces both prompts side-by-side. The OmniGen-2 Pred-cond cell carries an instruction too long tag, indicating its sensitivity to instruction length. H Additional Ablations This section presents two key ablation studies that dissect the performance of our SIU2A framework. The first study iso… view at source ↗

**Figure 12.** Figure 12: Upgradability — Image Restoration (page 2 of 2). Results for the remaining four image editing models, with identical layout and semantics as [PITH_FULL_IMAGE:figures/full_fig_p022_12.png] view at source ↗

discussion (0)

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

SciFigAlign: Scoring Scientific Figures by Fine-tuned Alignment of Visuals with Manuscript Evidence
cs.CV 2026-07 conditional novelty 6.0

A fine-tuned CLIP+SciBERT multimodal scorer with manuscript-bound inputs and within-paper ranking reaches MAE 0.35 and 82% pairwise accuracy on held-out conference figures, far above LLM judges.

Reference graph

Works this paper leans on

45 extracted references · 20 canonical work pages · cited by 1 Pith paper · 9 internal anchors

[1]

Introducing claude opus 4.6, Feb

Anthropic. Introducing claude opus 4.6, Feb. 2026. URL https://www.anthropic.com/ news/claude-opus-4-6. Accessed: 2026-04-23

2026
[2]

Bosse, D

S. Bosse, D. Maniry, K.-R. Müller, T. Wiegand, and W. Samek. Deep neural networks for no- reference and full-reference image quality assessment.IEEE Transactions on Image Processing, 27(1):206–219, 2018. doi: 10.1109/TIP.2017.2760518

work page doi:10.1109/tip.2017.2760518 2018
[3]

InstructPix2Pix: Learning to follow image editing instructions.arXiv preprint arXiv:2211.09800, 2022

T. Brooks, A. Holynski, and A. A. Efros. Instructpix2pix: Learning to follow image editing instructions.arXiv preprint arXiv:2211.09800, 2022

work page arXiv 2022
[4]

ByteDance.Seed2.0 Model Card: Towards Intelligence Frontier for Real-World Complexity, Feb. 2026. URL https://lf3-static.bytednsdoc.com/obj/eden-cn/lapzild-tss/ ljhwZthlaukjlkulzlp/seed2/0214/Seed2.0%20Model%20Card.pdf. Official model card. Accessed: 2026-04-23

2026
[5]

H. Cao, Z. Liu, X. Lu, Y . Yao, and Y . Li. Instructmol: Multi-modal integration for building a versatile and reliable molecular assistant in drug discovery. InProceedings of the 31st International Conference on Computational Linguistics, pages 354–379, 2025

2025
[6]

Gemini 3 pro image generation model

Google. Gemini 3 pro image generation model. https://aistudio.google.com/models/ gemini-3-pro-image, 2026. Accessed: 2026-04-30

2026
[7]

Gemini 2.5 flash image, 2025

Google DeepMind. Gemini 2.5 flash image, 2025. URL https://ai.google.dev/ gemini-api/docs/models/gemini. Accessed: 2026-04-23

2025
[8]

Gemini 3.1 pro preview, 2026

Google DeepMind. Gemini 3.1 pro preview, 2026. URL https://ai.google.dev/ gemini-api/docs/models/gemini-3.1-pro-preview. Accessed: 2026-04-23

2026
[9]

B. F. Labs. FLUX.2: Frontier Visual Intelligence.https://bfl.ai/blog/flux-2, 2025

2025
[10]

J. Li, D. Zhang, X. Wang, Z. Hao, J. Lei, Q. Tan, C. Zhou, W. Liu, Y . Yang, X. Xiong, et al. Chemvlm: Exploring the power of multimodal large language models in chemistry area. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 415–423, 2025

2025
[11]

W. Li, L. Chen, J. Wang, Y . Guo, Y . Shen, F. Wen, C. Li, Z. Zhang, and G. Zhai. Siqa: Toward reliable scientific image quality assessment.arXiv preprint arXiv:2603.06700, 2026

work page arXiv 2026
[12]

M. Liu, Z. Fan, Z. Wang, L. Gu, Z. Zhu, Y . He, Y . Yang, C. Tian, X. Zhao, N. Liao, et al. Grade: Benchmarking discipline-informed reasoning in image editing.arXiv preprint arXiv:2603.12264, 2026

work page arXiv 2026
[13]

S. Liu, Y . Han, P. Xing, F. Yin, R. Wang, W. Cheng, J. Liao, Y . Wang, H. Fu, C. Han, G. Li, Y . Peng, Q. Sun, J. Wu, Y . Cai, Z. Ge, R. Ming, L. Xia, X. Zeng, Y . Zhu, B. Jiao, X. Zhang, G. Yu, and D. Jiang. Step1x-edit: A practical framework for general image editing.arXiv preprint arXiv:2504.17761, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[14]

P. Lu, S. Mishra, T. Xia, L. Qiu, K.-W. Chang, S.-C. Zhu, O. Tafjord, P. Clark, and A. Kalyan. Learn to explain: Multimodal reasoning via thought chains for science question answering. In The 36th Conference on Neural Information Processing Systems (NeurIPS), 2022

2022
[15]

T. Lv, Y . Huang, J. Chen, Y . Zhao, Y . Jia, L. Cui, S. Ma, Y . Chang, S. Huang, W. Wang, et al. Kosmos-2.5: A multimodal literate model.arXiv preprint arXiv:2309.11419, 2023

work page Pith review arXiv 2023
[16]

Llama 3.2 model cards and prompt formats

Meta. Llama 3.2 model cards and prompt formats. https://www.llama.com/docs/ model-cards-and-prompt-formats/llama3$_$2/, 2025

2025
[17]

Introducing gpt-image-1.5, Dec

OpenAI. Introducing gpt-image-1.5, Dec. 2025. URL https://openai.com/zh-Hans-CN/ index/new-chatgpt-images-is-here/. Accessed: 2026-04-23

2025
[18]

Gpt-image-2: A multimodal image generation model

OpenAI. Gpt-image-2: A multimodal image generation model. https://openai.com, 2025. Proprietary model, accessed 2026. 10

2025
[19]

Introducing gpt-5.4, 2026

OpenAI. Introducing gpt-5.4, 2026. URL https://openai.com/zh-Hans-CN/index/ introducing-gpt-5-4/. Accessed: 2026-04-23

2026
[20]

Qwen3.5: Towards native multimodal agents, February 2026

Qwen Team. Qwen3.5: Towards native multimodal agents, February 2026. URL https: //qwen.ai/blog?id=qwen3.5

2026
[21]

Qwen3.6-Plus: Towards real world agents, April 2026

Qwen Team. Qwen3.6-Plus: Towards real world agents, April 2026. URL https://qwen.ai/ blog?id=qwen3.6

2026
[22]

Rombach, A

R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer. High-resolution image synthesis with latent diffusion models, 2021

2021
[23]

S. Su, Q. Yan, Y . Zhu, C. Zhang, X. Ge, J. Sun, and Y . Zhang. Blindly assess image quality in the wild guided by a self-adaptive hyper network. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2020

2020
[24]

H. Tao, C. Huang, N. Wang, H. Lyu, L. Zhang, G. Ke, and X. Fang. Omniscience: A large-scale multi-modal dataset for scientific image understanding.arXiv preprint arXiv:2602.13758, 2026

work page arXiv 2026
[25]

Galactica: A Large Language Model for Science

R. Taylor, M. Kardas, G. Cucurull, T. Scialom, A. Hartshorn, E. Saravia, A. Poulton, V . Kerkez, and R. Stojnic. Galactica: A large language model for science.arXiv preprint arXiv:2211.09085, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[26]

K. Team, T. Bai, Y . Bai, Y . Bao, S. Cai, Y . Cao, Y . Charles, H. Che, C. Chen, G. Chen, et al. Kimi k2. 5: Visual agentic intelligence.arXiv preprint arXiv:2602.02276, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[27]

V . Team, W. Hong, W. Yu, X. Gu, G. Wang, G. Gan, H. Tang, J. Cheng, J. Qi, J. Ji, L. Pan, S. Duan, W. Wang, Y . Wang, Y . Cheng, Z. He, Z. Su, Z. Yang, Z. Pan, A. Zeng, B. Wang, B. Chen, B. Shi, C. Pang, C. Zhang, D. Yin, F. Yang, G. Chen, J. Xu, J. Zhu, J. Chen, J. Chen, J. Chen, J. Lin, J. Wang, J. Chen, L. Lei, L. Gong, L. Pan, M. Liu, M. Xu, M. Zhang...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[28]

J. Wang, J. Wang, H. Duan, J. Kang, G. Zhai, and X. Min. I2i-bench: A comprehensive benchmark suite for image-to-image editing models.arXiv preprint arXiv:2512.04660, 2025

work page arXiv 2025
[29]

W. Wang, Z. Gao, L. Gu, H. Pu, L. Cui, X. Wei, Z. Liu, L. Jing, S. Ye, J. Shao, et al. Internvl3. 5: Advancing open-source multimodal models in versatility, reasoning, and efficiency.arXiv preprint arXiv:2508.18265, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[30]

X. Wang, Z. Hu, P. Lu, Y . Zhu, J. Zhang, S. Subramaniam, A. R. Loomba, S. Zhang, Y . Sun, and W. Wang. SciBench: Evaluating College-Level Scientific Problem-Solving Abilities of Large Language Models. InProceedings of the Forty-First International Conference on Machine Learning, 2024

2024
[31]

Z. Wang, P. Yin, X. Zhao, C. Tian, Y . Qiao, W. Wang, J. Dai, and G. Luo. Genexam: A multidisciplinary text-to-image exam.arXiv preprint arXiv:2509.14232, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[32]

H. Wei, H. Liu, Z. Wang, Y . Peng, B. Xu, S. Wu, X. Zhang, X. He, Z. Liu, P. Wang, X. Song, Y . Li, Y . Liu, and Y . Zhou. Skywork unipic 3.0: Unified multi-image composition via sequence modeling, 2026. URLhttps://arxiv.org/abs/2601.15664

work page arXiv 2026
[33]

Y . Weng, M. Zhu, Q. Xie, Q. Sun, Z. Lin, S. Liu, and Y . Zhang. Deepscientist: Advancing frontier-pushing scientific findings progressively. InThe Fourteenth International Conference on Learning Representations, 2026. URL https://openreview.net/forum?id=cZFgsLq8Gs

2026
[34]

C. Wu, J. Li, J. Zhou, J. Lin, K. Gao, K. Yan, S. ming Yin, S. Bai, X. Xu, Y . Chen, Y . Chen, Z. Tang, Z. Zhang, Z. Wang, A. Yang, B. Yu, C. Cheng, D. Liu, D. Li, H. Zhang, H. Meng, H. Wei, J. Ni, K. Chen, K. Cao, L. Peng, L. Qu, M. Wu, P. Wang, S. Yu, T. Wen, W. Feng, X. Xu, Y . Wang, Y . Zhang, Y . Zhu, Y . Wu, Y . Cai, and Z. Liu. Qwen-image technical report,
[35]

URLhttps://arxiv.org/abs/2508.02324. 11

work page internal anchor Pith review Pith/arXiv arXiv
[36]

C. Wu, P. Zheng, R. Yan, S. Xiao, X. Luo, Y . Wang, W. Li, X. Jiang, Y . Liu, J. Zhou, Z. Liu, Z. Xia, C. Li, H. Deng, J. Wang, K. Luo, B. Zhang, D. Lian, X. Wang, Z. Wang, T. Huang, and Z. Liu. Omnigen2: Exploration to advanced multimodal generation.arXiv preprint arXiv:2506.18871, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[37]

H. Wu, Z. Zhang, W. Zhang, C. Chen, L. Liao, C. Li, Y . Gao, A. Wang, E. Zhang, W. Sun, et al. Q-align: Teaching lmms for visual scoring via discrete text-defined levels.arXiv preprint arXiv:2312.17090, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[38]

Grok-4.20-0309-reasoning

xAI. Grok-4.20-0309-reasoning. https://docs.x.ai/developers/models/grok-4. 20-0309-reasoning, 2026. Accessed: 2026-04-24

2026
[39]

Z. Xi, G. Li, Y . Fan, H. Guo, Y . Liu, X. Fan, J. Liu, J. Ding, W. Zuo, Z. Yin, L. Bai, T. Ji, T. Gui, Q. Zhang, and X. Huang. Bmmr: A large-scale bilingual multimodal multi-discipline reasoning dataset, 2025. URLhttps://arxiv.org/abs/2507.03483

work page arXiv 2025
[40]

Z. Xu, H. Duan, B. Liu, G. Ma, J. Wang, L. Yang, S. Gao, X. Wang, J. Wang, X. Min, et al. Lmm4edit: Benchmarking and evaluating multimodal image editing with lmms. InProceedings of the 33rd ACM International Conference on Multimedia, pages 6908–6917, 2025

2025
[41]

Zhang, H

Z. Zhang, H. Wu, C. Li, Y . Zhou, W. Sun, X. Min, Z. Chen, X. Liu, W. Lin, and G. Zhai. A- bench: Are lmms masters at evaluating ai-generated images?arXiv preprint arXiv:2406.03070, 2024

work page arXiv 2024
[42]

Zhang, T

Z. Zhang, T. Kou, S. Wang, C. Li, W. Sun, W. Wang, X. Li, Z. Wang, X. Cao, X. Min, et al. Q-eval-100k: Evaluating visual quality and alignment level for text-to-vision content. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 10621–10631, 2025

2025
[43]

Zhang, J

Z. Zhang, J. Wang, F. Wen, Y . Guo, et al. Large multimodal models evaluation: A survey. SCIENCE CHINA Information Sciences, 68(12):221301–221369, 2025. doi: https://doi.org/10. 1007/s11432-025-4676-4

2025
[44]

found": true,

Z. Zhao, D. Ma, L. Chen, L. Sun, Z. Li, Y . Xia, B. Chen, H. Xu, Z. Zhu, S. Zhu, et al. Chemdfm: a large language foundation model for chemistry.arXiv preprint arXiv:2401.14818, 2024. A Limitations This work introduces a novel evaluation framework (SIU2A), formulates a new task, and constructs a corresponding benchmark dataset. However, it does not propos...

work page arXiv 2024
[45]

Proteasomal degradation

Add an arrow originating from the polyubiquitinated Ino80 (the Ino80 molecule with the chain of 4 Ub moieties attached at the top left of the figure) pointing to a new label that reads 'Proteasomal degradation of ubiquitinated Ino80' to include the missing degradation step. 2. In the replication fork region, add the text label 'H2Aub' adjacent to each Ub ...

[1] [1]

Introducing claude opus 4.6, Feb

Anthropic. Introducing claude opus 4.6, Feb. 2026. URL https://www.anthropic.com/ news/claude-opus-4-6. Accessed: 2026-04-23

2026

[2] [2]

Bosse, D

S. Bosse, D. Maniry, K.-R. Müller, T. Wiegand, and W. Samek. Deep neural networks for no- reference and full-reference image quality assessment.IEEE Transactions on Image Processing, 27(1):206–219, 2018. doi: 10.1109/TIP.2017.2760518

work page doi:10.1109/tip.2017.2760518 2018

[3] [3]

InstructPix2Pix: Learning to follow image editing instructions.arXiv preprint arXiv:2211.09800, 2022

T. Brooks, A. Holynski, and A. A. Efros. Instructpix2pix: Learning to follow image editing instructions.arXiv preprint arXiv:2211.09800, 2022

work page arXiv 2022

[4] [4]

ByteDance.Seed2.0 Model Card: Towards Intelligence Frontier for Real-World Complexity, Feb. 2026. URL https://lf3-static.bytednsdoc.com/obj/eden-cn/lapzild-tss/ ljhwZthlaukjlkulzlp/seed2/0214/Seed2.0%20Model%20Card.pdf. Official model card. Accessed: 2026-04-23

2026

[5] [5]

H. Cao, Z. Liu, X. Lu, Y . Yao, and Y . Li. Instructmol: Multi-modal integration for building a versatile and reliable molecular assistant in drug discovery. InProceedings of the 31st International Conference on Computational Linguistics, pages 354–379, 2025

2025

[6] [6]

Gemini 3 pro image generation model

Google. Gemini 3 pro image generation model. https://aistudio.google.com/models/ gemini-3-pro-image, 2026. Accessed: 2026-04-30

2026

[7] [7]

Gemini 2.5 flash image, 2025

Google DeepMind. Gemini 2.5 flash image, 2025. URL https://ai.google.dev/ gemini-api/docs/models/gemini. Accessed: 2026-04-23

2025

[8] [8]

Gemini 3.1 pro preview, 2026

Google DeepMind. Gemini 3.1 pro preview, 2026. URL https://ai.google.dev/ gemini-api/docs/models/gemini-3.1-pro-preview. Accessed: 2026-04-23

2026

[9] [9]

B. F. Labs. FLUX.2: Frontier Visual Intelligence.https://bfl.ai/blog/flux-2, 2025

2025

[10] [10]

J. Li, D. Zhang, X. Wang, Z. Hao, J. Lei, Q. Tan, C. Zhou, W. Liu, Y . Yang, X. Xiong, et al. Chemvlm: Exploring the power of multimodal large language models in chemistry area. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 415–423, 2025

2025

[11] [11]

W. Li, L. Chen, J. Wang, Y . Guo, Y . Shen, F. Wen, C. Li, Z. Zhang, and G. Zhai. Siqa: Toward reliable scientific image quality assessment.arXiv preprint arXiv:2603.06700, 2026

work page arXiv 2026

[12] [12]

M. Liu, Z. Fan, Z. Wang, L. Gu, Z. Zhu, Y . He, Y . Yang, C. Tian, X. Zhao, N. Liao, et al. Grade: Benchmarking discipline-informed reasoning in image editing.arXiv preprint arXiv:2603.12264, 2026

work page arXiv 2026

[13] [13]

S. Liu, Y . Han, P. Xing, F. Yin, R. Wang, W. Cheng, J. Liao, Y . Wang, H. Fu, C. Han, G. Li, Y . Peng, Q. Sun, J. Wu, Y . Cai, Z. Ge, R. Ming, L. Xia, X. Zeng, Y . Zhu, B. Jiao, X. Zhang, G. Yu, and D. Jiang. Step1x-edit: A practical framework for general image editing.arXiv preprint arXiv:2504.17761, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[14] [14]

P. Lu, S. Mishra, T. Xia, L. Qiu, K.-W. Chang, S.-C. Zhu, O. Tafjord, P. Clark, and A. Kalyan. Learn to explain: Multimodal reasoning via thought chains for science question answering. In The 36th Conference on Neural Information Processing Systems (NeurIPS), 2022

2022

[15] [15]

T. Lv, Y . Huang, J. Chen, Y . Zhao, Y . Jia, L. Cui, S. Ma, Y . Chang, S. Huang, W. Wang, et al. Kosmos-2.5: A multimodal literate model.arXiv preprint arXiv:2309.11419, 2023

work page Pith review arXiv 2023

[16] [16]

Llama 3.2 model cards and prompt formats

Meta. Llama 3.2 model cards and prompt formats. https://www.llama.com/docs/ model-cards-and-prompt-formats/llama3$_$2/, 2025

2025

[17] [17]

Introducing gpt-image-1.5, Dec

OpenAI. Introducing gpt-image-1.5, Dec. 2025. URL https://openai.com/zh-Hans-CN/ index/new-chatgpt-images-is-here/. Accessed: 2026-04-23

2025

[18] [18]

Gpt-image-2: A multimodal image generation model

OpenAI. Gpt-image-2: A multimodal image generation model. https://openai.com, 2025. Proprietary model, accessed 2026. 10

2025

[19] [19]

Introducing gpt-5.4, 2026

OpenAI. Introducing gpt-5.4, 2026. URL https://openai.com/zh-Hans-CN/index/ introducing-gpt-5-4/. Accessed: 2026-04-23

2026

[20] [20]

Qwen3.5: Towards native multimodal agents, February 2026

Qwen Team. Qwen3.5: Towards native multimodal agents, February 2026. URL https: //qwen.ai/blog?id=qwen3.5

2026

[21] [21]

Qwen3.6-Plus: Towards real world agents, April 2026

Qwen Team. Qwen3.6-Plus: Towards real world agents, April 2026. URL https://qwen.ai/ blog?id=qwen3.6

2026

[22] [22]

Rombach, A

R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer. High-resolution image synthesis with latent diffusion models, 2021

2021

[23] [23]

S. Su, Q. Yan, Y . Zhu, C. Zhang, X. Ge, J. Sun, and Y . Zhang. Blindly assess image quality in the wild guided by a self-adaptive hyper network. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2020

2020

[24] [24]

H. Tao, C. Huang, N. Wang, H. Lyu, L. Zhang, G. Ke, and X. Fang. Omniscience: A large-scale multi-modal dataset for scientific image understanding.arXiv preprint arXiv:2602.13758, 2026

work page arXiv 2026

[25] [25]

Galactica: A Large Language Model for Science

R. Taylor, M. Kardas, G. Cucurull, T. Scialom, A. Hartshorn, E. Saravia, A. Poulton, V . Kerkez, and R. Stojnic. Galactica: A large language model for science.arXiv preprint arXiv:2211.09085, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022

[26] [26]

K. Team, T. Bai, Y . Bai, Y . Bao, S. Cai, Y . Cao, Y . Charles, H. Che, C. Chen, G. Chen, et al. Kimi k2. 5: Visual agentic intelligence.arXiv preprint arXiv:2602.02276, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026

[27] [27]

V . Team, W. Hong, W. Yu, X. Gu, G. Wang, G. Gan, H. Tang, J. Cheng, J. Qi, J. Ji, L. Pan, S. Duan, W. Wang, Y . Wang, Y . Cheng, Z. He, Z. Su, Z. Yang, Z. Pan, A. Zeng, B. Wang, B. Chen, B. Shi, C. Pang, C. Zhang, D. Yin, F. Yang, G. Chen, J. Xu, J. Zhu, J. Chen, J. Chen, J. Chen, J. Lin, J. Wang, J. Chen, L. Lei, L. Gong, L. Pan, M. Liu, M. Xu, M. Zhang...

work page internal anchor Pith review Pith/arXiv arXiv 2025

[28] [28]

J. Wang, J. Wang, H. Duan, J. Kang, G. Zhai, and X. Min. I2i-bench: A comprehensive benchmark suite for image-to-image editing models.arXiv preprint arXiv:2512.04660, 2025

work page arXiv 2025

[29] [29]

W. Wang, Z. Gao, L. Gu, H. Pu, L. Cui, X. Wei, Z. Liu, L. Jing, S. Ye, J. Shao, et al. Internvl3. 5: Advancing open-source multimodal models in versatility, reasoning, and efficiency.arXiv preprint arXiv:2508.18265, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[30] [30]

X. Wang, Z. Hu, P. Lu, Y . Zhu, J. Zhang, S. Subramaniam, A. R. Loomba, S. Zhang, Y . Sun, and W. Wang. SciBench: Evaluating College-Level Scientific Problem-Solving Abilities of Large Language Models. InProceedings of the Forty-First International Conference on Machine Learning, 2024

2024

[31] [31]

Z. Wang, P. Yin, X. Zhao, C. Tian, Y . Qiao, W. Wang, J. Dai, and G. Luo. Genexam: A multidisciplinary text-to-image exam.arXiv preprint arXiv:2509.14232, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[32] [32]

H. Wei, H. Liu, Z. Wang, Y . Peng, B. Xu, S. Wu, X. Zhang, X. He, Z. Liu, P. Wang, X. Song, Y . Li, Y . Liu, and Y . Zhou. Skywork unipic 3.0: Unified multi-image composition via sequence modeling, 2026. URLhttps://arxiv.org/abs/2601.15664

work page arXiv 2026

[33] [33]

Y . Weng, M. Zhu, Q. Xie, Q. Sun, Z. Lin, S. Liu, and Y . Zhang. Deepscientist: Advancing frontier-pushing scientific findings progressively. InThe Fourteenth International Conference on Learning Representations, 2026. URL https://openreview.net/forum?id=cZFgsLq8Gs

2026

[34] [34]

C. Wu, J. Li, J. Zhou, J. Lin, K. Gao, K. Yan, S. ming Yin, S. Bai, X. Xu, Y . Chen, Y . Chen, Z. Tang, Z. Zhang, Z. Wang, A. Yang, B. Yu, C. Cheng, D. Liu, D. Li, H. Zhang, H. Meng, H. Wei, J. Ni, K. Chen, K. Cao, L. Peng, L. Qu, M. Wu, P. Wang, S. Yu, T. Wen, W. Feng, X. Xu, Y . Wang, Y . Zhang, Y . Zhu, Y . Wu, Y . Cai, and Z. Liu. Qwen-image technical report,

[35] [35]

URLhttps://arxiv.org/abs/2508.02324. 11

work page internal anchor Pith review Pith/arXiv arXiv

[36] [36]

C. Wu, P. Zheng, R. Yan, S. Xiao, X. Luo, Y . Wang, W. Li, X. Jiang, Y . Liu, J. Zhou, Z. Liu, Z. Xia, C. Li, H. Deng, J. Wang, K. Luo, B. Zhang, D. Lian, X. Wang, Z. Wang, T. Huang, and Z. Liu. Omnigen2: Exploration to advanced multimodal generation.arXiv preprint arXiv:2506.18871, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[37] [37]

H. Wu, Z. Zhang, W. Zhang, C. Chen, L. Liao, C. Li, Y . Gao, A. Wang, E. Zhang, W. Sun, et al. Q-align: Teaching lmms for visual scoring via discrete text-defined levels.arXiv preprint arXiv:2312.17090, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[38] [38]

Grok-4.20-0309-reasoning

xAI. Grok-4.20-0309-reasoning. https://docs.x.ai/developers/models/grok-4. 20-0309-reasoning, 2026. Accessed: 2026-04-24

2026

[39] [39]

Z. Xi, G. Li, Y . Fan, H. Guo, Y . Liu, X. Fan, J. Liu, J. Ding, W. Zuo, Z. Yin, L. Bai, T. Ji, T. Gui, Q. Zhang, and X. Huang. Bmmr: A large-scale bilingual multimodal multi-discipline reasoning dataset, 2025. URLhttps://arxiv.org/abs/2507.03483

work page arXiv 2025

[40] [40]

Z. Xu, H. Duan, B. Liu, G. Ma, J. Wang, L. Yang, S. Gao, X. Wang, J. Wang, X. Min, et al. Lmm4edit: Benchmarking and evaluating multimodal image editing with lmms. InProceedings of the 33rd ACM International Conference on Multimedia, pages 6908–6917, 2025

2025

[41] [41]

Zhang, H

Z. Zhang, H. Wu, C. Li, Y . Zhou, W. Sun, X. Min, Z. Chen, X. Liu, W. Lin, and G. Zhai. A- bench: Are lmms masters at evaluating ai-generated images?arXiv preprint arXiv:2406.03070, 2024

work page arXiv 2024

[42] [42]

Zhang, T

Z. Zhang, T. Kou, S. Wang, C. Li, W. Sun, W. Wang, X. Li, Z. Wang, X. Cao, X. Min, et al. Q-eval-100k: Evaluating visual quality and alignment level for text-to-vision content. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 10621–10631, 2025

2025

[43] [43]

Zhang, J

Z. Zhang, J. Wang, F. Wen, Y . Guo, et al. Large multimodal models evaluation: A survey. SCIENCE CHINA Information Sciences, 68(12):221301–221369, 2025. doi: https://doi.org/10. 1007/s11432-025-4676-4

2025

[44] [44]

found": true,

Z. Zhao, D. Ma, L. Chen, L. Sun, Z. Li, Y . Xia, B. Chen, H. Xu, Z. Zhu, S. Zhu, et al. Chemdfm: a large language foundation model for chemistry.arXiv preprint arXiv:2401.14818, 2024. A Limitations This work introduces a novel evaluation framework (SIU2A), formulates a new task, and constructs a corresponding benchmark dataset. However, it does not propos...

work page arXiv 2024

[45] [45]

Proteasomal degradation

Add an arrow originating from the polyubiquitinated Ino80 (the Ino80 molecule with the chain of 4 Ub moieties attached at the top left of the figure) pointing to a new label that reads 'Proteasomal degradation of ubiquitinated Ino80' to include the missing degradation step. 2. In the replication fork region, add the text label 'H2Aub' adjacent to each Ub ...