S1-Omni-Image: A Unified Model for Scientific Image Understanding, Generation, and Editing

Nan Xu; QingLi Wang; Qingxiao Li; Zikai Wang

arxiv: 2606.24441 · v1 · pith:T4YHJKJLnew · submitted 2026-06-23 · 💻 cs.CV

Pith reviewed 2026-06-26 00:18 UTC · model grok-4.3

classification 💻 cs.CV

keywords scientific image generation · image editing · multimodal reasoning · unified model · scientific visualization · think-before-generate · domain-specific image tasks

The pith

A single model handles scientific image understanding, generation, and editing by first producing a reasoning trace whose hidden states then condition the output images.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper builds a unified model on top of an existing scientific multimodal reasoning backbone. The model first generates a task-oriented reasoning trace and special token, then routes their internal states into a generation module to produce or edit images. This setup targets scientific domains where outputs must respect semantics, structural relations, and domain knowledge rather than just visual appearance. If the conditioning works, the same weights can support understanding, creation of diagrams and visualizations, and domain-specific editing tasks without separate fine-tuning for each. Readers would care because scientific image work often requires both deep interpretation and precise synthesis in one workflow.

Core claim

S1-Omni-Image couples the understanding capability of S1-VL-32B with an image generation module under a unified think-before-generate paradigm. Given a user instruction, the model first produces a task-oriented reasoning trace, a textual answer, and a task special token; their hidden states are then injected into the generation module to condition image generation or editing.
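The routing implied by the task special token can be sketched as follows. This is a minimal illustration, not the released inference code: only `<image_edit>` is named in the source (in the Figure 4 text), while `<image_gen>`, the `BackboneOutput` container, and the convention that a missing token means a text-only understanding answer are assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Mapping from task special token to downstream branch.
TOKEN_TO_BRANCH = {
    "<image_gen>": "generate",  # hypothetical token name, not from the source
    "<image_edit>": "edit",     # token named in the Figure 4 text
}


@dataclass
class BackboneOutput:
    """Hypothetical container for what the reasoning backbone emits."""
    reasoning_trace: str
    answer: str
    task_token: Optional[str] = None
    hidden_states: List[List[float]] = field(default_factory=list)


def select_branch(out: BackboneOutput) -> str:
    """Pick the branch that should handle the request.

    Assumption: no task token means a text-only understanding answer,
    so the image decoder is never invoked.
    """
    if out.task_token is None:
        return "understand"
    if out.task_token not in TOKEN_TO_BRANCH:
        raise ValueError(f"unknown task token: {out.task_token!r}")
    return TOKEN_TO_BRANCH[out.task_token]
```

In this sketch the hidden states travel with the output object, so whichever branch is selected can read them as conditioning without a second backbone pass.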

What carries the argument

The think-before-generate paradigm, in which hidden states from the reasoning trace and task special token are injected into the generation module to condition outputs on scientific semantics and domain knowledge.
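One way such an injection could work is a learned projection followed by concatenation with the image tokens, so the generator attends over both. The sketch below is only a toy under that assumption: the paper's actual alignment layer is not specified in this review, and `project`, `build_joint_sequence`, and the tiny weight matrix are hypothetical names and values chosen for illustration.

```python
from typing import List

Matrix = List[List[float]]


def project(states: Matrix, weight: Matrix) -> Matrix:
    """Map backbone hidden states (n x d_in) to generator width (n x d_out)."""
    d_in, d_out = len(weight), len(weight[0])
    for row in states:
        if len(row) != d_in:
            raise ValueError("hidden state width does not match weight rows")
    return [
        [sum(row[i] * weight[i][j] for i in range(d_in)) for j in range(d_out)]
        for row in states
    ]


def build_joint_sequence(states: Matrix, image_tokens: Matrix, weight: Matrix) -> Matrix:
    """Prepend projected reasoning states to the image tokens.

    Assumption: the generator runs attention over this joint sequence,
    so the reasoning states act as conditioning tokens.
    """
    d_out = len(weight[0])
    if any(len(tok) != d_out for tok in image_tokens):
        raise ValueError("image token width does not match projection output")
    return project(states, weight) + image_tokens


# Toy usage: one reasoning state of width 2, one image token of width 3.
joint = build_joint_sequence(
    states=[[1.0, 2.0]],
    image_tokens=[[0.0, 0.0, 1.0]],
    weight=[[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]],
)
```

Cross-attention from image tokens to the projected states would be an equally plausible design; the concatenation form is shown only because it is the shorter one to write down.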

If this is right

The model outperforms open-source models on GenExam and TechImage-Bench for scientific image generation.
It achieves state-of-the-art results on four editing benchmarks: MSD, cigRockSEM, SynthRAD2025, and IXI.
It preserves stable performance on scientific image understanding evaluations inherited from the base model.
Scientific tasks such as segmentation, medical image translation, and super-resolution can be cast as native image editing problems within the same framework.
The approach supports multi-turn illustration editing and text rendering in logical diagrams and data charts.
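To make the idea of casting segmentation as image editing concrete, the sketch below turns an image and a binary mask into an (instruction, source, target) editing sample by painting the masked pixels. How the paper actually renders its targets is not stated in this review; the overlay rendering, the default instruction wording, and the function name `segmentation_as_edit` are all assumptions.

```python
from typing import Dict, List, Tuple

Pixel = Tuple[int, int, int]
Image = List[List[Pixel]]


def segmentation_as_edit(
    image: Image,
    mask: List[List[int]],
    color: Pixel = (255, 0, 0),
    instruction: str = "Highlight the target region in red.",
) -> Dict[str, object]:
    """Build an editing sample whose target is the source with the mask painted in.

    Assumption: the segmentation result is expressed as a recolored copy of
    the input image rather than as a separate label map.
    """
    if len(image) != len(mask) or any(len(r) != len(m) for r, m in zip(image, mask)):
        raise ValueError("image and mask must have the same height and width")
    target = [
        [color if m else px for px, m in zip(row, mask_row)]
        for row, mask_row in zip(image, mask)
    ]
    return {"instruction": instruction, "source": image, "target": target}


# Toy usage: a 2x2 gray image with one masked pixel.
gray = (10, 10, 10)
sample = segmentation_as_edit([[gray, gray], [gray, gray]], [[0, 1], [0, 0]])
```

Under this framing a translation or super-resolution sample would have the same three fields, with only the instruction and the target image changing.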

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same injection mechanism might allow the model to accept external knowledge sources as additional conditioning tokens without retraining the generation weights.
Extending the paradigm to video or 3D scientific data would test whether the reasoning-to-generation bridge generalizes beyond static images.
If the hidden-state conditioning proves robust, separate task-specific models for understanding versus synthesis could become unnecessary in scientific pipelines.
The released SciGenEdit dataset could serve as a starting point for testing whether other reasoning backbones produce usable conditioning states for image tasks.

Load-bearing premise

Hidden states from the reasoning trace can be injected into the generation module to reliably condition image outputs on scientific semantics, structural relations, and domain knowledge without further task-specific adaptation or loss of fidelity.

What would settle it

An ablation that removes the hidden-state injection from the generation module and measures the resulting drop on the MSD, cigRockSEM, SynthRAD2025, and IXI editing benchmarks would show whether the conditioning step is required for the reported gains.
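A harness for that ablation could look roughly like the sketch below, which also covers the zeroed and random-vector variants named in the referee report. It is a toy under stated assumptions: `ablate` and `score_drop` are hypothetical helpers, the Gaussian noise scale is arbitrary, and no benchmark scores are implied.

```python
import random
from typing import List

Matrix = List[List[float]]


def ablate(hidden_states: Matrix, mode: str, seed: int = 0) -> Matrix:
    """Return the conditioning states for one ablation variant.

    "full"   keeps the reasoning states unchanged (copied),
    "zero"   replaces them with zeros,
    "random" replaces them with seeded Gaussian noise of the same shape.
    """
    if mode == "full":
        return [row[:] for row in hidden_states]
    if mode == "zero":
        return [[0.0] * len(row) for row in hidden_states]
    if mode == "random":
        rng = random.Random(seed)
        return [[rng.gauss(0.0, 1.0) for _ in row] for row in hidden_states]
    raise ValueError(f"unknown ablation mode: {mode!r}")


def score_drop(full: float, ablated: float, higher_is_better: bool = True) -> float:
    """Positive values mean the ablated variant is worse than the full model."""
    return full - ablated if higher_is_better else ablated - full
```

A drop that is large for "zero" and "random" on all four editing benchmarks would support the claim that the conditioning step carries the gains, while a drop near zero would point to the data or the base model instead.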

Figures

Figures reproduced from arXiv: 2606.24441 by Nan Xu, QingLi Wang, Qingxiao Li, Zikai Wang.

**Figure 1.** Overall performance of S1-Omni-Image on scientific image generation (a), scientific image editing (b), and scientific image understanding (c). For visualization, the MSD and cigRockSEM scores in the image editing panel are multiplied by 100; bar heights should not be interpreted as directly comparable across different benchmarks.

**Figure 2.** Capability overview of S1-Omni-Image on scientific image generation. The examples demonstrate the model’s ability to generate scientific illustrations from textual instructions, including structured framework diagrams, mechanism illustrations, relational comparisons, data charts, and realistic scientific visualizations.

**Figure 3.** Capability overview of S1-Omni-Image on scientific image editing. The examples cover scientific image segmentation, scientific illustration editing, scientific image super-resolution, and scientific image translation such as CBCT-to-CT conversion. These tasks can be handled as instruction-conditioned image editing.

**Figure 4.** Architecture of S1-Omni-Image. The model uses S1-VL-32B as the scientific multimodal reasoning backbone and injects reasoning representations into an MMDiT image generation and editing module through a reasoning-to-diffusion alignment layer. Understanding outputs are generated by the VLM text branch, while the image decoder is used only for generation and editing. • <image_edit>: route to the image editing…

**Figure 5.** Overview of the SciGenEdit dataset. The figure shows task distribution, language distribution, turn distribution for the full 314K training set, and image-type and discipline distributions for the 25K scientific illustration subset. …editing, and image understanding. Image generation accounts for 32.80%, image editing for 60.83%, and image understanding for 6.37% of the dataset. The image understanding dat…

**Figure 6.** Three-stage training strategy of S1-Omni-Image. Stage I performs scientific reasoning tuning, Stage II trains the reasoning-to-diffusion alignment layer, and Stage III jointly optimizes the alignment layer and DiT on SciGenEdit. …ples, for joint training. The backbone participates in online forward computation but remains frozen, while the alignment layer and MMDiT are trained to enable scientific reasoning…

**Figure 7.** Qualitative comparison with representative models on scientific illustration generation benchmarks. The examples cover GenExam, TechImage-Bench, and CVTG-2K, evaluating scientific illustration generation and complex text rendering.

**Figure 8.** Qualitative comparison with representative models on scientific image editing benchmarks. The examples cover medical image segmentation, rock microstructure segmentation, medical image translation, pathology image translation, and medical super-resolution.

**Figure 9.** Qualitative comparison on image understanding tasks. The example is from HRBench-4K in the Thinking-with-Images benchmark, comparing our model with the multimodal foundation model S1-VL-32B on high-resolution visual reasoning and understanding.

**Figure 10.** Think-before-generate case in scientific illustration generation. The model first produces task-relevant reasoning and layout planning, and then generates a structured scientific illustration conditioned on the reasoning representation.

**Figure 11.** Medical segmentation, image translation, and MRI super-resolution as image editing tasks. It uses reasoning to guide region localization, modality transformation, and detail restoration within a unified framework.

**Figure 12.** Qualitative comparison with representative image generation models. Under the same prompts, S1-Omni-Image more consistently preserves scientific structures, arrow relations, label semantics, and global logical coherence. Compared with other open-source models, it shows clear advantages in scientific visual style and approaches closed-source model behavior in several structure-heavy scientific cases.

**Figure 13.** Multi-turn scientific illustration editing case. The model first generates an initial scientific illustration from the user prompt, and then progressively edits module colors and details, adds modules and text, and preserves cross-turn structure and layout consistency according to consecutive editing instructions. For space, reasoning traces are omitted and only the input prompt, editing instructions, and…

**Figure 14.** Failure cases in text rendering. The examples show that S1-Omni-Image may produce unreadable or incorrect text in scientific images, including glyph-level errors, blurred characters, missing or extra strokes, and semantic substitutions. These failures are especially challenging for dense annotations and Chinese text rendering. Limitations: Although S1-Omni-Image achieves strong performance on scientific …

**Figure 15.** Failure case in instruction following. Given a complex editing instruction, S1-Omni-Image fails to apply the requested visual modification and produces an output that remains nearly unchanged from the input image. This suggests that complex local editing and long-horizon instruction execution remain challenging. Instruction following in complex editing. The model can perform instruction-based scientific i…
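The three-stage schedule described with Figure 6 can be written down as a table of which modules receive gradient updates in each stage. This is a sketch, not the authors' training configuration: the stage keys and the `trainable_modules` helper are hypothetical, and the Stage I and Stage II freezing choices are assumptions, since the source only states that the backbone stays frozen during joint training.

```python
from typing import Dict, List

# True means the module is updated in that stage.
STAGES: Dict[str, Dict[str, bool]] = {
    # Assumption: "scientific reasoning tuning" updates the backbone only.
    "stage1_reasoning_tuning": {"backbone": True, "alignment": False, "dit": False},
    # Assumption: only the alignment layer is trained, everything else frozen.
    "stage2_alignment": {"backbone": False, "alignment": True, "dit": False},
    # Per the Figure 6 text: backbone frozen, alignment layer and DiT trained.
    "stage3_joint": {"backbone": False, "alignment": True, "dit": True},
}


def trainable_modules(stage: str) -> List[str]:
    """List the modules that are updated in the given stage, sorted by name."""
    if stage not in STAGES:
        raise KeyError(f"unknown stage: {stage!r}")
    return sorted(name for name, on in STAGES[stage].items() if on)
```

Keeping the backbone frozen in the later stages is consistent with the claim that understanding performance is preserved, because the weights that produce text answers are not touched while the image branch is trained.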

Original abstract

We present S1-Omni-Image, an open-weight unified multimodal model for scientific image understanding, generation, and editing. Unlike general-purpose image generation models, scientific image tasks require not only high-fidelity synthesis, but also robust understanding of scientific semantics, structural relations, domain knowledge, and task intent. To this end, S1-Omni-Image builds on the scientific multimodal reasoning backbone S1-VL-32B and couples its understanding capability with an image generation module under a unified think-before-generate paradigm. Given a user instruction, the model first produces a task-oriented reasoning trace, a textual answer, and a task special token; their hidden states are then injected into the generation module to condition image generation or editing. S1-Omni-Image supports scientific image understanding, generation, and editing in a unified framework. For generation, it focuses on scientific illustrations and text rendering, including logical diagrams, relational comparisons, data charts, and realistic scientific visualizations. For editing, it casts segmentation and other domain-specific vision tasks as native image editing problems, enabling multi-turn illustration editing, medical and geographic image segmentation, medical image translation, and scientific image super-resolution. We construct SciGenEdit, a 314K-sample training dataset, and release the model weights, inference code, and SciGenEdit-10K. Experiments show that S1-Omni-Image substantially improves scientific image generation and editing while preserving the scientific image understanding capability inherited from S1-VL-32B. It outperforms open-source models on GenExam and TechImage-Bench, achieves state-of-the-art results on four editing benchmarks including MSD, cigRockSEM, SynthRAD2025, and IXI, and maintains stable performance on scientific image understanding evaluations.
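As a quick sanity check on the dataset composition, the task shares quoted in the Figure 5 text (32.80% generation, 60.83% editing, 6.37% understanding) can be turned into approximate sample counts. The total of 314,000 is an assumption, since the source only gives the rounded figure "314K", so the counts below are estimates rather than reported numbers.

```python
from typing import Dict

TOTAL_SAMPLES = 314_000  # assumption: "314K" taken as exactly 314,000

# Task shares in percent, as quoted in the Figure 5 text.
SHARES: Dict[str, float] = {
    "generation": 32.80,
    "editing": 60.83,
    "understanding": 6.37,
}


def approx_counts(total: int, shares: Dict[str, float]) -> Dict[str, int]:
    """Convert percentage shares into rounded sample counts."""
    return {task: round(total * pct / 100) for task, pct in shares.items()}


counts = approx_counts(TOTAL_SAMPLES, SHARES)
```

Under that assumption the split comes to roughly 103K generation, 191K editing, and 20K understanding samples, so editing tasks make up well over half of the training data.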

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

S1-Omni-Image layers a think-before-generate step on S1-VL-32B and ships a 314K dataset plus open weights, but the SOTA claims sit on top of missing ablations and methods details.

The paper adds a unified model called S1-Omni-Image that takes the S1-VL-32B backbone and couples it to an image generation module via hidden states from a reasoning trace and task token. The new pieces are the SciGenEdit 314K training set, the explicit think-before-generate flow, and the decision to treat editing tasks such as medical segmentation and super-resolution as native image edits.

Releasing weights, inference code, and a 10K subset is the clearest positive. The domain focus on scientific diagrams, charts, and medical images is also practical; many labs need tools that respect structural relations and domain conventions rather than generic aesthetics.

The soft spots are straightforward. The abstract states substantial gains and SOTA numbers on GenExam, TechImage-Bench, MSD, cigRockSEM, SynthRAD2025, and IXI, yet supplies no baselines, splits, error bars, or ablation results. The central mechanism—injecting reasoning hidden states to condition generation—receives no controlled test showing it outperforms direct text conditioning or random vectors. Without those checks it is impossible to tell whether the reported improvements come from the new architecture, the added data, or the base model. The soundness score in the Pith Report looks accurate on the evidence available.

This work is for groups already using or extending large multimodal models on scientific data. Readers who want an open starting point for domain-specific generation and editing can extract value from the artifacts even if the evaluation section needs expansion.

It deserves a serious referee because the open release lets others run the necessary ablations. I would send it for review with a clear request for the missing experimental controls on the conditioning step.

Referee Report

2 major / 1 minor

Summary. The manuscript presents S1-Omni-Image, an open-weight unified multimodal model extending the S1-VL-32B scientific reasoning backbone for image understanding, generation, and editing. It employs a think-before-generate paradigm in which a task-oriented reasoning trace, textual answer, and task special token are first produced; their hidden states are then injected into a generation module to condition outputs on scientific semantics, structural relations, and domain knowledge. The model is trained on the newly introduced SciGenEdit dataset (314K samples) and evaluated on generation benchmarks (GenExam, TechImage-Bench) and editing benchmarks (MSD, cigRockSEM, SynthRAD2025, IXI), with claims of substantial improvements over open-source models, SOTA editing performance, and preserved understanding capability.

Significance. If the results hold, the work provides a concrete step toward unified scientific multimodal models that integrate reasoning with controllable generation and editing, which could benefit domains requiring high-fidelity, semantically accurate imagery. The explicit release of model weights, inference code, and the SciGenEdit-10K subset is a clear strength for reproducibility and downstream use.

major comments (2)

[Experiments] Experiments section: The reported gains on GenExam, TechImage-Bench, and the four editing benchmarks are end-to-end results only; no ablations are described that test the contribution of the injected hidden states from the reasoning trace and task special token (e.g., zeroing the states, replacing with random vectors, or comparing against direct text conditioning). This mechanism is load-bearing for the central claim that the think-before-generate paradigm transfers S1-VL-32B understanding without further adaptation or fidelity loss.
[Method] Method section (architecture description): The precise mechanism for injecting the hidden states into the generation module, including any additional parameters, alignment losses, or training stages required for the coupling, is not detailed enough to assess whether the conditioning reliably preserves scientific semantics or introduces artifacts.

minor comments (1)

[Abstract] Abstract: The SOTA and 'substantially improves' claims would be strengthened by explicit mention of the primary competing methods and quantitative margins even at a high level.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and the recommendation for major revision. We address each major comment below and will revise the manuscript to incorporate the requested clarifications and additions.

Referee: [Experiments] Experiments section: The reported gains on GenExam, TechImage-Bench, and the four editing benchmarks are end-to-end results only; no ablations are described that test the contribution of the injected hidden states from the reasoning trace and task special token (e.g., zeroing the states, replacing with random vectors, or comparing against direct text conditioning). This mechanism is load-bearing for the central claim that the think-before-generate paradigm transfers S1-VL-32B understanding without further adaptation or fidelity loss.

Authors: We agree that the manuscript would be strengthened by explicit ablations isolating the role of the injected hidden states. The current end-to-end results and comparisons to open-source baselines demonstrate overall gains, but do not directly quantify the contribution of the reasoning trace and task token states. In the revised manuscript we will add ablation studies in the Experiments section, including variants that zero the states, replace them with random vectors, and compare against direct text conditioning, to directly test the load-bearing claim.

Revision: yes

Referee: [Method] Method section (architecture description): The precise mechanism for injecting the hidden states into the generation module, including any additional parameters, alignment losses, or training stages required for the coupling, is not detailed enough to assess whether the conditioning reliably preserves scientific semantics or introduces artifacts.

Authors: We acknowledge that the current method description is insufficiently detailed on the injection process. In the revised manuscript we will expand the architecture subsection to specify the exact injection mechanism (e.g., cross-attention layers or concatenation), any newly introduced parameters, the alignment losses employed, and the staged training procedure used to couple the reasoning backbone with the generation module. This will enable readers to evaluate semantic preservation and potential artifacts.

Revision: yes

Circularity Check

0 steps flagged

No significant circularity in model construction or empirical claims

The paper describes an architectural extension of the S1-VL-32B backbone by injecting hidden states from a reasoning trace and task token into a new generation module under a think-before-generate paradigm, then reports end-to-end benchmark results on GenExam, TechImage-Bench, MSD, cigRockSEM, SynthRAD2025, and IXI plus stable understanding performance. No equations, fitted parameters, or first-principles derivations are presented that reduce by construction to the inputs; the central claims rest on the new SciGenEdit dataset and external benchmark comparisons rather than self-referential definitions or load-bearing self-citations. The inheritance of understanding capability from the cited base model is a standard transfer step and does not create a circular reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no explicit free parameters, axioms, or invented entities are stated. The approach implicitly assumes standard large-model training dynamics and transfer from the backbone model.

[1] [1]

Qwen-vl: A frontier large vision-language model with versatile abilities

Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen-vl: A frontier large vision-language model with versatile abilities. arXiv preprint arXiv:2308.12966,

Pith/arXiv arXiv

[2] [2]

Intern-s1: A scientiﬁc multimodal foundation model

Lei Bai, Zhongrui Cai, Yuhang Cao, Maosong Cao, Weihan Cao, Chiyu Chen, Haojiong Chen, Kai Chen, Pengcheng Chen, Ying Chen, et al. Intern-s1: A scientiﬁc multimodal foundation model. arXiv preprint arXiv:2508.15763, 2025a. Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, et al. Q...

arXiv

[3] [3]

28 Valentin Boussot, Cédric Hémon, Jean-Claude Nunes, and Jean-Louis Dillenseger

Accessed 2026-06-05. 28 Valentin Boussot, Cédric Hémon, Jean-Claude Nunes, and Jean-Louis Dillenseger. Why registra- tion quality matters: Enhancing sct synthesis with impact-based registration. arXiv preprint arXiv:2510.21358,

arXiv 2026

[4] [4]

Gmai-mmbench: A comprehensive multimodal evaluation bench- mark towards general medical ai

Pengcheng Chen, Jin Ye, Guoan Wang, Yanjun Li, Zhongying Deng, Wei Li, Tianbin Li, Haodong Duan, Ziyan Huang, Yanzhou Su, et al. Gmai-mmbench: A comprehensive multimodal evaluation bench- mark towards general medical ai. Advances in Neural Information Processing Systems , 37:94327–94427, 2024a. Xiaokang Chen, Zhiyu Wu, Xingchao Liu, Zizheng Pan, Wen Liu, ...

Pith/arXiv arXiv

[5] [5]

Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks

Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Muyan Zhong, Qinglong Zhang, Xizhou Zhu, Lewei Lu, et al. Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. In Proceedings of the IEEE/CVF conference on computer vision and pattern recogni- tion, pp. 24185–24198, 2024b. Gheorghe Comanici, Eric B...

Pith/arXiv arXiv

[6] [6]

Diffedit: Diffusion-based semantic image editing with mask guidance

Guillaume Couairon, Jakob V erbeek, Holger Schwenk, and Matthieu Cord. Diffedit: Diffusion-based semantic image editing with mask guidance. arXiv preprint arXiv:2210.11427,

arXiv

[7] [7]

Emerging properties in uniﬁed multimodal pretraining

Chaorui Deng, Deyao Zhu, Kunchang Li, Chenhui Gou, Feng Li, Zeyu Wang, Shu Zhong, Weihao Yu, Xiaonan Nie, Ziang Song, et al. Emerging properties in uniﬁed multimodal pretraining. arXiv preprint arXiv:2505.14683,

Pith/arXiv arXiv

[8]

Textcrafter: Accurately rendering multiple texts in complex visual scenes

Nikai Du, Zhennan Chen, Zhizhou Chen, Shan Gao, Xi Chen, Zhengkai Jiang, Jian Yang, and Ying Tai. Textcrafter: Accurately rendering multiple texts in complex visual scenes. arXiv preprint arXiv:2503.23461,

[9]

Seed-x: Multimodal models with unified multi-granularity comprehension and generation

Yuying Ge, Sijie Zhao, Jinguo Zhu, Yixiao Ge, Kun Yi, Lin Song, Chen Li, Ding Xiaohan, and Ying Shan. Seed-x: Multimodal models with unified multi-granularity comprehension and generation. arXiv preprint arXiv:2404.14396,

[10]

Gemini: A family of highly capable multimodal models

Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M. Dai, Anja Hauth, Katie Millican, et al. Gemini: A family of highly capable multimodal models. arXiv preprint arXiv:2312.11805,

[11]

All-in-one medical image-to-image translation

Luyi Han, Tao Tan, Yunzhi Huang, Haoran Dou, Tianyu Zhang, Yuan Gao, Xin Wang, Chunyao Lu, Xinglong Liang, Yue Sun, et al. All-in-one medical image-to-image translation. Cell Reports Methods, 5(8):101138,

[12]

Prompt-to-prompt image editing with cross attention control

Amir Hertz, Ron Mokady, Jay Tenenbaum, Kfir Aberman, Yael Pritch, and Daniel Cohen-Or. Prompt-to-prompt image editing with cross attention control. arXiv preprint arXiv:2208.01626,

[13]

Lance: Unified multimodal modeling by multi-task synergy

Lance Team. Lance: Unified multimodal modeling by multi-task synergy. arXiv preprint arXiv:2605.18678,

[14]

S1-vl: Scientific multimodal reasoning model with thinking-with-images

Qingxiao Li, Lifeng Xu, Qingli Wang, Yudong Bai, Mingwei Ou, Shu Hu, and Nan Xu. S1-vl: Scientific multimodal reasoning model with thinking-with-images. arXiv preprint arXiv:2604.21409,

[15]

Step1x-edit: A practical framework for general image editing

Shiyu Liu, Yucheng Han, Peng Xing, Fukun Yin, Rui Wang, Wei Cheng, Jiaqi Liao, Yingming Wang, Honghao Fu, Chunrui Han, et al. Step1x-edit: A practical framework for general image editing. arXiv preprint arXiv:2504.17761, 2025b. Zhiheng Liu, Weiming Ren, Haozhe Liu, Zijian Zhou, Shoufa Chen, Haonan Qiu, Xiaoke Huang, Zhaochong An, Fanny Yang, Aditya Pate...

[16]

Techimage-bench: Rubric-based evaluation for professional image generation

Minheng Ni, Zhengyuan Yang, Yaowen Zhang, Linjie Li, Chung-Ching Lin, Kevin Lin, Zhendong Wang, Xiaofei Wang, Shujie Liu, Lei Zhang, Wangmeng Zuo, and Lijuan Wang. Techimage-bench: Rubric-based evaluation for professional image generation. arXiv preprint arXiv:2512.12220,

[17]

Figgen: Text to scientific figure generation

Juan A. Rodríguez, David Vázquez, Issam H. Laradji, Marco Pedersoli, and Pau Rodríguez. Figgen: Text to scientific figure generation. arXiv preprint arXiv:2306.00800,

[18]

Openai gpt-5 system card

Aaditya Singh, Adam Fry, Adam Perelman, Adam Tart, Adi Ganesh, Ahmed El-Kishky, Aidan McLaughlin, Aiden Low, AJ Ostrow, Akhila Ananthram, et al. Openai gpt-5 system card. arXiv preprint arXiv:2601.03267,

[19]

Thinking with images for multimodal reasoning: Foundations, methods, and future frontiers

Zhaochen Su, Peng Xia, Hangyu Guo, Zhenhua Liu, Yan Ma, Xiaoye Qu, Jiaqi Liu, Yanshu Li, Kaide Zeng, Zhengyuan Yang, et al. Thinking with images for multimodal reasoning: Foundations, methods, and future frontiers. arXiv preprint arXiv:2506.23918,

[20]

Generative multimodal models are in-context learners

Quan Sun, Yufeng Cui, Xiaosong Zhang, Fan Zhang, Qiying Yu, Yueze Wang, Yongming Rao, Jingjing Liu, Tiejun Huang, and Xinlong Wang. Generative multimodal models are in-context learners. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 14398–14409, 2024a. Quan Sun, Qiying Yu, Yufeng Cui, Fan Zhang, Xiaosong Zhang, ...

[21]

Qwen-image technical report

Chenfei Wu, Jiahao Li, Jingren Zhou, Junyang Lin, et al. Qwen-image technical report. arXiv preprint arXiv:2508.02324, 2025a. Chengyue Wu, Xiaokang Chen, Zhiyu Wu, Yiyang Ma, Xingchao Liu, Zizheng Pan, Wen Liu, Zhenda Xie, Xingkai Yu, Chong Ruan, et al. Janus: Decoupling visual encoding for unified multimodal understanding and generation. In Proceedings ...

[22]

Show-o: One single transformer to unify multimodal understanding and generation

Jinheng Xie, Weijia Mao, Zechen Bai, David Junhao Zhang, Weihao Wang, Kevin Qinghong Lin, Yuchao Gu, Zhijie Chen, Zhenheng Yang, and Mike Zheng Shou. Show-o: One single transformer to unify multimodal understanding and generation. In International Conference on Learning Representations, volume 2025, pp. 28240–28264, 2025.

[23]

Imgedit: A unified image editing dataset and benchmark

Yang Ye, Xianyi He, Zongjian Li, Bin Lin, Shenghai Yuan, Zhiyuan Yan, Bohan Hou, and Li Yuan. Imgedit: A unified image editing dataset and benchmark. arXiv preprint arXiv:2505.20275,

[24]

A benchmark dataset and baseline methods for rock microstructure interpretation in sem images

Yao Zhang, Xinming Wu, and Jiachun You. A benchmark dataset and baseline methods for rock microstructure interpretation in sem images. Scientific Data, 12(1):1671, 2025a. Yi-Fan Zhang, Xingyu Lu, Shukang Yin, Chaoyou Fu, Wei Chen, Xiao Hu, Bin Wen, Kaiyu Jiang, Changyi Liu, Tianke Zhang, et al. Thyme: Think beyond images. arXiv preprint arXiv:2508.116...

[25]

Autofigure: Generating and refining publication-ready scientific illustrations

Minjun Zhu, Zhen Lin, Yixuan Weng, Panzhong Lu, Qiujie Xie, Yifan Wei, Sifan Liu, Qiyao Sun, and Yue Zhang. Autofigure: Generating and refining publication-ready scientific illustrations. arXiv preprint arXiv:2602.03828,