DisciplineGen-1M: A Large-Scale Dataset for Multidisciplinary Visual Generation and Editing

Junchi Yan; Leyao Gu; Mingxin Liu; Mohan Zhang; Ning Liao; Shaofeng Zhang; Xiangyu Zhao; Xuanhe Zhou; Xue Yang; Yiguo He

arxiv: 2607.02290 · v1 · pith:ZNQ5SZBEnew · submitted 2026-07-02 · 💻 cs.CV


Zhaokai Wang, Mingxin Liu, Zirun Zhu, Ziqian Fan, Yiguo He, Mohan Zhang, Leyao Gu, Xiangyu Zhao, Ning Liao, Shaofeng Zhang, Xuanhe Zhou, Zhihang Zhong, Junchi Yan, Xue Yang


Pith reviewed 2026-07-03 15:37 UTC · model grok-4.3

classification 💻 cs.CV

keywords DisciplineGen-1M · multidisciplinary dataset · visual generation · image editing · knowledge-grounded generation · text-to-image · academic diagrams · structured annotations


The pith

DisciplineGen-1M supplies 1.2 million multidisciplinary samples for training image models to produce accurate diagrams and edits rather than merely plausible-looking visuals.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents a million-scale dataset drawn from ten academic fields to address the failure of current image generators on knowledge-intensive content. It claims that structured visual data tied to disciplinary concepts, symbolic structures, and precise spatial relations can shift generation toward verifiable correctness. The authors construct the samples through combined vector-graphics rendering, OCR editing, programmatic synthesis, and filtering pipelines that also supply captions, instructions, and controllable paired images. They train a reasoning-generation model on the data and report gains on discipline-specific benchmarks plus transfer to broader reasoning tests. The work positions large structured academic visuals as essential for moving beyond aesthetic outputs.

Core claim

DisciplineGen-1M contains 1.2M samples across mathematics, physics, chemistry, biology, geography, computer science, economics, history, music, and sports. Its construction framework of vector-graphics rendering, OCR-based editing, curated programmatic synthesis, and large-scale text-to-image filtering yields captions, editing instructions, structured annotations, and image pairs with controllable semantic differences. A discipline-informed model trained on the dataset produces substantial gains on GenExam and GRADE while transferring to WISE and RISE, supporting the claim that such data moves generation toward knowledge-grounded creation.

What carries the argument

The scalable construction framework that combines vector-graphics rendering, OCR-based editing, curated programmatic synthesis, and large-scale text-to-image filtering to generate paired images, captions, editing instructions, and structured annotations with controllable semantic differences.
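
To make "paired images with controllable semantic differences" concrete, here is a minimal sketch of programmatic synthesis, assuming a toy SVG triangle as the diagram. The names `Triangle`, `render_triangle`, and `make_edit_pair`, the annotation fields, and the instruction template are all hypothetical illustrations; the paper's pipeline is only described at a high level here, so this should not be read as the authors' implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Triangle:
    """Structured annotation for a toy geometry diagram (hypothetical schema)."""
    a: tuple[float, float]
    b: tuple[float, float]
    c: tuple[float, float]


def render_triangle(t: Triangle, size: int = 200) -> str:
    """Render the annotation to an SVG string, so the vector source is the ground truth."""
    points = " ".join(f"{x:g},{y:g}" for x, y in (t.a, t.b, t.c))
    labels = "".join(
        f'<text x="{x:g}" y="{y:g}">{name}</text>'
        for name, (x, y) in zip("ABC", (t.a, t.b, t.c))
    )
    return (
        f'<svg xmlns="http://www.w3.org/2000/svg" width="{size}" height="{size}">'
        f'<polygon points="{points}" fill="none" stroke="black"/>{labels}</svg>'
    )


def make_edit_pair(t: Triangle, new_c: tuple[float, float]) -> dict:
    """Build a (source, target, instruction) sample that differs in exactly one vertex."""
    edited = Triangle(t.a, t.b, new_c)
    return {
        "source_svg": render_triangle(t),
        "target_svg": render_triangle(edited),
        "instruction": f"Move vertex C from {t.c} to {new_c}.",
        "annotation": {"changed": "c", "before": t.c, "after": new_c},
    }


# Toy example: turn a generic triangle into one with a right angle at A.
pair = make_edit_pair(Triangle((20, 180), (180, 180), (100, 40)), new_c=(20, 40))
print(pair["instruction"])
```

Because both images are rendered from the same parametric description, the semantic difference between them is known exactly, which is the property the pairing argument depends on.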

If this is right

Models trained on the dataset improve performance on discipline-related benchmarks GenExam and GRADE.
The training approach transfers to gains on general reasoning-informed benchmarks WISE and RISE.
The dataset directly supports both text-to-image generation and image editing with knowledge-grounded outputs.
Public release of the dataset, model, and curation pipeline enables reproducibility and extension by others.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The controllable differences in paired samples could let researchers isolate which spatial or symbolic features most affect model accuracy.
Similar curation methods might scale to additional domains such as engineering diagrams or medical illustrations if the same pipeline components are adapted.
Widespread use of such datasets could reduce reliance on post-hoc correction in educational or scientific visualization tools.
The emphasis on verifiable correctness over visual appeal alone may encourage new evaluation metrics focused on conceptual fidelity.

Load-bearing premise

The construction pipelines produce samples whose correctness matches disciplinary concepts and whose semantic differences are controllable enough to drive measurable model gains.

What would settle it

If a model trained on DisciplineGen-1M showed no improvement, or a decline, on the GenExam and GRADE benchmarks relative to open-source baselines trained without the data, the claim of measurable gains from this structured academic visual data would be falsified.
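
One way such a comparison could be run is a paired bootstrap over per-item benchmark scores. The sketch below assumes each benchmark item yields one numeric score per model; the function names, the 95% level, and the toy scores are assumptions for illustration, not values or procedures from the paper.

```python
import random
from statistics import mean


def paired_bootstrap_ci(baseline, tuned, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean per-item gain (tuned - baseline)."""
    if len(baseline) != len(tuned) or not baseline:
        raise ValueError("need two non-empty score lists of equal length")
    diffs = [t - b for b, t in zip(baseline, tuned)]
    rng = random.Random(seed)
    # Resample items with replacement and record the mean gain each time.
    means = sorted(
        mean(rng.choices(diffs, k=len(diffs))) for _ in range(n_resamples)
    )
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return mean(diffs), lo, hi


def gain_not_supported(baseline, tuned):
    """True when the interval for the gain does not lie strictly above zero."""
    _, lo, _ = paired_bootstrap_ci(baseline, tuned)
    return lo <= 0


# Hypothetical per-item correctness scores (1 = correct, 0 = incorrect).
baseline_scores = [0, 0, 1, 0, 1, 0, 0, 1]
tuned_scores = [1, 0, 1, 1, 1, 0, 1, 1]
gain, lo, hi = paired_bootstrap_ci(baseline_scores, tuned_scores)
print(f"mean gain {gain:.3f}, 95% CI [{lo:.3f}, {hi:.3f}]")
```

Pairing by item matters because both models are scored on the same prompts; resampling items rather than models keeps that dependence intact.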

Original abstract

Recent image generation and editing models can produce visually appealing natural images, yet they remain unreliable when the target image is a knowledge-intensive diagram whose correctness depends on disciplinary concepts, symbolic structure, and precise spatial relations. We introduce DisciplineGen-1M, a million-scale multidisciplinary dataset that supports text-to-image generation and image editing. It contains 1.2M samples spanning mathematics, physics, chemistry, biology, geography, computer science, economics, history, music, and sports. To construct the dataset, we design a scalable framework that combines vector-graphics rendering, OCR-based editing, curated programmatic synthesis, and large-scale text-to-image filtering. These pipelines produce captions, editing instructions, structured annotations, and paired images with controllable semantic differences. Building on DisciplineGen-1M, we further introduce a discipline-informed reasoning-generation model for both text-to-image generation and image editing. Experiments on discipline-related benchmarks, GenExam and GRADE, show substantial improvements over open-source baselines, while evaluations on general reasoning-informed benchmarks, WISE and RISE, further indicate broader transfer. The results suggest that large-scale structured academic visual data is a key ingredient for moving image generation from aesthetic plausibility toward verifiable knowledge-grounded visual creation. We will publicly release our dataset, model, and source code of the data curation pipeline to ensure reproducibility and benefit future research.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

New 1.2M multidisciplinary diagram dataset with mixed construction pipelines, but T2I filtering step lacks described verification for factual correctness.


The paper's main contribution is DisciplineGen-1M, a 1.2M-sample dataset covering ten disciplines with paired images, captions, and editing instructions. The construction mixes vector-graphics rendering, OCR editing, programmatic synthesis, and T2I filtering, then trains a discipline-informed model that reports gains on GenExam and GRADE plus some transfer to WISE and RISE.

They do the scaling and breadth well, and committing to release the data, model, and curation code is the right move for a dataset paper. That alone gives the work a concrete use case for anyone trying to move image generation toward technical accuracy.

The soft spot is the T2I filtering pipeline. The abstract itself states that current T2I models are unreliable for knowledge-intensive diagrams, yet the text does not describe post-filter checks, expert review, or automated verification that the retained samples have correct symbolic structure and spatial relations. If a non-trivial fraction of the data comes from this route and carries errors, the benchmark improvements cannot be cleanly attributed to better training data. The abstract also gives no numbers or error analysis, so the size of the gains is still opaque.

This is for researchers in multimodal generation who need structured academic visuals. It is worth sending to peer review because the artifact is new and the release plan lets the community test the quality claims directly. Expect referees to press on the verification gap.

Referee Report

2 major / 2 minor

Summary. The paper introduces DisciplineGen-1M, a 1.2M-sample multidisciplinary dataset spanning 10 fields for text-to-image generation and editing. It is built via vector-graphics rendering, OCR-based editing, curated programmatic synthesis, and large-scale T2I filtering, and is paired with a discipline-informed reasoning-generation model. Experiments claim substantial gains on GenExam and GRADE plus transfer to WISE and RISE, supporting the thesis that large-scale structured academic visual data enables verifiable knowledge-grounded image creation.

Significance. If the dataset samples are verifiably correct and the benchmark gains are reproducible, the work would supply a concrete resource and modeling approach for shifting image generation toward disciplinary fidelity rather than mere visual plausibility.

major comments (2)

[Abstract] The assertion of 'substantial improvements' on GenExam and GRADE is unsupported by any quantitative numbers, error bars, or factual-correctness verification in the provided text; without these, the central claim that the dataset drives 'verifiable knowledge-grounded' gains cannot be evaluated.
[Construction framework, T2I filtering pipeline] The manuscript acknowledges that existing T2I models 'remain unreliable when the target image is a knowledge-intensive diagram whose correctness depends on disciplinary concepts,' yet describes no post-filter verification, expert review, or automated symbolic/spatial checks; if a non-negligible fraction of the 1.2M samples originates from this step, the measured benchmark gains cannot be attributed to the claimed correctness property.
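
As an illustration of what an "automated symbolic/spatial check" might look like, the sketch below validates a structured annotation against its own coordinates. The sample schema (`points`, `right_angles`), the canvas size, and the absolute tolerance are hypothetical; nothing here is taken from the manuscript, which does not describe its checks.

```python
def is_right_angle(vertex, p, q, tol=1e-6):
    """True if the angle p-vertex-q is 90 degrees (dot product of the two rays is ~0)."""
    ux, uy = p[0] - vertex[0], p[1] - vertex[1]
    vx, vy = q[0] - vertex[0], q[1] - vertex[1]
    # Absolute tolerance is scale-dependent; adequate for a sketch on pixel coordinates.
    return abs(ux * vx + uy * vy) <= tol


def check_sample(sample, canvas=(200, 200)):
    """Return a list of failed checks for one annotated sample (empty list = pass)."""
    failures = []
    pts = sample["points"]
    # Spatial check: every labelled point must lie on the canvas.
    for name, (x, y) in pts.items():
        if not (0 <= x <= canvas[0] and 0 <= y <= canvas[1]):
            failures.append(f"point {name} lies outside the canvas")
    # Symbolic check: every angle annotated as right must actually be right.
    for v, p, q in sample.get("right_angles", []):
        if not is_right_angle(pts[v], pts[p], pts[q]):
            failures.append(f"angle {p}{v}{q} is annotated as right but is not")
    return failures


good = {
    "points": {"A": (20, 180), "B": (180, 180), "C": (20, 40)},
    "right_angles": [("A", "B", "C")],
}
bad = {
    "points": {"A": (20, 180), "B": (180, 180), "C": (100, 40)},
    "right_angles": [("A", "B", "C")],
}
print(check_sample(good))
print(check_sample(bad))
```

Checks of this kind only work when a sample carries machine-readable structure, which is why the referee's concern is sharpest for the T2I-filtered portion, where no vector source exists.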

minor comments (2)

Clarify the exact fraction of samples produced by each pipeline (vector-graphics, OCR, programmatic, T2I) so readers can assess the weight of the unverified filtering component.
Add explicit citations and brief descriptions for GenExam, GRADE, WISE, and RISE in the abstract and introduction.
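
The per-pipeline breakdown requested in the first minor comment is cheap to compute if each sample records its provenance. The sketch below assumes a hypothetical `pipeline` field on every record; the labels and counts are toy values, not figures from the paper.

```python
from collections import Counter


def pipeline_fractions(samples):
    """Map each construction pipeline to its share of the dataset."""
    counts = Counter(s["pipeline"] for s in samples)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {name: n / total for name, n in counts.items()}


# Toy records with made-up provenance labels.
toy_samples = [{"pipeline": "vector"}] * 3 + [{"pipeline": "t2i"}]
print(pipeline_fractions(toy_samples))
```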

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on the abstract and data construction pipeline. We address each major comment below and indicate the revisions planned for the next version of the manuscript.


Referee: [Abstract] The assertion of 'substantial improvements' on GenExam and GRADE is unsupported by any quantitative numbers, error bars, or factual-correctness verification in the provided text; without these, the central claim that the dataset drives 'verifiable knowledge-grounded' gains cannot be evaluated.

Authors: We agree that the abstract should include quantitative support. The submitted abstract summarized the experimental outcomes without specific metrics. In the revised manuscript we will expand the abstract to report the concrete improvements (e.g., accuracy deltas and standard deviations) on GenExam and GRADE relative to the open-source baselines, thereby grounding the claim of substantial gains.

Revision: yes

Referee: [Construction framework, T2I filtering pipeline] The manuscript acknowledges that existing T2I models 'remain unreliable when the target image is a knowledge-intensive diagram whose correctness depends on disciplinary concepts,' yet describes no post-filter verification, expert review, or automated symbolic/spatial checks; if a non-negligible fraction of the 1.2M samples originates from this step, the measured benchmark gains cannot be attributed to the claimed correctness property.

Authors: The concern is valid. The T2I filtering step is one of four construction pipelines and the manuscript does not detail post-filter verification procedures. We will revise the construction section to (i) report the approximate fraction of samples produced by each pipeline, (ii) describe any automated spatial or symbolic consistency checks that were applied during filtering, and (iii) add an explicit limitations paragraph discussing the residual risk that some T2I-generated samples may contain factual inaccuracies. The benchmark gains are measured on held-out discipline-specific tests; we will clarify that these results provide empirical support but do not constitute exhaustive verification of every sample.

Revision: partial

Circularity Check

0 steps flagged

No circularity: dataset construction and benchmark results are independent of self-referential definitions or fits


The paper describes a data curation pipeline (vector-graphics rendering, OCR editing, programmatic synthesis, T2I filtering) and reports gains on GenExam/GRADE and WISE/RISE benchmarks. No equations, fitted parameters, predictions, or uniqueness theorems appear. The central claim that structured academic data enables knowledge-grounded generation rests on external benchmark measurements rather than any reduction to the paper's own inputs by construction. No self-citations are invoked as load-bearing premises. This is a standard non-circular empirical contribution.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The abstract provides no explicit free parameters, axioms, or invented entities; all construction details are described as high-level pipelines, without numerical fitting or newly postulated objects.

pith-pipeline@v0.9.1-grok · 5820 in / 1050 out tokens · 19096 ms · 2026-07-03T15:37:00.357195+00:00 · methodology


Reference graph

Works this paper leans on

117 extracted references · 20 canonical work pages · 13 internal anchors

[1]

Biochemistry Free For All

Kevin Ahern, Indira Rajagopal, and Taralyn Tan. Biochemistry Free For All. Oregon State University, Corvallis, OR, 2018. Open textbook licensed under a Creative Commons Attribution-NonCommercial license

2018
[2]

historical-basemaps, 2026

aourednik. historical-basemaps, 2026. Online; accessed 14-May-2026

2026
[3]

Qwen3-VL Technical Report

Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, 11 Chunjiang Ge, Wenbin Ge, Zhifang Guo, Qidong Huang, Jie Huang, Fei Huang, Binyuan Hui, Shutong Jiang, Zhaohai Li, Mingsheng Li, Mei Li, Kaixin Li, Zicheng Lin, Junyang Lin, Xuejing Liu, Jiawei Liu, Chenglong Liu, Yang Liu, Dayiheng Liu, Shixu...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[4]

Automatikz: Text-guided synthesis of scientific vector graphics with tikz

Jonas Belouadi, Anne Lauscher, and Steffen Eger. Automatikz: Text-guided synthesis of scientific vector graphics with tikz. In International Conference on Learning Representations, volume 2024, pages 55917–55943, 2024

2024
[5]

Tim Brooks, Aleksander Holynski, and Alexei A. Efros. Instructpix2pix: Learning to follow image editing instructions, 2022

2022
[6]

Conceptual 12m: Pushing web-scale image-text pre-training to recognize long-tail visual concepts, 2021

Soravit Changpinyo, Piyush Sharma, Nan Ding, and Radu Soricut. Conceptual 12m: Pushing web-scale image-text pre-training to recognize long-tail visual concepts, 2021

2021
[7]

Scaleedit-12m: Scaling open-source image editing data generation via multi-agent framework, 2026

Guanzhou Chen, Erfei Cui, Changyao Tian, Danni Yang, Ganlin Yang, Yu Qiao, Hongsheng Li, Gen Luo, and Hongjie Zhang. Scaleedit-12m: Scaling open-source image editing data generation via multi-agent framework, 2026

2026
[8]

Janus-pro: Unified multimodal understanding and generation with data and model scaling, 2025

Xiaokang Chen, Zhiyu Wu, Xingchao Liu, Zizheng Pan, Wen Liu, Zhenda Xie, Xingkai Yu, and Chong Ruan. Janus-pro: Unified multimodal understanding and generation with data and model scaling, 2025

2025
[9]

Wikimedia commons, 2026

Wikimedia Commons. Wikimedia commons, 2026. Online; accessed 14-May-2026

2026
[10]

PaddleOCR 3.0 Technical Report

Cheng Cui, Ting Sun, Manhui Lin, Tingquan Gao, Yubo Zhang, Jiaxuan Liu, Xueqing Wang, Zelun Zhang, Changda Zhou, Hongen Liu, Yue Zhang, Wenyu Lv, Kui Huang, Yichao Zhang, Jing Zhang, Jun Zhang, Yi Liu, Dianhai Yu, and Yanjun Ma. Paddleocr 3.0 technical report. arXiv preprint arXiv:2507.05595, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[11]

Emerging Properties in Unified Multimodal Pretraining

Chaorui Deng, Deyao Zhu, Kunchang Li, Chenhui Gou, Feng Li, Zeyu Wang, Shu Zhong, Weihao Yu, Xiaonan Nie, Ziang Song, et al. Emerging properties in unified multimodal pretraining. arXiv preprint arXiv:2505.14683, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[12]

Diffusion models beat gans on image synthesis

Prafulla Dhariwal and Alexander Nichol. Diffusion models beat gans on image synthesis. Advances in neural information processing systems, 34:8780–8794, 2021

2021
[13]

SenseNova-U1: Unifying Multimodal Understanding and Generation with NEO-unify Architecture

Haiwen Diao, Penghao Wu, Hanming Deng, Jiahao Wang, Shihao Bai, Silei Wu, Weichen Fan, Wenjie Ye, Wenwen Tong, Xiangyu Fan, et al. Sensenova-u1: Unifying multimodal understanding and generation with neo-unify architecture. arXiv preprint arXiv:2605.12500, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[14]

Gemini 3 flash: frontier intelligence built for speed

Google. Gemini 3 flash: frontier intelligence built for speed. https://blog.google/products-and-platforms/products/ gemini/gemini-3-flash/, 2025

2025
[15]

Introducing nano banana pro.https://blog.google/innovation-and-ai/products/nano-banana-pro/, 2025

Google. Introducing nano banana pro.https://blog.google/innovation-and-ai/products/nano-banana-pro/, 2025

2025
[16]

Nano banana 2: Combining pro capabilities with lightning-fast speed

Google. Nano banana 2: Combining pro capabilities with lightning-fast speed. https://blog.google/innovation-and-ai/ technology/ai/nano-banana-2/, 2025

2025
[17]

Greenlaw and David Shapiro

Steven A. Greenlaw and David Shapiro. Principles of Economics 2e. OpenStax, Houston, TX, 2017. Open textbook, licensed under CC BY 4.0

2017
[18]

Greenlaw and David Shapiro

Steven A. Greenlaw and David Shapiro. Principles of Macroeconomics 2e. OpenStax, Houston, TX, 2017. Open textbook, licensed under CC BY 4.0

2017
[19]

Greenlaw and David Shapiro

Steven A. Greenlaw and David Shapiro. Principles of Microeconomics 2e. OpenStax, Houston, TX, 2017. Open textbook, licensed under CC BY 4.0

2017
[20]

Enabling Factorized Piano Music Modeling and Generation with the MAESTRO Dataset

Curtis Hawthorne, Andriy Stasyuk, Adam Roberts, Ian Simon, Cheng-Zhi Anna Huang, Sander Dieleman, Erich Elsen, Jesse Engel, and Douglas Eck. Enabling factorized piano music modeling and generation with the maestro dataset. arXiv preprint arXiv:1810.12247, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018
[21]

Prompt-to-prompt image editing with cross attention control, 2022

Amir Hertz, Ron Mokady, Jay Tenenbaum, Kfir Aberman, Yael Pritch, and Daniel Cohen-Or. Prompt-to-prompt image editing with cross attention control, 2022

2022
[22]

Denoising diffusion probabilistic models

Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in neural information processing systems, 33:6840–6851, 2020

2020
[23]

Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen

Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. Lora: Low-rank adaptation of large language models. In International Conference on Learning Representations, 2022. 12

2022
[24]

Chemeval: A multi-level and fine-grained chemical capability evaluation for large language models, 2026

Yuqing Huang, Rongyang Zhang, Xuesong He, Xuyang Zhi, Hao Wang, Nuo Chen, Liuzongbo, Xin Li, Feiyang Xu, Deguang Liu, Huadong Liang, YiLi, Jian Cui, Yin Xu, Shijin Wang, Guiquan Liu, Qi Liu, Defu Lian, and Enhong Chen. Chemeval: A multi-level and fine-grained chemical capability evaluation for large language models, 2026

2026
[25]

Smartedit: Exploring complex instruction-based image editing with multimodal large language models, 2023

Yuzhou Huang, Liangbin Xie, Xintao Wang, Ziyang Yuan, Xiaodong Cun, Yixiao Ge, Jiantao Zhou, Chao Dong, Rui Huang, Ruimao Zhang, and Ying Shan. Smartedit: Exploring complex instruction-based image editing with multimodal large language models, 2023

2023
[26]

Microbiology: Canadian Edition

Wendy Keenleyside. Microbiology: Canadian Edition. eCampusOntario, Toronto, ON, 2019. First edition; derived from OpenStax Microbiology; licensed under CC BY 4.0, except where otherwise noted

2019
[27]

John W. Kimball. Kimball’s biology pages.https://www.biology-pages.info/, 2025. Online biology textbook/reference distributed under CC BY 3.0

2025
[28]

Vectoredits: A dataset and benchmark for instruction-based editing of vector graphics, 2025

Josef Kuchar, Marek Kadlcik, Michal Spiegel, and Michal stefanik. Vectoredits: A dataset and benchmark for instruction-based editing of vector graphics, 2025

2025
[29]

FLUX.2: Frontier Visual Intelligence.https://bfl.ai/blog/flux-2, 2025

Black Forest Labs. FLUX.2: Frontier Visual Intelligence.https://bfl.ai/blog/flux-2, 2025

2025
[30]

Rdkit: Open-source cheminformatics, 2025

Greg Landrum. Rdkit: Open-source cheminformatics, 2025

2025
[31]

S1-omni-image: A unified model for scientific image understanding, generation, and editing, 2026

Qingxiao Li, Zikai Wang, Qingli Wang, and Nan Xu. S1-omni-image: A unified model for scientific image understanding, generation, and editing, 2026

2026
[32]

Scientific Graphics Program Synthesis via Dual Self-Consistency Reinforcement Learning

Juekai Lin, Yun Zhu, Honglin Lin, Sijing Li, Tianwei Lin, Zheng Liu, Xiaoyang Wang, Wenqiao Zhang, and Lijun Wu. Scientific graphics program synthesis via dual self-consistency reinforcement learning. arXiv preprint arXiv:2604.06079, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[33]

Grade: Benchmarking discipline-informed reasoning in image editing

Mingxin Liu, Ziqian Fan, Zhaokai Wang, Leyao Gu, Zirun Zhu, Yiguo He, Yuchen Yang, Changyao Tian, Xiangyu Zhao, Ning Liao, et al. Grade: Benchmarking discipline-informed reasoning in image editing. arXiv preprint arXiv:2603.12264, 2026

work page arXiv 2026
[34]

Step1X-Edit: A Practical Framework for General Image Editing

Shiyu Liu, Yucheng Han, Peng Xing, Fukun Yin, Rui Wang, Wei Cheng, Jiaqi Liao, Yingming Wang, Honghao Fu, Chunrui Han, et al. Step1x-edit: A practical framework for general image editing. arXiv preprint arXiv:2504.17761, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[35]

Glide: Towards photorealistic image generation and editing with text-guided diffusion models, 2021

Alex Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob McGrew, Ilya Sutskever, and Mark Chen. Glide: Towards photorealistic image generation and editing with text-guided diffusion models, 2021

2021
[36]

Wise: A world knowledge-informed semantic evaluation for text-to-image generation, 2025

Yuwei Niu, Munan Ning, Mengren Zheng, Weiyang Jin, Bin Lin, Peng Jin, Jiaqi Liao, Chaoran Feng, Kunpeng Ning, Bin Zhu, and Li Yuan. Wise: A world knowledge-informed semantic evaluation for text-to-image generation, 2025

2025
[37]

Gpt-image-1

OpenAI. Gpt-image-1. https://openai.com/index/image-generation-api/, 2025

2025
[38]

Gpt-image-1.5

OpenAI. Gpt-image-1.5. https://openai.com/zh-Hans-CN/index/new-chatgpt-images-is-here/, 2025

2025
[39]

Gpt-image-2.https://openai.com/, 2026

OpenAI. Gpt-image-2.https://openai.com/, 2026. AI image generation model

2026
[40]

Pavlov, M

D. Pavlov, M. Rybalkin, B. Karulin, M. Kozhevnikov, A. Savelyev, and A. Churinov. Indigo: universal cheminformatics api. Journal of Cheminformatics, 3(1):P4, 2011

2011
[41]

Bizgen: Advancing article-level visual text rendering for infographics generation, 2025

Yuyang Peng, Shishi Xiao, Keming Wu, Qisheng Liao, Bohan Chen, Kevin Lin, Danqing Huang, Ji Li, and Yuhui Yuan. Bizgen: Advancing article-level visual text rendering for infographics generation, 2025

2025
[42]

Pico-banana-400k: A large-scale dataset for text-guided image editing, 2025

Yusu Qian, Eli Bocek-Rivele, Liangchen Song, Jialing Tong, Yinfei Yang, Jiasen Lu, Wenze Hu, and Zhe Gan. Pico-banana-400k: A large-scale dataset for text-guided image editing, 2025

2025
[43]

Xiao, Katherine M

Zeju Qiu, Weiyang Liu, Haiwen Feng, Zhen Liu, Tim Z. Xiao, Katherine M. Collins, Joshua B. Tenenbaum, Adrian Weller, Michael J. Black, and Bernhard Schölkopf. Can large language models understand symbolic graphics programs? arXiv preprint arXiv:2408.08313, 2024

work page arXiv 2024
[44]

Learning-based methods for comparing sequences, with applications to audio-to-midi alignment and matching

Colin Raffel. Learning-based methods for comparing sequences, with applications to audio-to-midi alignment and matching. Columbia University, 2016

2016
[45]

High-resolution image synthesis with latent diffusion models

Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10684–10695, 2022

2022
[46]

Photorealistic text-to-image diffusion models with deep language understanding

Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily L Denton, Kamyar Ghasemipour, Raphael Gontijo Lopes, Burcu Karagol Ayan, Tim Salimans, et al. Photorealistic text-to-image diffusion models with deep language understanding. Advances in neural information processing systems, 35:36479–36494, 2022. 13

2022
[47]

Nadine Schneider, Nikolaus Stiefl, and Gregory A. Landrum. What’s what: The (nearly) definitive guide to reaction role assignment. Journal of Chemical Information and Modeling, 56(12):2336–2346, 2016

2016
[48]

Laion-5b: An open large-scale dataset for training next generation image-text models, 2022

Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, Patrick Schramowski, Srivatsa Kundurthy, Katherine Crowson, Ludwig Schmidt, Robert Kaczmarczyk, and Jenia Jitsev. Laion-5b: An open large-scale dataset for training next generation image-text models, 2022

2022
[49]

Seedream 4.5.https://seed.bytedance.com/en/seedream4_5/, 2025

Seedream. Seedream 4.5.https://seed.bytedance.com/en/seedream4_5/, 2025

2025
[50]

Seedream 5.0.https://seed.bytedance.com/en/seedream5_0_lite/, 2026

Seedream. Seedream 5.0.https://seed.bytedance.com/en/seedream5_0_lite/, 2026

2026
[51]

Seedream 4.0: Toward Next-generation Multimodal Image Generation

Team Seedream, Yunpeng Chen, Yu Gao, Lixue Gong, Meng Guo, Qiushan Guo, Zhiyao Guo, Xiaoxia Hou, Weilin Huang, Yixuan Huang, et al. Seedream 4.0: Toward next-generation multimodal image generation. arXiv preprint arXiv:2509.20427, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[52]

Mathcanvas: Intrinsic visual chain-of-thought for multimodal mathematical reasoning, 2025

Weikang Shi, Aldrich Yu, Rongyao Fang, Houxing Ren, Ke Wang, Aojun Zhou, Changyao Tian, Xinyu Fu, Yuxuan Hu, Zimu Lu, Linjiang Huang, Si Liu, Rui Liu, and Hongsheng Li. Mathcanvas: Intrinsic visual chain-of-thought for multimodal mathematical reasoning, 2025

2025
[53]

OpenAI GPT-5 System Card

Aaditya Singh, Adam Fry, Adam Perelman, Adam Tart, Adi Ganesh, Ahmed El-Kishky, Aidan McLaughlin, Aiden Low, AJ Ostrow, Akhila Ananthram, et al. Openai gpt-5 system card. arXiv preprint arXiv:2601.03267, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[54]

From pixels to prose: A large dataset of dense image captions, 2024

Vasu Singla, Kaiyu Yue, Sukriti Paul, Reza Shirkavand, Mayuka Jayawardhana, Alireza Ganjdanesh, Heng Huang, Abhinav Bhatele, Gowthami Somepalli, and Tom Goldstein. From pixels to prose: A large dataset of dense image captions, 2024

2024
[55]

Wit: Wikipedia-based image text dataset for multimodal multilingual machine learning, 2021

Krishna Srinivasan, Karthik Raman, Jiecao Chen, Michael Bendersky, and Marc Najork. Wit: Wikipedia-based image text dataset for multimodal multilingual machine learning, 2021

2021
[56]

Generative multimodal models are in-context learners

Quan Sun, Yufeng Cui, Xiaosong Zhang, Fan Zhang, Qiying Yu, Yueze Wang, Yongming Rao, Jingjing Liu, Tiejun Huang, and Xinlong Wang. Generative multimodal models are in-context learners. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14398–14409, 2024

2024
[57]

Chameleon: Mixed-Modal Early-Fusion Foundation Models

Chameleon Team. Chameleon: Mixed-modal early-fusion foundation models, 2024. URL https://arxiv. org/abs/2405.09818, 9(8), 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[58]

GeoGebra, 2026

GeoGebra Team. GeoGebra, 2026. Online; accessed 14-May-2026

2026
[59]

Qwen3-max: Just scale it, September 2025

Qwen Team. Qwen3-max: Just scale it, September 2025

2025
[60]

Internvl-u: Democratizing unified multimodal models for understanding, reasoning, generation and editing

Changyao Tian, Danni Yang, Guanzhou Chen, Erfei Cui, Zhaokai Wang, Yuchen Duan, Penghao Yin, Sitao Chen, Ganlin Yang, Mingxin Liu, et al. Internvl-u: Democratizing unified multimodal models for understanding, reasoning, generation and editing. arXiv preprint arXiv:2603.09877, 2026

work page arXiv 2026
[61]

K12 dataset.https://huggingface.co/datasets/WaltonFuture/K12, 2026

WaltonFuture. K12 dataset.https://huggingface.co/datasets/WaltonFuture/K12, 2026. Accessed: 2026-05-27

2026
[62]

Textatlas5m: A large-scale dataset for dense text image generation, 2025

Alex Jinpeng Wang, Dongxing Mao, Jiawei Zhang, Weiming Han, Zhuobai Dong, Linjie Li, Yiqi Lin, Zhengyuan Yang, Libo Qin, Fuwei Zhang, Lijuan Wang, and Min Li. Textatlas5m: A large-scale dataset for dense text image generation, 2025

2025
[63]

Deepgen 1.0: A lightweight unified multimodal model for advancing image generation and editing

Dianyi Wang, Ruihang Li, Feng Han, Chaofan Ma, Wei Song, Siyuan Wang, Yibin Wang, Yi Xin, Hongjian Liu, Zhixiong Zhang, et al. Deepgen 1.0: A lightweight unified multimodal model for advancing image generation and editing. arXiv preprint arXiv:2602.12205, 2026

work page arXiv 2026
[64]

Internsvg: Towards unified svg tasks with multimodal large language models

Haomin Wang, Jinhui Yin, Qi Wei, Wenguang Zeng, Lixin Gu, Shenglong Ye, Zhangwei Gao, Yaohui Wang, Yanting Zhang, Yuanqi Li, Yanwen Guo, Wenhai Wang, Kai Chen, Yu Qiao, and Hongjie Zhang. Internsvg: Towards unified svg tasks with multimodal large language models. arXiv preprint arXiv:2510.11341, 2025

work page arXiv 2025
[65]

Mv-math: Evaluating multimodal math reasoning in multi-visual contexts

Peijie Wang, Zhong-Zhi Li, Fei Yin, Dekang Ran, and Cheng-Lin Liu. Mv-math: Evaluating multimodal math reasoning in multi-visual contexts. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 19541–19551, 2025

2025
[66]

GenExam: A Multidisciplinary Text-to-Image Exam

Zhaokai Wang, Penghao Yin, Xiangyu Zhao, Changyao Tian, Yu Qiao, Wenhai Wang, Jifeng Dai, and Gen Luo. Genexam: A multidisciplinary text-to-image exam. arXiv preprint arXiv:2509.14232, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[67]

Wang, Evan Montoya, David Munechika, Haoyang Yang, Benjamin Hoover, and Duen Horng Chau

Zijie J. Wang, Evan Montoya, David Munechika, Haoyang Yang, Benjamin Hoover, and Duen Horng Chau. Diffusiondb: A large-scale prompt gallery dataset for text-to-image generative models, 2022

2022
[68] Cong Wei, Zheyang Xiong, Weiming Ren, Xinrun Du, Ge Zhang, and Wenhu Chen. OmniEdit: Building image editing generalist models through specialist supervision, 2024.
[69] David Weininger. SMILES, a chemical language and information system. 1. Introduction to methodology and encoding rules. Journal of Chemical Information and Computer Sciences, 28(1):31–36, 1988.
[70] Chenfei Wu, Jiahao Li, Jingren Zhou, Junyang Lin, Kaiyuan Gao, Kun Yan, Sheng ming Yin, Shuai Bai, Xiao Xu, Yilei Chen, Yuxiang Chen, Zecheng Tang, Zekai Zhang, Zhengyi Wang, An Yang, Bowen Yu, Chen Cheng, Dayiheng Liu, Deqing Li, Hang Zhang, Hao Meng, Hu Wei, Jingyuan Ni, Kai Chen, Kuan Cao, Liang Peng, Lin Qu, Minggang Wu, Peng Wang, Shuting Yu, Tingkun... Qwen-Image technical report, 2025.
[71] Shangda Wu, Xiaobing Li, Feng Yu, and Maosong Sun. TunesFormer: Forming Irish tunes with control codes by bar patching. arXiv preprint arXiv:2301.02884, 2023.
[72] Bin Xia, Yuechen Zhang, Jingyao Li, Chengyao Wang, Yitong Wang, Xinglong Wu, Bei Yu, and Jiaya Jia. DreamOmni: Unified image generation and editing, 2025.
[73] Shitao Xiao, Yueze Wang, Junjie Zhou, Huaying Yuan, Xingrun Xing, Ruiran Yan, Chaofan Li, Shuting Wang, Tiejun Huang, and Zheng Liu. OmniGen: Unified image generation. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 13294–13304, 2025.
[74] Jinheng Xie, Zhenheng Yang, and Mike Zheng Shou. Show-o2: Improved native unified multimodal models. arXiv preprint arXiv:2506.15564, 2025.
[75] Jiahui Yu, Yuanzhong Xu, Jing Yu Koh, Thang Luong, Gunjan Baid, Zirui Wang, Vijay Vasudevan, Alexander Ku, Yinfei Yang, Burcu Karagol Ayan, Ben Hutchinson, Wei Han, Zarana Parekh, Xin Li, Han Zhang, Jason Baldridge, and Yonghui Wu. Scaling autoregressive models for content-rich text-to-image generation, 2022.
[76] Qifan Yu, Wei Chow, Zhongqi Yue, Kaihang Pan, Yang Wu, Xiaoyang Wan, Juncheng Li, Siliang Tang, Hanwang Zhang, and Yueting Zhuang. AnyEdit: Mastering unified high-quality image editing for any idea, 2024.
[77] Christoph Zauner. Implementation and benchmarking of perceptual image hash functions. Master’s thesis, University of Applied Sciences Hagenberg, 2010.
[78] Barbara Zdrazil, Eloy Felix, Fiona Hunter, Emma J Manners, James Blackshaw, Sybilla Corbett, Marleen de Veij, Harris Ioannidis, David Mendez Lopez, Juan F Mosquera, Maria Paula Magarinos, Nicolas Bosc, Ricardo Arcila, Tevfik Kiziloren, Anna Gaulton, A Patricia Bento, Melissa F Adasme, Peter Monecke, Gregory A Landrum, and Andrew R Leach. The ChEMBL database in 2023: a drug discovery platform spanning multiple bioactivity data types and time periods ..., 2023.
[79] Kai Zhang, Lingbo Mo, Wenhu Chen, Huan Sun, and Yu Su. MagicBrush: A manually annotated dataset for instruction-guided image editing, 2023.
[80] Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. Adding conditional control to text-to-image diffusion models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3836–3847, 2023.

