DisciplineGen-1M: A Large-Scale Dataset for Multidisciplinary Visual Generation and Editing
Pith reviewed 2026-07-03 15:37 UTC · model grok-4.3
The pith
DisciplineGen-1M supplies 1.2 million multidisciplinary samples to train image models on accurate diagrams and edits instead of plausible visuals.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
DisciplineGen-1M contains 1.2M samples across mathematics, physics, chemistry, biology, geography, computer science, economics, history, music, and sports. Its construction framework of vector-graphics rendering, OCR-based editing, curated programmatic synthesis, and large-scale text-to-image filtering yields captions, editing instructions, structured annotations, and image pairs with controllable semantic differences. A discipline-informed model trained on the dataset produces substantial gains on GenExam and GRADE while transferring to WISE and RISE, supporting the claim that such data moves generation toward knowledge-grounded creation.
What carries the argument
The scalable construction framework that combines vector-graphics rendering, OCR-based editing, curated programmatic synthesis, and large-scale text-to-image filtering to generate paired images, captions, editing instructions, and structured annotations with controllable semantic differences.
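To make "paired images with controllable semantic differences" concrete, here is a minimal sketch of one way a vector-graphics pipeline could emit such a pair. It is not the paper's code: the `Sample` type, `render_triangle`, `make_edit_pair`, and the triangle example are all hypothetical, chosen only to show that changing one parameter of the source yields a source image, a target image, and an editing instruction that agree by construction.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    svg: str          # vector-graphics source for the image
    caption: str      # text-to-image caption
    annotation: dict  # structured annotation (here: the triangle's vertices)


def render_triangle(vertices):
    """Build an SVG triangle plus a caption and annotation from its vertices."""
    points = " ".join(f"{x},{y}" for x, y in vertices)
    svg = (
        '<svg xmlns="http://www.w3.org/2000/svg" width="200" height="200">'
        f'<polygon points="{points}" fill="none" stroke="black"/>'
        "</svg>"
    )
    caption = f"A triangle with vertices at {', '.join(str(v) for v in vertices)}."
    return Sample(svg=svg, caption=caption, annotation={"vertices": list(vertices)})


def make_edit_pair(vertices, index, new_vertex):
    """Return (source, target, instruction) differing in exactly one vertex."""
    source = render_triangle(vertices)
    edited = list(vertices)
    edited[index] = new_vertex  # the single, controlled semantic difference
    target = render_triangle(edited)
    instruction = f"Move vertex {index + 1} from {vertices[index]} to {new_vertex}."
    return source, target, instruction


# Hypothetical example: move the apex of a triangle and keep the base fixed.
source, target, instruction = make_edit_pair(
    [(20, 180), (180, 180), (100, 40)], 2, (60, 40)
)
print(instruction)  # Move vertex 3 from (100, 40) to (60, 40).
```

Because the instruction is derived from the same parameter change that produces the target, the pair cannot disagree with its own annotation, which is the property the reading above attributes to programmatic synthesis.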
If this is right
- Models trained on the dataset improve performance on discipline-related benchmarks GenExam and GRADE.
- The training approach transfers to gains on general reasoning-informed benchmarks WISE and RISE.
- The dataset directly supports both text-to-image generation and image editing with knowledge-grounded outputs.
- Public release of the dataset, model, and curation pipeline enables reproducibility and extension by others.
Where Pith is reading between the lines
- The controllable differences in paired samples could let researchers isolate which spatial or symbolic features most affect model accuracy.
- Similar curation methods might scale to additional domains such as engineering diagrams or medical illustrations if the same pipeline components are adapted.
- Widespread use of such datasets could reduce reliance on post-hoc correction in educational or scientific visualization tools.
- The emphasis on verifiable correctness over visual appeal alone may encourage new evaluation metrics focused on conceptual fidelity.
Load-bearing premise
The construction pipelines produce samples whose correctness matches disciplinary concepts and whose semantic differences are controllable enough to drive measurable model gains.
What would settle it
Training a model on DisciplineGen-1M and finding no improvement or a decline on the GenExam and GRADE benchmarks relative to open-source baselines trained without the data would falsify the claim of measurable gains from this structured academic visual data.
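The test above can be phrased as a small calculation. The sketch below assumes one benchmark score per run for a baseline and for a model trained on the dataset, paired by run; the function names are hypothetical and the numbers are placeholders for illustration, not results from the paper.

```python
from statistics import mean, stdev


def score_delta(baseline_scores, trained_scores):
    """Return (mean delta, sample std of the paired deltas) across runs."""
    if len(baseline_scores) != len(trained_scores):
        raise ValueError("need one trained score per baseline score")
    deltas = [t - b for b, t in zip(baseline_scores, trained_scores)]
    spread = stdev(deltas) if len(deltas) > 1 else 0.0
    return mean(deltas), spread


def claim_survives(baseline_scores, trained_scores):
    """The gain claim is falsified if the mean delta is zero or negative."""
    delta, _ = score_delta(baseline_scores, trained_scores)
    return delta > 0


# Placeholder numbers for illustration only; they are not results from the paper.
baseline = [10.0, 12.0, 11.0]
trained = [14.0, 15.0, 16.0]
delta, spread = score_delta(baseline, trained)
print(f"mean delta = {delta:.2f}, std = {spread:.2f}")
```

A positive mean delta alone is a weak criterion; comparing it with the spread across runs is what separates a measurable gain from run-to-run noise.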
Original abstract
Recent image generation and editing models can produce visually appealing natural images, yet they remain unreliable when the target image is a knowledge-intensive diagram whose correctness depends on disciplinary concepts, symbolic structure, and precise spatial relations. We introduce DisciplineGen-1M, a million-scale multidisciplinary dataset that supports text-to-image generation and image editing. It contains 1.2M samples spanning mathematics, physics, chemistry, biology, geography, computer science, economics, history, music, and sports. To construct the dataset, we design a scalable framework that combines vector-graphics rendering, OCR-based editing, curated programmatic synthesis, and large-scale text-to-image filtering. These pipelines produce captions, editing instructions, structured annotations, and paired images with controllable semantic differences. Building on DisciplineGen-1M, we further introduce a discipline-informed reasoning-generation model for both text-to-image generation and image editing. Experiments on discipline-related benchmarks, GenExam and GRADE, show substantial improvements over open-source baselines, while evaluations on general reasoning-informed benchmarks, WISE and RISE, further indicate broader transfer. The results suggest that large-scale structured academic visual data is a key ingredient for moving image generation from aesthetic plausibility toward verifiable knowledge-grounded visual creation. We will publicly release our dataset, model, and source code of the data curation pipeline to ensure reproducibility and benefit future research.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces DisciplineGen-1M, a 1.2M-sample multidisciplinary dataset spanning 10 fields for text-to-image generation and editing. It is built via vector-graphics rendering, OCR-based editing, curated programmatic synthesis, and large-scale T2I filtering, and is paired with a discipline-informed reasoning-generation model. Experiments claim substantial gains on GenExam and GRADE plus transfer to WISE and RISE, supporting the thesis that large-scale structured academic visual data enables verifiable knowledge-grounded image creation.
Significance. If the dataset samples are verifiably correct and the benchmark gains are reproducible, the work would supply a concrete resource and modeling approach for shifting image generation toward disciplinary fidelity rather than mere visual plausibility.
major comments (2)
- [Abstract] Abstract: the assertion of 'substantial improvements' on GenExam and GRADE is unsupported by any quantitative numbers, error bars, or factual-correctness verification in the provided text; without these the central claim that the dataset drives 'verifiable knowledge-grounded' gains cannot be evaluated.
- [Construction framework] Construction framework (T2I filtering pipeline): the manuscript acknowledges that existing T2I models 'remain unreliable when the target image is a knowledge-intensive diagram whose correctness depends on disciplinary concepts,' yet describes no post-filter verification, expert review, or automated symbolic/spatial checks; if a non-negligible fraction of the 1.2M samples originates from this step, measured benchmark gains cannot be attributed to the claimed correctness property.
minor comments (2)
- Clarify the exact fraction of samples produced by each pipeline (vector-graphics, OCR, programmatic, T2I) so readers can assess the weight of the unverified filtering component.
- Add explicit citations and brief descriptions for GenExam, GRADE, WISE, and RISE in the abstract and introduction.
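The per-pipeline breakdown requested in the first minor comment is cheap to compute if each sample records its provenance. The following sketch assumes a manifest in which every record carries a `pipeline` field; that field name and the pipeline labels are hypothetical, since the released data format is not described in the text reviewed here.

```python
from collections import Counter


def pipeline_fractions(manifest):
    """Map each pipeline name to its share of the samples in the manifest."""
    counts = Counter(record["pipeline"] for record in manifest)
    total = sum(counts.values())
    if total == 0:
        return {}  # an empty manifest has no shares to report
    return {name: count / total for name, count in counts.items()}


# Toy manifest; the field name "pipeline" and the labels are hypothetical.
manifest = [
    {"id": 1, "pipeline": "vector_graphics"},
    {"id": 2, "pipeline": "vector_graphics"},
    {"id": 3, "pipeline": "ocr_editing"},
    {"id": 4, "pipeline": "t2i_filtering"},
]
fractions = pipeline_fractions(manifest)
for name, share in fractions.items():
    print(f"{name}: {share:.0%}")
```

Reporting these shares alongside the benchmark numbers would let readers judge how much of the training signal comes from the filtering step that lacks described verification.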
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on the abstract and data construction pipeline. We address each major comment below and indicate the revisions planned for the next version of the manuscript.
Point-by-point responses
- Referee: [Abstract] Abstract: the assertion of 'substantial improvements' on GenExam and GRADE is unsupported by any quantitative numbers, error bars, or factual-correctness verification in the provided text; without these the central claim that the dataset drives 'verifiable knowledge-grounded' gains cannot be evaluated.
  Authors: We agree that the abstract should include quantitative support. The submitted abstract summarized the experimental outcomes without specific metrics. In the revised manuscript we will expand the abstract to report the concrete improvements (e.g., accuracy deltas and standard deviations) on GenExam and GRADE relative to the open-source baselines, thereby grounding the claim of substantial gains.
  Revision: yes
- Referee: [Construction framework] Construction framework (T2I filtering pipeline): the manuscript acknowledges that existing T2I models 'remain unreliable when the target image is a knowledge-intensive diagram whose correctness depends on disciplinary concepts,' yet describes no post-filter verification, expert review, or automated symbolic/spatial checks; if a non-negligible fraction of the 1.2M samples originates from this step, measured benchmark gains cannot be attributed to the claimed correctness property.
  Authors: The concern is valid. The T2I filtering step is one of four construction pipelines and the manuscript does not detail post-filter verification procedures. We will revise the construction section to (i) report the approximate fraction of samples produced by each pipeline, (ii) describe any automated spatial or symbolic consistency checks that were applied during filtering, and (iii) add an explicit limitations paragraph discussing the residual risk that some T2I-generated samples may contain factual inaccuracies. The benchmark gains are measured on held-out discipline-specific tests; we will clarify that these results provide empirical support but do not constitute exhaustive verification of every sample.
  Revision: partial
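As an illustration of what an "automated spatial or symbolic consistency check" could look like, the sketch below verifies that the angle labels in a structured annotation agree with the geometry actually drawn. It is an assumption about one possible check, not a description of the authors' pipeline; the annotation keys `vertices` and `angle_labels`, the function names, and the 0.5 degree tolerance are all hypothetical.

```python
import math


def interior_angles(vertices):
    """Interior angles (degrees) of a triangle given three (x, y) vertices."""
    angles = []
    for i in range(3):
        ax, ay = vertices[i]
        bx, by = vertices[(i + 1) % 3]
        cx, cy = vertices[(i + 2) % 3]
        v1 = (bx - ax, by - ay)
        v2 = (cx - ax, cy - ay)
        dot = v1[0] * v2[0] + v1[1] * v2[1]
        cross = v1[0] * v2[1] - v1[1] * v2[0]
        # atan2 of |cross| and dot gives the unsigned angle between v1 and v2.
        angles.append(math.degrees(math.atan2(abs(cross), dot)))
    return angles


def labels_consistent(annotation, tolerance=0.5):
    """True if the labelled angles match the drawn geometry within tolerance."""
    labelled = annotation["angle_labels"]
    if len(labelled) != 3:
        return False
    computed = interior_angles(annotation["vertices"])
    return all(abs(c - l) <= tolerance for c, l in zip(computed, labelled))


# Hypothetical annotations for a 3-4-5 right triangle.
good = {"vertices": [(0, 0), (4, 0), (0, 3)], "angle_labels": [90.0, 36.87, 53.13]}
bad = {"vertices": [(0, 0), (4, 0), (0, 3)], "angle_labels": [90.0, 45.0, 45.0]}
print(labels_consistent(good), labels_consistent(bad))  # True False
```

A check of this kind only applies where a structured annotation exists, which is why the referee's concern is sharpest for samples that come from text-to-image filtering rather than programmatic rendering.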
Circularity Check
No circularity: dataset construction and benchmark results are independent of self-referential definitions or fits
The paper describes a data curation pipeline (vector-graphics rendering, OCR editing, programmatic synthesis, T2I filtering) and reports gains on GenExam/GRADE and WISE/RISE benchmarks. No equations, fitted parameters, predictions, or uniqueness theorems appear. The central claim that structured academic data enables knowledge-grounded generation rests on external benchmark measurements rather than any reduction to the paper's own inputs by construction. No self-citations are invoked as load-bearing premises. This is a standard non-circular empirical contribution.