arxiv: 2607.02290 · v1 · pith:ZNQ5SZBEnew · submitted 2026-07-02 · 💻 cs.CV

DisciplineGen-1M: A Large-Scale Dataset for Multidisciplinary Visual Generation and Editing

Pith reviewed 2026-07-03 15:37 UTC · model grok-4.3

classification 💻 cs.CV
keywords DisciplineGen-1M · multidisciplinary dataset · visual generation · image editing · knowledge-grounded generation · text-to-image · academic diagrams · structured annotations

The pith

DisciplineGen-1M supplies 1.2 million multidisciplinary samples for training image models to produce accurate diagrams and edits rather than merely plausible visuals.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents a million-scale dataset drawn from ten academic fields to address the failure of current image generators on knowledge-intensive content. It claims that structured visual data tied to disciplinary concepts, symbolic structures, and precise spatial relations can shift generation toward verifiable correctness. The authors construct the samples through a combination of vector-graphics rendering, OCR-based editing, programmatic synthesis, and text-to-image filtering pipelines, which also supply captions, editing instructions, and controllable paired images. They train a reasoning-generation model on the data and report gains on discipline-specific benchmarks plus transfer to broader reasoning tests. The work positions large structured academic visuals as essential for moving beyond aesthetic outputs.

Core claim

DisciplineGen-1M contains 1.2M samples across mathematics, physics, chemistry, biology, geography, computer science, economics, history, music, and sports. Its construction framework of vector-graphics rendering, OCR-based editing, curated programmatic synthesis, and large-scale text-to-image filtering yields captions, editing instructions, structured annotations, and image pairs with controllable semantic differences. A discipline-informed model trained on the dataset produces substantial gains on GenExam and GRADE while transferring to WISE and RISE, supporting the claim that such data moves generation toward knowledge-grounded creation.

What carries the argument

The scalable construction framework that combines vector-graphics rendering, OCR-based editing, curated programmatic synthesis, and large-scale text-to-image filtering to generate paired images, captions, editing instructions, and structured annotations with controllable semantic differences.
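
As a rough illustration of what "paired images with controllable semantic differences" could look like in a programmatic-synthesis pipeline, the sketch below renders two SVG diagrams that differ in exactly one annotated attribute and derives the editing instruction from that difference. Everything in it is a hypothetical stand-in: the function names `render_triangle` and `make_pair`, the annotation schema, and the triangle example are assumptions for illustration, not the authors' pipeline, which the abstract describes only at a high level.

```python
import copy

def render_triangle(ann):
    """Render a labelled triangle as an SVG string from a structured annotation."""
    pts = " ".join(f"{x},{y}" for x, y in ann["vertices"])
    labels = "".join(
        f'<text x="{x}" y="{y - 5}">{name}</text>'
        for name, (x, y) in zip(ann["labels"], ann["vertices"])
    )
    return (
        '<svg xmlns="http://www.w3.org/2000/svg" width="200" height="200">'
        f'<polygon points="{pts}" fill="{ann["fill"]}" stroke="black"/>'
        f"{labels}</svg>"
    )

def make_pair(ann, field, new_value):
    """Return (source_svg, target_svg, instruction, target_ann) with one field changed."""
    target = copy.deepcopy(ann)  # leave the source annotation untouched
    target[field] = new_value
    instruction = f"Change {field} from {ann[field]!r} to {new_value!r}."
    return render_triangle(ann), render_triangle(target), instruction, target

# Hypothetical annotation schema, chosen only for this example.
source_ann = {
    "vertices": [(20, 180), (180, 180), (100, 40)],
    "labels": ["A", "B", "C"],
    "fill": "none",
}
src_svg, tgt_svg, instruction, target_ann = make_pair(source_ann, "fill", "lightblue")
print(instruction)
```

Because both images are rendered from annotations rather than edited as pixels, the semantic difference between them is known exactly, which is the property the paper's framing depends on.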

If this is right

  • Models trained on the dataset improve performance on discipline-related benchmarks GenExam and GRADE.
  • The training approach transfers to gains on general reasoning-informed benchmarks WISE and RISE.
  • The dataset directly supports both text-to-image generation and image editing with knowledge-grounded outputs.
  • Public release of the dataset, model, and curation pipeline enables reproducibility and extension by others.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The controllable differences in paired samples could let researchers isolate which spatial or symbolic features most affect model accuracy.
  • Similar curation methods might scale to additional domains such as engineering diagrams or medical illustrations if the same pipeline components are adapted.
  • Widespread use of such datasets could reduce reliance on post-hoc correction in educational or scientific visualization tools.
  • The emphasis on verifiable correctness over visual appeal alone may encourage new evaluation metrics focused on conceptual fidelity.

Load-bearing premise

The construction pipelines produce samples whose correctness matches disciplinary concepts and whose semantic differences are controllable enough to drive measurable model gains.

What would settle it

Training a model on DisciplineGen-1M and finding no improvement, or a decline, on GenExam and GRADE relative to open-source baselines trained without the data would falsify the claim that this structured academic visual data produces measurable gains.
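
One way to make this test concrete is a paired comparison on the same benchmark items, with a bootstrap interval around the mean score difference. The sketch below is a minimal version under stated assumptions: the function name `paired_bootstrap_delta` is hypothetical, per-item scores are assumed to be available for both models, and the numbers are invented placeholders rather than results from the paper.

```python
import random

def paired_bootstrap_delta(scores_with, scores_without, n_resamples=2000, seed=0):
    """Mean per-item score difference and a 95% percentile bootstrap interval."""
    assert len(scores_with) == len(scores_without)
    deltas = [a - b for a, b in zip(scores_with, scores_without)]
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    n = len(deltas)
    means = []
    for _ in range(n_resamples):
        sample = [deltas[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    lo = means[int(0.025 * n_resamples)]
    hi = means[int(0.975 * n_resamples) - 1]
    return sum(deltas) / n, lo, hi

# Placeholder per-item correctness scores (1 = correct, 0 = incorrect).
# These values are invented for illustration and are not results from the paper.
with_data = [1, 1, 0, 1, 1, 0, 1, 1]
without_data = [0, 1, 0, 0, 1, 0, 1, 0]
mean_delta, ci_low, ci_high = paired_bootstrap_delta(with_data, without_data)
print(f"mean delta = {mean_delta:.3f}, 95% CI = [{ci_low:.3f}, {ci_high:.3f}]")
```

Under this reading, an interval that sits at or below zero on GenExam and GRADE would count against the claim, while an interval clearly above zero would support it.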

Original abstract

Recent image generation and editing models can produce visually appealing natural images, yet they remain unreliable when the target image is a knowledge-intensive diagram whose correctness depends on disciplinary concepts, symbolic structure, and precise spatial relations. We introduce DisciplineGen-1M, a million-scale multidisciplinary dataset that supports text-to-image generation and image editing. It contains 1.2M samples spanning mathematics, physics, chemistry, biology, geography, computer science, economics, history, music, and sports. To construct the dataset, we design a scalable framework that combines vector-graphics rendering, OCR-based editing, curated programmatic synthesis, and large-scale text-to-image filtering. These pipelines produce captions, editing instructions, structured annotations, and paired images with controllable semantic differences. Building on DisciplineGen-1M, we further introduce a discipline-informed reasoning-generation model for both text-to-image generation and image editing. Experiments on discipline-related benchmarks, GenExam and GRADE, show substantial improvements over open-source baselines, while evaluations on general reasoning-informed benchmarks, WISE and RISE, further indicate broader transfer. The results suggest that large-scale structured academic visual data is a key ingredient for moving image generation from aesthetic plausibility toward verifiable knowledge-grounded visual creation. We will publicly release our dataset, model, and source code of the data curation pipeline to ensure reproducibility and benefit future research.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces DisciplineGen-1M, a 1.2M-sample multidisciplinary dataset spanning 10 fields for text-to-image generation and editing. It is built via vector-graphics rendering, OCR-based editing, curated programmatic synthesis, and large-scale T2I filtering, and is paired with a discipline-informed reasoning-generation model. Experiments claim substantial gains on GenExam and GRADE plus transfer to WISE and RISE, supporting the thesis that large-scale structured academic visual data enables verifiable knowledge-grounded image creation.

Significance. If the dataset samples are verifiably correct and the benchmark gains are reproducible, the work would supply a concrete resource and modeling approach for shifting image generation toward disciplinary fidelity rather than mere visual plausibility.

major comments (2)
  1. [Abstract] Abstract: the assertion of 'substantial improvements' on GenExam and GRADE is unsupported by any quantitative numbers, error bars, or factual-correctness verification in the provided text; without these the central claim that the dataset drives 'verifiable knowledge-grounded' gains cannot be evaluated.
  2. [Construction framework] Construction framework (T2I filtering pipeline): the manuscript acknowledges that existing T2I models 'remain unreliable when the target image is a knowledge-intensive diagram whose correctness depends on disciplinary concepts,' yet describes no post-filter verification, expert review, or automated symbolic/spatial checks; if a non-negligible fraction of the 1.2M samples originates from this step, measured benchmark gains cannot be attributed to the claimed correctness property.
minor comments (2)
  1. Clarify the exact fraction of samples produced by each pipeline (vector-graphics, OCR, programmatic, T2I) so readers can assess the weight of the unverified filtering component.
  2. Add explicit citations and brief descriptions for GenExam, GRADE, WISE, and RISE in the abstract and introduction.
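
The per-pipeline breakdown requested in minor comment 1 is cheap to produce if each sample records its provenance. The sketch below assumes a hypothetical manifest in which every record carries a `pipeline` field; the field name, the pipeline labels, and the five toy records are illustrative assumptions and say nothing about the real composition of the dataset.

```python
from collections import Counter

def pipeline_fractions(samples):
    """Return the share of samples contributed by each construction pipeline."""
    counts = Counter(s["pipeline"] for s in samples)
    total = sum(counts.values())
    return {name: count / total for name, count in counts.items()}

# Toy manifest with made-up records; not the dataset's actual composition.
manifest = [
    {"id": "s1", "pipeline": "vector_graphics"},
    {"id": "s2", "pipeline": "ocr_editing"},
    {"id": "s3", "pipeline": "programmatic"},
    {"id": "s4", "pipeline": "t2i_filtering"},
    {"id": "s5", "pipeline": "t2i_filtering"},
]
fractions = pipeline_fractions(manifest)
for name, share in sorted(fractions.items()):
    print(f"{name}: {share:.1%}")
```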

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on the abstract and data construction pipeline. We address each major comment below and indicate the revisions planned for the next version of the manuscript.

Point-by-point responses
  1. Referee: [Abstract] Abstract: the assertion of 'substantial improvements' on GenExam and GRADE is unsupported by any quantitative numbers, error bars, or factual-correctness verification in the provided text; without these the central claim that the dataset drives 'verifiable knowledge-grounded' gains cannot be evaluated.

    Authors: We agree that the abstract should include quantitative support. The submitted abstract summarized the experimental outcomes without specific metrics. In the revised manuscript we will expand the abstract to report the concrete improvements (e.g., accuracy deltas and standard deviations) on GenExam and GRADE relative to the open-source baselines, thereby grounding the claim of substantial gains. revision: yes

  2. Referee: [Construction framework] Construction framework (T2I filtering pipeline): the manuscript acknowledges that existing T2I models 'remain unreliable when the target image is a knowledge-intensive diagram whose correctness depends on disciplinary concepts,' yet describes no post-filter verification, expert review, or automated symbolic/spatial checks; if a non-negligible fraction of the 1.2M samples originates from this step, measured benchmark gains cannot be attributed to the claimed correctness property.

    Authors: The concern is valid. The T2I filtering step is one of four construction pipelines and the manuscript does not detail post-filter verification procedures. We will revise the construction section to (i) report the approximate fraction of samples produced by each pipeline, (ii) describe any automated spatial or symbolic consistency checks that were applied during filtering, and (iii) add an explicit limitations paragraph discussing the residual risk that some T2I-generated samples may contain factual inaccuracies. The benchmark gains are measured on held-out discipline-specific tests; we will clarify that these results provide empirical support but do not constitute exhaustive verification of every sample. revision: partial
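
The "automated spatial or symbolic consistency checks" mentioned in the rebuttal are not specified in the text. As one possible shape for such a check, the sketch below verifies that a source/target annotation pair differs only in the fields the editing instruction declares, and that all annotated vertices lie inside the canvas. The function names `changed_fields` and `check_pair` and the annotation schema are assumptions made for this example, not the authors' procedure.

```python
def changed_fields(source_ann, target_ann):
    """List the annotation keys whose values differ between source and target."""
    keys = set(source_ann) | set(target_ann)
    return sorted(k for k in keys if source_ann.get(k) != target_ann.get(k))

def check_pair(source_ann, target_ann, declared_fields, width, height):
    """Return a list of human-readable problems; an empty list means the pair passes."""
    problems = []
    actual = changed_fields(source_ann, target_ann)
    if actual != sorted(declared_fields):
        problems.append(
            f"declared edit {sorted(declared_fields)} but annotations differ in {actual}"
        )
    for ann_name, ann in (("source", source_ann), ("target", target_ann)):
        for x, y in ann.get("vertices", []):
            if not (0 <= x <= width and 0 <= y <= height):
                problems.append(f"{ann_name} vertex ({x}, {y}) lies outside the canvas")
    return problems

# Hypothetical annotations for a 200 x 200 diagram.
source = {"vertices": [(20, 180), (180, 180), (100, 40)], "fill": "none"}
good_target = {"vertices": [(20, 180), (180, 180), (100, 40)], "fill": "lightblue"}
bad_target = {"vertices": [(20, 180), (250, 180), (100, 40)], "fill": "lightblue"}

ok_problems = check_pair(source, good_target, ["fill"], 200, 200)
bad_problems = check_pair(source, bad_target, ["fill"], 200, 200)
print(ok_problems)
print(bad_problems)
```

A check of this kind only applies where structured annotations exist, so it would cover rendered and programmatically synthesized samples more naturally than samples that come out of text-to-image filtering, which is the step the referee singles out.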

Circularity Check

0 steps flagged

No circularity: dataset construction and benchmark results are independent of self-referential definitions or fits

Full rationale

The paper describes a data curation pipeline (vector-graphics rendering, OCR editing, programmatic synthesis, T2I filtering) and reports gains on GenExam/GRADE and WISE/RISE benchmarks. No equations, fitted parameters, predictions, or uniqueness theorems appear. The central claim that structured academic data enables knowledge-grounded generation rests on external benchmark measurements rather than any reduction to the paper's own inputs by construction. No self-citations are invoked as load-bearing premises. This is a standard non-circular empirical contribution.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides no explicit free parameters, axioms, or invented entities; all construction details are described at the level of high-level pipelines without numerical fitting or new postulated objects.


