GenExam: A Multidisciplinary Text-to-Image Exam
Pith reviewed 2026-05-18 15:46 UTC · model grok-4.3
The pith
GenExam introduces a benchmark of 1,000 multidisciplinary exam prompts to test text-to-image models on integrated understanding, reasoning and generation.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
GenExam is introduced as a benchmark consisting of 1,000 exam-style prompts across 10 subjects under a four-level taxonomy, each paired with ground-truth images and fine-grained scoring points. Testing 17 text-to-image and unified models reveals the benchmark's difficulty and a significant gap where open-source models lag behind leading closed-source ones. The benchmark assesses the integration of understanding, reasoning, and generation in image synthesis.
What carries the argument
The GenExam benchmark, organized by a four-level taxonomy of exam-style prompts together with 1,000 samples, ground-truth images, and fine-grained scoring points that jointly evaluate semantic correctness and visual plausibility.
If this is right
- Text-to-image models must improve their ability to combine conceptual understanding with step-by-step reasoning before producing images.
- Open-source models require targeted advances to close the observed gap on expert-level generation tasks.
- GenExam supplies a standard for measuring whether generative systems are approaching integrated expert intelligence.
- Existing image-generation benchmarks miss the rigorous evaluation of drawing exams that require precise semantic and visual fidelity.
Where Pith is reading between the lines
- The taxonomy could be reused to generate synthetic training data that forces models to produce explicit reasoning traces before rendering an image.
- Similar exam-style benchmarks may be constructed for other generative domains such as video synthesis or scientific diagram creation.
- The performance difference suggests that future scaling efforts should incorporate explicit reasoning objectives rather than pure visual fidelity alone.
- Human validation studies on the scoring points could be run on a held-out subset to quantify how well the automated metric aligns with expert judgment.
Load-bearing premise
The assumption that the four-level taxonomy, 1,000 curated samples, and fine-grained scoring points together provide a precise and representative measure of semantic correctness and visual plausibility for expert-level image generation tasks.
What would settle it
An open-source model that matches or exceeds the scores of the leading closed-source models on the full set of GenExam prompts would falsify the reported performance gap.
Figures
read the original abstract
Exams are a fundamental test of expert-level intelligence and require integrated understanding, reasoning, and generation. Existing exam-style benchmarks mainly focus on understanding and reasoning tasks, and current generation benchmarks emphasize the illustration of world knowledge and visual concepts, neglecting the evaluation of rigorous drawing exams. We introduce GenExam, the first benchmark for multidisciplinary text-to-image exams, featuring 1,000 samples across 10 subjects with exam-style prompts organized under a four-level taxonomy. Each problem is equipped with ground-truth images and fine-grained scoring points to enable a precise evaluation of semantic correctness and visual plausibility. Experiments on 17 text-to-image and unified models demonstrate the great challenge of GenExam and the huge gap where open-source models consistently lag behind the leading closed-source ones. By framing image generation as an exam, GenExam offers a rigorous assessment of models' ability to integrate understanding, reasoning, and generation, providing insights for on the path to intelligent generative models. Our benchmark and evaluation code are released at https://github.com/OpenGVLab/GenExam.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces GenExam, the first benchmark for multidisciplinary text-to-image exams. It features 1,000 curated samples across 10 subjects, organized under a four-level taxonomy, with each problem supplied with ground-truth images and fine-grained scoring points to assess semantic correctness and visual plausibility. Experiments on 17 text-to-image and unified models show that the benchmark is challenging and reveal a consistent performance gap in which open-source models lag behind leading closed-source models. The work frames image generation as an exam to evaluate integrated understanding, reasoning, and generation capabilities.
Significance. If the scoring methodology can be shown to be reliable and reproducible, GenExam would provide a useful new tool for evaluating expert-level generative capabilities that go beyond existing world-knowledge illustration benchmarks. The public release of the benchmark, ground-truth images, scoring points, and evaluation code is a clear strength that supports reproducibility and future extensions by the community.
major comments (2)
- [§3] §3 (Benchmark Construction): The central claim that GenExam enables 'precise evaluation of semantic correctness and visual plausibility' rests on the fine-grained scoring points, yet the manuscript supplies no information on how these points were authored, aligned to the four-level taxonomy, subjected to expert review, or pilot-tested. No inter-rater reliability statistics are reported. This directly undermines the reliability of the 17-model comparison.
- [§4] §4 (Experiments): The reported performance gap between open- and closed-source models is presented as evidence of GenExam's rigor, but without validation of the scoring rubric (e.g., agreement metrics or sensitivity analysis), it remains possible that differences in how 'visual plausibility' is interpreted across the 10 subjects drive the observed results rather than genuine capability differences.
minor comments (2)
- [Abstract] The abstract contains a minor grammatical issue ('insights for on the path').
- [Figure 1] Ensure that any figures illustrating the four-level taxonomy include concrete examples from multiple subjects to improve clarity for readers unfamiliar with the taxonomy.
Simulated Author's Rebuttal
We thank the referee for the thoughtful and constructive comments on our manuscript. We have carefully reviewed the concerns regarding the details of scoring point creation and the validation of the evaluation methodology. Below we respond point by point to the major comments. Where appropriate, we indicate revisions that will be incorporated into the next version of the paper.
read point-by-point responses
-
Referee: [§3] §3 (Benchmark Construction): The central claim that GenExam enables 'precise evaluation of semantic correctness and visual plausibility' rests on the fine-grained scoring points, yet the manuscript supplies no information on how these points were authored, aligned to the four-level taxonomy, subjected to expert review, or pilot-tested. No inter-rater reliability statistics are reported. This directly undermines the reliability of the 17-model comparison.
Authors: We acknowledge that the original manuscript provides insufficient detail on the development of the fine-grained scoring points. The points were created by the core author team in consultation with domain experts for each of the 10 subjects, with explicit mapping to the four-level taxonomy: level-1 and level-2 criteria focus on semantic correctness (object presence, attribute accuracy, relational structure), while level-3 and level-4 criteria address visual plausibility (style consistency, spatial layout, and absence of artifacts). A small-scale pilot was conducted internally on 50 samples to refine the rubric before full annotation. In the revised manuscript we will expand §3 with a dedicated subsection describing this process, including the expert consultation steps and pilot outcomes. We did not collect formal inter-rater reliability statistics during the initial construction; therefore we cannot report Cohen’s kappa or similar metrics at this time. revision: partial
-
Referee: [§4] §4 (Experiments): The reported performance gap between open- and closed-source models is presented as evidence of GenExam's rigor, but without validation of the scoring rubric (e.g., agreement metrics or sensitivity analysis), it remains possible that differences in how 'visual plausibility' is interpreted across the 10 subjects drive the observed results rather than genuine capability differences.
Authors: We agree that stronger validation of the rubric would increase confidence in the reported gap. The scoring rubric was applied uniformly by the same annotation team across all subjects, and the performance ordering remains consistent when results are broken down by individual subject. Nevertheless, we accept that without sensitivity analysis it is difficult to fully rule out rubric-interpretation effects. In the revision we will add a sensitivity study that re-scores a subset of generations under varied weightings of semantic versus visual components and show that the open- versus closed-source gap persists. We maintain that the gap primarily reflects capability differences, but we will explicitly note the limitation of the current validation in the revised text. revision: partial
- Formal inter-rater reliability statistics (e.g., Cohen’s kappa) were not collected during benchmark construction and therefore cannot be reported without additional annotation effort.
Circularity Check
No circularity: empirical benchmark with no derivations or self-referential predictions
full rationale
This is an empirical benchmark paper that introduces GenExam with 1,000 curated samples, a four-level taxonomy, ground-truth images, and fine-grained scoring points, then reports performance of 17 models. The abstract and provided text contain no equations, no claimed first-principles derivations, no predictions that reduce to fitted parameters, and no load-bearing self-citations that justify uniqueness or ansatzes. The central claim (open-source models lag closed-source ones on integrated understanding/reasoning/generation) rests on direct experimental comparison rather than any reduction to the benchmark's own inputs by construction. No steps satisfy the enumerated circularity patterns.
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.leanwashburn_uniqueness_aczel unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
We introduce GenExam, the first benchmark for multidisciplinary text-to-image exams, featuring 1,000 samples across 10 subjects with exam-style prompts organized under a four-level taxonomy. Each problem is equipped with ground-truth images and fine-grained scoring points
-
IndisputableMonolith/Foundation/RealityFromDistinction.leanreality_from_one_distinction unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
Experiments on 17 text-to-image and unified models demonstrate the great challenge of GenExam
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 3 Pith papers
-
MULTITEXTEDIT: Benchmarking Cross-Lingual Degradation in Text-in-Image Editing
MULTITEXTEDIT benchmark reveals that all tested text-in-image editing models show pronounced degradation on non-English languages, especially Hebrew and Arabic, mainly in text accuracy and script fidelity.
-
PlanViz: Evaluating Planning-Oriented Image Generation and Editing for Computer-Use Tasks
PlanViz is a new benchmark with three sub-tasks and PlanScore metric to evaluate planning-oriented image generation and editing by unified multimodal models for computer-use tasks.
-
How RL Unlocks the Aha Moment in Geometric Interleaved Reasoning
Reinforcement learning with three causal constraints enables multimodal models to internalize diagram-reasoning links in geometry, unlike SFT which only mimics surface format and harms performance.
Reference graph
Works this paper leans on
-
[1]
graphs, diagrams, plots, icons, geometry, etc.), it is 0
Text Richness Score (range: 0-10): If the image content is pure table/text without any other content (e.g. graphs, diagrams, plots, icons, geometry, etc.), it is 0. If the image does not contain any text, it is 10. For other cases, the score is based on the complexity of the image content
-
[2]
Image Domain (range: 0-1): If the image is a natural image (i.e., real image) / pathological image (also including body scans, MRI, CT scans, and X-rays), it is 0, otherwise it is 1
-
[3]
Image Complexity (range: 1-10): Whether the image is too complex to draw (consider the number and complexity of compo- nents,objects, text, etc.), very complex is 1, very simple is 10
-
[4]
Subject Knowledge (range: 1-10): Whether the model needs **subject knowledge and reasoning** to generate this image (note: the prompt is not a complete description of the image, it is related to subject knowledge), no need is 1, need a lot is 10. ## Instructions For each domain, provide: - **Score**: An integer, based on the range defined above. - **Confi...
-
[5]
The prompt should **NOT** be a complete description of the image. It should be in a style similar to an **academic question** that **requires some subject knowledge and reasoning** to generate the image. If a model does not have such subject knowledge, it is expected to fail to generate the correct image
-
[6]
The prompt should be precise and concise (less than 200 words) and should be in English
-
[7]
The provided conversation and text in the image can sometimes help you to generate the prompt, but you should consider them carefully since they may contain some irrelevant information. ## Scoring Points
-
[8]
The scoring points are in the form of a list of questions and their corresponding scores, answered by “Yes” or “No” only. They are designed so that by answering the questions, we can evaluate whether the model can generate the correct image corresponding to the prompt. It is expected that the answers are all “Yes” if the model generates the correct image
-
[9]
The number of scoring points should be no more than 20. If the image is simple and requires little subject knowledge and reasoning, the number of scoring points can be small
-
[10]
The questions should be based on the prompt instead of the image, i.e. since the model is required to generate the image based on the prompt, the questions should not focus on information or components that are in the ground truth image but not in the prompt
-
[11]
The questions should not be too general. It should provide **details** rather than “Is xxx correct?”, e.g. the structure of a molecule (bonds, position of atoms, etc.), characteristics that a curve should satisfy, etc. It should not be too complex as you should break it down into several sub-questions
-
[12]
The score of each scoring point should be a number between 0 and 1, indicating the proportion of this question in the total score of the image.**The sum of all scores should be 1**
-
[13]
checking the overall structure of the image) should be small, such as 0.1 or 0.2
Typically, the score of the most basic question (e.g. checking the overall structure of the image) should be small, such as 0.1 or 0.2. For complex questions, questions requiring subject knowledge, or questions about details, the score should be large. ## Examples
-
[14]
Prompt: “Generate a planar molecular structure diagram of benzene (C 6H6), showing its conjugated double bond structure, and distinguish carbon atoms and hydrogen atoms with different colors.”, Scoring points: [{{“question”: “Is the number of carbon atoms 6?”, “score”: 0.2}},{{“question”: “Is the number of hydrogen atoms 6?”, “score”: 0.2}},{{“question”: ...
-
[15]
Generate the graph of the functiony=e x
Prompt: “Generate the graph of the functiony=e x.”, Scoring points: [{{“question”: “Does the image shows a graph of a function?”, “score”: 0.1}},{{“question”: “Does the curve pass through the point (0, 1)?”, “score”: 0.2}},{{“question”: “Does the curve asymptotically approach the x-axis when x approaches negative infinity?”, “score”: 0.2}},{{“question”: “...
-
[16]
Prompt: “Generate a schematic diagram of an animal cell structure. Label cell membrane, cytoplasm, nucleus, and mitochondria.”, Scoring points: [{{“question”: “Does the image contain a cell structure?”, “score”: 0.1}},{{“question”: “Does the cell membrane label correspond to the correct position in the cell?”, “score”: 0.2}},{{“question”: “Does the cytopl...
-
[17]
Remember that the first image is the generated image from a model, and the second image is the ground truth image. You should **only evaluate the generated image**, while the ground truth image can be used for reference
-
[18]
First, give a detailed description of all the components in the generated image
-
[19]
You should generate a **detailed** reasoning **step by step**
For each scoring point, you should use the ground truth image as **reference information** to make more accurate judgments. You should generate a **detailed** reasoning **step by step**. It should first analyze the question, what subject knowledge it requires and what information you need to refer to from the ground truth image, then analyze whether the g...
-
[20]
For each perspective, first provide a **detailed step-by-step** reasoning, e.g
You should also evaluate the generated image from some global perspectives provided below. For each perspective, first provide a **detailed step-by-step** reasoning, e.g. recognize all the text labels for spelling, then give a score in the defined range. ## Global Perspectives
-
[21]
Spelling (range: 0-2): The spelling of the text in the image, including the notations and equations. You should first recognize the text in the image in the reasoning, then check the spelling of the text. Specifically: - 0: There are critical errors in spelling, notations or equations which significantly hinders the understanding of the image. - 1: There ...
-
[22]
Each component in the image should be clearly readable and identifiable
Readability (range: 0-2): The readability of the image. Each component in the image should be clearly readable and identifiable. All text labels and marks should in the right place and are not overlapped or occluded by other elements. If the image is geometry or diagram, there should not be unlabeled points or lines or duplicated labels. If the image is a...
-
[23]
Logical Consistency (range: 0-2): The logical consistency of the image. Check the correctness of all the marks, text, musical notes, etc. If the image is geometry, check the correctness of each marked angles, lengths, coordinates, etc. If the image is a plot or chart, check the correctness of each data point, the axis, the ticks, the legend, etc. You shou...
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.