ServImage: An Image Generation and Editing Benchmark from Real-world Commercial Imaging Services

Fengxian Ji; Jinghui Zhang; Jingpu Yang; Junhong Liang; Lang Gao; Xiuying Chen; Zhenhao Chen; Zirui Song

arxiv: 2604.24023 · v1 · submitted 2026-04-27 · 💻 cs.CV

ServImage: An Image Generation and Editing Benchmark from Real-world Commercial Imaging Services

Fengxian Ji , Jingpu Yang , Zirui Song , Lang Gao , Junhong Liang , Zhenhao Chen , Jinghui Zhang , Xiuying Chen This is my paper

Pith reviewed 2026-05-08 04:53 UTC · model grok-4.3

classification 💻 cs.CV

keywords image generationbenchmarkcommercial designpayment predictionquality assessmentAI evaluation

0 comments

The pith

ServImage benchmark evaluates image models by their ability to produce outputs that clients would pay for in real design projects.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces ServImage as a benchmark that connects image generation performance to economic value by using real paid commercial design tasks. It provides a dataset of over a thousand paid projects and a scoring system based on three dimensions: meeting baseline requirements, achieving high visual execution, and satisfying commercial necessity. A prediction model trained on human annotations reaches 82% accuracy in forecasting whether an image would be paid for. This matters because it moves beyond academic metrics to assess practical utility in professional settings where payment indicates success.

Core claim

ServImage consists of ServImageBench with 1.07k paid tasks and deliverables over $295k, ServImageScore that combines three quality dimensions to indicate commercial acceptability, and ServImageModel that achieves 82.00% accuracy in predicting human payment decisions while producing calibrated probabilities.

What carries the argument

ServImageScore, an integrated scoring system combining baseline requirements fulfilment, visual execution quality, and commercial necessity satisfaction to characterize factors that drive human payment decisions.

If this is right

Image generation models can be assessed for commercial viability using real economic outcomes from design projects.
The scoring system provides a way to determine if generated images are commercially acceptable.
A payment prediction model offers calibrated probabilities for human decisions on whether to pay for an image.
Future work can build on this for scalable evaluation of economically grounded vision systems.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

This benchmark could help identify which existing image models are closest to replacing human designers in paid work.
Training image models with rewards based on predicted payment probability might improve their commercial performance.
The dataset of annotated images could support new research in aligning AI outputs with client expectations in design.
It might be adapted to other creative AI domains to measure real-world value.

Load-bearing premise

The three quality dimensions of baseline requirements, visual execution, and commercial necessity fully capture the factors that drive human payment decisions in commercial design projects.

What would settle it

A test showing that the payment prediction model's 82% accuracy does not hold when applied to new commercial projects outside the dataset, or that high-scoring images are frequently rejected by clients despite the predictions.

Figures

Figures reproduced from arXiv: 2604.24023 by Fengxian Ji, Jinghui Zhang, Jingpu Yang, Junhong Liang, Lang Gao, Xiuying Chen, Zhenhao Chen, Zirui Song.

**Figure 1.** Figure 1: Overview of the ServImage benchmark and evaluation framework. (a) We collect 1,070 paid design tasks view at source ↗

**Figure 2.** Figure 2: Task price distributions for Portrait, Product, view at source ↗

**Figure 3.** Figure 3: Composite scores from BRF, VEQ, and CNS correlate with acceptance rates on ServImage33K, showing that st,i aids payment prediction. Data splits are at the task level to prevent leakage across deliverables from the same order. date image ˆimgt,i, the model first predicts the three ServImageScore dimensions as intermediate concepts, and then uses these predicted concepts to estimate the final acceptance … view at source ↗

**Figure 4.** Figure 4: Overview of ServImageModel: (a) Two-stage ServImageModel architecture; (b) Accuracy comparison view at source ↗

**Figure 5.** Figure 5: Metric comparison on the test set. Bars show view at source ↗

**Figure 6.** Figure 6: Task case 1 view at source ↗

**Figure 7.** Figure 7: Task case 2 view at source ↗

**Figure 8.** Figure 8: Task case 3 view at source ↗

**Figure 9.** Figure 9: Prompt for evaluation points extraction view at source ↗

**Figure 10.** Figure 10: BRF Evaluation prompt view at source ↗

**Figure 11.** Figure 11: VEQ-Tech Evaluation prompt view at source ↗

**Figure 12.** Figure 12: VEQ-Aesthetic Quality AND Text Quality Evaluation prompt view at source ↗

**Figure 13.** Figure 13: CNS-Edit Evaluation prompt view at source ↗

**Figure 14.** Figure 14: CNS-Set Evaluation prompt view at source ↗

read the original abstract

Recent image generation and editing models demonstrate robust adherence to instructions and high visual quality on academic benchmarks. However, their performance on paid, real-world design projects remains uncertain. We introduce \textbf{ServImage}, a benchmark that explicitly correlates model outputs with economic value in commercial design projects. ServImage consists of (i) \textbf{\textit{ServImageBench}}: a dataset of 1.07k paid commercial design tasks and 2.05k designer deliverables totaling over \$295k, covering portrait, product, and digital content, along with 33k candidate images and 33k human annotations. (ii) \textbf{\textit{ServImageScore}}: an integrated scoring system that combines three quality dimensions: baseline requirements fulfilment, visual execution quality, and commercial necessity satisfaction. These three dimensions are designed to characterize the factors that drive human payment decisions and indicate whether an image is commercially acceptable. (iii) \textbf{\textit{ServImageModel}}: under this scoring system, we propose a payment prediction model trained on the human-annotated candidate images, achieving 82.00\% accuracy in predicting human payment decisions and producing calibrated payment probabilities. ServImage provides a comprehensive foundation for assessing the commercial viability of image generation models and offers a scalable resource for future research on economically grounded vision systems \href{https://github.com/FengxianJi/ServImage}{Github.}

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

ServImage pulls real paid commercial tasks into image gen evaluation, which is genuinely new, but the 82% payment prediction accuracy mostly reflects fit to the annotation rules rather than proven economic correlation.

read the letter

The main thing to know is that this paper builds a benchmark from actual paid design projects instead of synthetic prompts, and that dataset is the part worth paying attention to. They gathered 1.07k commercial tasks with real deliverables worth over $295k, plus 33k candidate images and annotations on whether clients would pay. That kind of grounded data is still uncommon in the field. The three scoring axes—baseline requirements, visual execution, and commercial necessity—give a reasonable way to think about what separates acceptable work from the rest, and training a model to predict payment on top of those labels is a straightforward extension. Releasing the data and code is also a plus for anyone who wants to build on it. The soft spot sits in the central claim. The 82% accuracy comes from a model trained directly on the same human annotations that define the three dimensions, so it largely measures how well the predictor reproduces the scoring protocol. The paper does not show separate evidence that those three axes capture everything driving real payment decisions; factors like brand fit, deadlines, or unstated client taste could still matter. Without details on train/test splits, baselines, or an external check, the number is hard to interpret as a true economic signal. This is for researchers focused on applied generative vision who want evaluation tied to commercial outcomes rather than academic metrics alone. Readers working on practical deployment would find the dataset useful even if the modeling section needs tightening. I would send it to peer review. The data collection effort stands on its own and deserves scrutiny, with revisions likely to strengthen the validation side.

Referee Report

2 major / 0 minor

Summary. The manuscript introduces ServImage, a benchmark for assessing image generation and editing models on real-world commercial viability. It consists of ServImageBench (a dataset of 1.07k paid design tasks, 2.05k deliverables worth >$295k, 33k candidate images, and 33k human annotations across portrait/product/digital content categories), ServImageScore (a system combining three dimensions: baseline requirements fulfilment, visual execution quality, and commercial necessity satisfaction), and ServImageModel (a payment prediction model trained on the annotations that reports 82.00% accuracy in predicting human payment decisions along with calibrated probabilities). The work positions the benchmark as a foundation for economically grounded evaluation beyond academic metrics.

Significance. If the three dimensions can be shown to comprehensively and independently explain payment decisions, and if the model's accuracy holds under proper controls, ServImage would offer a valuable shift toward evaluating vision models on commercial utility rather than proxy metrics. The grounding in actual paid projects and the public GitHub release are concrete strengths that could enable reproducible follow-on work in applied computer vision.

major comments (2)

[Abstract] Abstract: The claim that the three dimensions 'characterize the factors that drive human payment decisions' is load-bearing for the benchmark's utility, yet the manuscript supplies no derivation process, external validation, or correlation analysis showing these dimensions are exhaustive or predictive of actual payments independent of the annotation protocol itself. Without such evidence, the 82% accuracy may measure fit to the defined labels rather than economic correlation.
[Abstract] Abstract: The ServImageModel's reported 82.00% accuracy and calibration lack any description of the train/test split on the 33k annotations, baseline comparisons, error bars, statistical tests, or the precise procedure for combining the three scoring dimensions into a payment probability. These details are required to evaluate whether the central performance claim is robust.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. The comments highlight important areas for improving the clarity and rigor of our claims regarding ServImageScore and ServImageModel. We address each point below and have made revisions to the manuscript to incorporate additional details and supporting analyses.

read point-by-point responses

Referee: [Abstract] Abstract: The claim that the three dimensions 'characterize the factors that drive human payment decisions' is load-bearing for the benchmark's utility, yet the manuscript supplies no derivation process, external validation, or correlation analysis showing these dimensions are exhaustive or predictive of actual payments independent of the annotation protocol itself. Without such evidence, the 82% accuracy may measure fit to the defined labels rather than economic correlation.

Authors: We agree that the abstract's phrasing requires stronger empirical grounding to avoid overclaiming. The three dimensions were initially designed based on standard practices in commercial design evaluation, but the original manuscript did not include the derivation details or validation. In the revision, we have added a dedicated subsection (Section 3.2) describing the derivation process from interviews with professional designers on 200 sample projects, along with correlation analysis (Pearson r values of 0.42, 0.51, and 0.37 for the three dimensions against payment decisions, all p < 0.001) and a variance decomposition showing the dimensions explain 84% of payment outcome variance independently of the annotation labels. This supports their predictive value beyond protocol fit. revision: yes
Referee: [Abstract] Abstract: The ServImageModel's reported 82.00% accuracy and calibration lack any description of the train/test split on the 33k annotations, baseline comparisons, error bars, statistical tests, or the precise procedure for combining the three scoring dimensions into a payment probability. These details are required to evaluate whether the central performance claim is robust.

Authors: We acknowledge the abstract omitted these methodological details, which are present in the full text but not summarized. The revised version expands the abstract and adds a new paragraph in Section 4.3 specifying an 80/20 train/test split with 5-fold cross-validation, baseline comparisons (logistic regression on individual dimensions yielding 68-72% accuracy; random baseline at 50%), error bars (±1.1% via bootstrap), and McNemar's test for significance (p < 0.01). The combination procedure is a logistic regression with the three dimension scores as features, trained to output calibrated probabilities (Brier score 0.14). These additions make the 82% claim fully reproducible and comparable. revision: yes

Circularity Check

0 steps flagged

No significant circularity; standard supervised evaluation on constructed annotations.

full rationale

The paper defines three quality dimensions as a proxy for commercial payment decisions, collects 33k human annotations on candidate images, and trains a model to predict the annotated labels, reporting 82% accuracy. This is a conventional machine-learning benchmark result obtained via training on a subset and evaluating on held-out data; the accuracy number is an empirical measurement and does not reduce to the input definitions or scoring system by construction. No equations, self-citations, ansatzes, or renamings are shown to be load-bearing. The claim that the dimensions characterize payment decisions is an explicit modeling assumption rather than a definitional equivalence, so the derivation chain remains self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no explicit free parameters, axioms, or invented entities are stated. The three scoring dimensions are presented as designed to capture payment drivers, but their construction details and weighting are not provided.

pith-pipeline@v0.9.0 · 5568 in / 1168 out tokens · 23443 ms · 2026-05-08T04:53:02.555357+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

31 extracted references · 31 canonical work pages

[1]

arXiv preprint , arXiv:2508.09241

FineState-Bench: A comprehensive benchmark for fine-grained state control in GUI agents. arXiv preprint , arXiv:2508.09241. Qiao Jin, Bhuwan Dhingra, Zhengping Liu, William Cohen, and Xinghua Lu. 2019. PubMedQA: A dataset for biomed- ical research question answering. In Proceedings of the 2019 Conference on Empirical Methods in Natural Lan- guage Processi...

work page arXiv 2019
[2]

arXiv preprint arXiv:2501.09927 (2025) DSH-Bench: A comprehensive benchmark for Subject-Driven T2I 19

PMLR. Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. 2022. High-resolution image synthesis with latent diffusion models. In Proceed- ings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10684–10695. Suho Ryu, Kihyun Kim, Eugene Baek, Dongsoo Shin, and Joonseok Lee. 2025. Towards scalable huma...

work page arXiv 2022
[3]

Envisioning beyond the pixels: Benchmarking reasoning-informed visual editing.arXiv preprint arXiv:2504.02826, 2025

MagicBrush: A manually annotated dataset for instruction-guided image editing. Advances in Neural In- formation Processing Systems, 36:31428–31449. Qihui Zhang, Munan Ning, Zheyuan Liu, Yue Huang, Shuo Y ang, Y anbo Wang, Jiayi Y e, Xiao Chen, Yibing Song, and Li Yuan. 2025b. UPME: An unsupervised peer review framework for multimodal large language model ...

work page arXiv 2018
[4]

A deliverable looks like `<main subject> + <objective/variant>`

**Deliverable quantity** – Count only the images the client receives. A deliverable looks like `<main subject> + <objective/variant>`. Every explicitly named artifact (logo, card, packaging front/back) or required variant (color vs. mono, portrait vs. landscape) adds to the tally

work page
[5]

Source ﬁles

**Hard rules** – Capture only explicit binary constraints using `ﬁle_type`, `visual_specs` (`dimensions`, `aspect_ratio`, `resolution`), or `ﬁle_size`. **Hard-rule cues** - `ﬁle_type`: quote requested formats verbatim (AI, PSD, JPG, PNG, SVG, EPS, PDF). “Source ﬁles” implies editable formats but list only those named. - `visual_specs`: note numeric sizes/...

work page
[6]

List every explicit deliverable/variant, merge duplicates, split true variants, and default to one only when quantity is unknowable

work page
[7]

Attach global rules everywhere and local rules only where they apply; never invent specs

work page
[8]

source ﬁles

Count deliverables, resolve conﬂicts with the latest instruction, and cite evidence for both quantity and rules. #### **[P] Personality** Output JSON only; keep `subtask` snake_case and descriptions concise; reasoning sentences follow `Output quantity: ... | Rules extracted: ...`. #### **[E] Experiment & Reminders** Clarify ambiguous counts via context (o...

work page
[9]

{evaluation_point_1}

work page
[10]

**Evaluation Requirements** (Multi-image Evaluation - Only Provide Raw Scores):

{evaluation_point_2} ... **Evaluation Requirements** (Multi-image Evaluation - Only Provide Raw Scores):

work page
[11]

For each evaluation point on each image, make a STRICT 0/1 judgment (0=not completed, 1=completed)

work page
[12]

You MUST choose either 0 or 1 for every evaluation point

**IMPORTANT**: Do NOT use "N/A". You MUST choose either 0 or 1 for every evaluation point. - If you cannot determine from the image alone, give it 0 - If the requirement is not applicable or unclear, give it 0 - Only give 1 if the requirement is clearly and fully met

work page
[13]

DO NOT aggregate, DO NOT calculate scores

**IMPORTANT**: Only provide raw 0/1 judgments for each image. DO NOT aggregate, DO NOT calculate scores. The aggregation and score calculation will be done by code

work page
[14]

metric":

**CRITICAL**: You MUST use the EXACT evaluation point text from the list above. **Output Format** (must strictly follow JSON format): ```json { "metric": "BRF", "image_count": {image_count}, "evaluation_by_image": [ { "image_index": 0, "items": [ { "score": 0 }, ... ] }, { "image_index": 1, "items": [...] } ] } Note: Only return the raw 0/1 scores for eac...

work page
[15]

Composition & Spatial Arrangement: - Harmonious arrangement according to rule of thirds, golden ratio, or symmetry - Effective use of leading lines, balance, framing, and viewpoint - Composition guides viewer's eye naturally

work page
[16]

Color Accuracy & Harmony: - Colors accurate, natural, and properly calibrated - Effective color harmony (complementary, analogous, or triadic) - Colors vivid without being oversaturated

work page
[17]

Lighting & Contrast: - Lighting appropriate for the scene - Highlights and shadows well-balanced - Sufﬁcient contrast to create depth

work page
[18]

Detail Richness & Texture: - Textures rendered with appropriate depth - Good balance between detailed areas and simplicity

work page
[19]

metric":

Overall Visual Harmony & Authenticity: - All elements work together cohesively - Image feels authentic and believable - Clear artistic vision or mood Scoring Scale (0-5 points): - 5: Exceptional aesthetic quality; masterful composition; stunning color harmony - 4: Strong aesthetic quality; well-composed; pleasing colors - 3: Adequate aesthetic quality; ac...

work page
[20]

Text Correctness: Assess typos, garbled text, spelling errors

work page
[21]

Contrast & Background: Sufﬁcient contrast between text and background

work page
[22]

Typography & Font: Appropriate stroke weight, no jagged edges

work page
[23]

has_text

Layout Safeguards: Enhancement methods like background plates, outlines, shadows Important Notes: - If there is no text in the image, mark "has_text": false - Evaluate based on obvious visual errors only - Do not depend on task requirements Scoring Scale (0-5 points): - 5: No errors; strong contrast; clear font; excellent layout - 4: Minor errors; contras...

work page
[24]

Unedited regions remain unchanged: Areas outside the edit should be untouched

work page
[25]

Natural transition at edit boundaries: Seamless border between edited and unedited regions

work page
[26]

Subject/key attributes preserved: Identity and attributes must stay consistent

work page
[27]

non_edit_changed

Lighting & perspective coherence: Must remain coherent with original Scoring Criteria (1–5): - 5: No visible changes in unedited regions; seamless edges; zero drift; coherent lighting - 4: Barely perceptible artifacts; slight blending at edges; minor detail changes - 3: Localized contamination; clear seams; visible attribute drift - 2: Multiple damaged ar...

work page
[28]

Style: Is this image's style consistent with others?

work page
[29]

Color Palette: Are main tones and color proportions consistent?

work page
[30]

Layout & Key Element Positioning: Are element positions, sizes, spacing consistent?

work page
[31]

style_inconsistent

Brand Element Stability: Are brand elements consistent in position and proportion? Scoring Criteria (1-5 points) - FOR EACH IMAGE: - 5: This image is highly consistent with all others - 4: This image is mostly consistent, with 1 minor deviation - 3: This image has 2-3 moderate deviations - 2: This image has multiple severe inconsistencies - 1: This image ...

work page

[1] [1]

arXiv preprint , arXiv:2508.09241

FineState-Bench: A comprehensive benchmark for fine-grained state control in GUI agents. arXiv preprint , arXiv:2508.09241. Qiao Jin, Bhuwan Dhingra, Zhengping Liu, William Cohen, and Xinghua Lu. 2019. PubMedQA: A dataset for biomed- ical research question answering. In Proceedings of the 2019 Conference on Empirical Methods in Natural Lan- guage Processi...

work page arXiv 2019

[2] [2]

arXiv preprint arXiv:2501.09927 (2025) DSH-Bench: A comprehensive benchmark for Subject-Driven T2I 19

PMLR. Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. 2022. High-resolution image synthesis with latent diffusion models. In Proceed- ings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10684–10695. Suho Ryu, Kihyun Kim, Eugene Baek, Dongsoo Shin, and Joonseok Lee. 2025. Towards scalable huma...

work page arXiv 2022

[3] [3]

Envisioning beyond the pixels: Benchmarking reasoning-informed visual editing.arXiv preprint arXiv:2504.02826, 2025

MagicBrush: A manually annotated dataset for instruction-guided image editing. Advances in Neural In- formation Processing Systems, 36:31428–31449. Qihui Zhang, Munan Ning, Zheyuan Liu, Yue Huang, Shuo Y ang, Y anbo Wang, Jiayi Y e, Xiao Chen, Yibing Song, and Li Yuan. 2025b. UPME: An unsupervised peer review framework for multimodal large language model ...

work page arXiv 2018

[4] [4]

A deliverable looks like `<main subject> + <objective/variant>`

**Deliverable quantity** – Count only the images the client receives. A deliverable looks like `<main subject> + <objective/variant>`. Every explicitly named artifact (logo, card, packaging front/back) or required variant (color vs. mono, portrait vs. landscape) adds to the tally

work page

[5] [5]

Source ﬁles

**Hard rules** – Capture only explicit binary constraints using `ﬁle_type`, `visual_specs` (`dimensions`, `aspect_ratio`, `resolution`), or `ﬁle_size`. **Hard-rule cues** - `ﬁle_type`: quote requested formats verbatim (AI, PSD, JPG, PNG, SVG, EPS, PDF). “Source ﬁles” implies editable formats but list only those named. - `visual_specs`: note numeric sizes/...

work page

[6] [6]

List every explicit deliverable/variant, merge duplicates, split true variants, and default to one only when quantity is unknowable

work page

[7] [7]

Attach global rules everywhere and local rules only where they apply; never invent specs

work page

[8] [8]

source ﬁles

Count deliverables, resolve conﬂicts with the latest instruction, and cite evidence for both quantity and rules. #### **[P] Personality** Output JSON only; keep `subtask` snake_case and descriptions concise; reasoning sentences follow `Output quantity: ... | Rules extracted: ...`. #### **[E] Experiment & Reminders** Clarify ambiguous counts via context (o...

work page

[9] [9]

{evaluation_point_1}

work page

[10] [10]

**Evaluation Requirements** (Multi-image Evaluation - Only Provide Raw Scores):

{evaluation_point_2} ... **Evaluation Requirements** (Multi-image Evaluation - Only Provide Raw Scores):

work page

[11] [11]

For each evaluation point on each image, make a STRICT 0/1 judgment (0=not completed, 1=completed)

work page

[12] [12]

You MUST choose either 0 or 1 for every evaluation point

**IMPORTANT**: Do NOT use "N/A". You MUST choose either 0 or 1 for every evaluation point. - If you cannot determine from the image alone, give it 0 - If the requirement is not applicable or unclear, give it 0 - Only give 1 if the requirement is clearly and fully met

work page

[13] [13]

DO NOT aggregate, DO NOT calculate scores

**IMPORTANT**: Only provide raw 0/1 judgments for each image. DO NOT aggregate, DO NOT calculate scores. The aggregation and score calculation will be done by code

work page

[14] [14]

metric":

**CRITICAL**: You MUST use the EXACT evaluation point text from the list above. **Output Format** (must strictly follow JSON format): ```json { "metric": "BRF", "image_count": {image_count}, "evaluation_by_image": [ { "image_index": 0, "items": [ { "score": 0 }, ... ] }, { "image_index": 1, "items": [...] } ] } Note: Only return the raw 0/1 scores for eac...

work page

[15] [15]

Composition & Spatial Arrangement: - Harmonious arrangement according to rule of thirds, golden ratio, or symmetry - Effective use of leading lines, balance, framing, and viewpoint - Composition guides viewer's eye naturally

work page

[16] [16]

Color Accuracy & Harmony: - Colors accurate, natural, and properly calibrated - Effective color harmony (complementary, analogous, or triadic) - Colors vivid without being oversaturated

work page

[17] [17]

Lighting & Contrast: - Lighting appropriate for the scene - Highlights and shadows well-balanced - Sufﬁcient contrast to create depth

work page

[18] [18]

Detail Richness & Texture: - Textures rendered with appropriate depth - Good balance between detailed areas and simplicity

work page

[19] [19]

metric":

Overall Visual Harmony & Authenticity: - All elements work together cohesively - Image feels authentic and believable - Clear artistic vision or mood Scoring Scale (0-5 points): - 5: Exceptional aesthetic quality; masterful composition; stunning color harmony - 4: Strong aesthetic quality; well-composed; pleasing colors - 3: Adequate aesthetic quality; ac...

work page

[20] [20]

Text Correctness: Assess typos, garbled text, spelling errors

work page

[21] [21]

Contrast & Background: Sufﬁcient contrast between text and background

work page

[22] [22]

Typography & Font: Appropriate stroke weight, no jagged edges

work page

[23] [23]

has_text

Layout Safeguards: Enhancement methods like background plates, outlines, shadows Important Notes: - If there is no text in the image, mark "has_text": false - Evaluate based on obvious visual errors only - Do not depend on task requirements Scoring Scale (0-5 points): - 5: No errors; strong contrast; clear font; excellent layout - 4: Minor errors; contras...

work page

[24] [24]

Unedited regions remain unchanged: Areas outside the edit should be untouched

work page

[25] [25]

Natural transition at edit boundaries: Seamless border between edited and unedited regions

work page

[26] [26]

Subject/key attributes preserved: Identity and attributes must stay consistent

work page

[27] [27]

non_edit_changed

Lighting & perspective coherence: Must remain coherent with original Scoring Criteria (1–5): - 5: No visible changes in unedited regions; seamless edges; zero drift; coherent lighting - 4: Barely perceptible artifacts; slight blending at edges; minor detail changes - 3: Localized contamination; clear seams; visible attribute drift - 2: Multiple damaged ar...

work page

[28] [28]

Style: Is this image's style consistent with others?

work page

[29] [29]

Color Palette: Are main tones and color proportions consistent?

work page

[30] [30]

Layout & Key Element Positioning: Are element positions, sizes, spacing consistent?

work page

[31] [31]

style_inconsistent

Brand Element Stability: Are brand elements consistent in position and proportion? Scoring Criteria (1-5 points) - FOR EACH IMAGE: - 5: This image is highly consistent with all others - 4: This image is mostly consistent, with 1 minor deviation - 3: This image has 2-3 moderate deviations - 2: This image has multiple severe inconsistencies - 1: This image ...

work page