Graphic-Design-Bench: A Comprehensive Benchmark for Evaluating AI on Graphic Design Tasks

2026 · cs.CV · arXiv 2604.04192

2 Pith papers cite this work. Polarity classification is still indexing.



abstract

We introduce GraphicDesignBench (GDB), the first comprehensive benchmark suite designed specifically to evaluate AI models on the full breadth of professional graphic design tasks. Unlike existing benchmarks that focus on natural-image understanding or generic text-to-image synthesis, GDB targets the unique challenges of professional design work: translating communicative intent into structured layouts, rendering typographically faithful text, manipulating layered compositions, producing valid vector graphics, and reasoning about animation. The suite comprises 50 tasks organized along five axes: layout, typography, infographics, template & design semantics, and animation, each evaluated under both understanding and generation settings, and grounded in real-world design templates drawn from the LICA layered-composition dataset. We evaluate a set of frontier closed-source models using a standardized metric taxonomy covering spatial accuracy, perceptual quality, text fidelity, semantic alignment, and structural validity. Our results reveal that current models fall short on the core challenges of professional design: spatial reasoning over complex layouts, faithful vector code generation, fine-grained typographic perception, and temporal decomposition of animations remain largely unsolved. While high-level semantic understanding is within reach, the gap widens sharply as tasks demand precision, structure, and compositional awareness. GDB provides a rigorous, reproducible testbed for tracking progress toward AI systems that can function as capable design collaborators. The full evaluation framework is publicly available.

representative citing papers

TASTE: A Designer-Annotated Multi-Dimensional Preference Dataset for AI-Generated Graphic Design

cs.CV · 2026-05-20 · conditional · novelty 7.0

TASTE supplies designer ratings across nine criteria for outputs from four text-to-image models, with statistical tests showing moderate agreement and benchmarks where existing scorers reach at most 0.55 macro agreement while a new head reaches 0.611.

Structural Evaluation Metrics for SVG Generation via Leave-One-Out Analysis

cs.LG · 2026-04-09 · unverdicted · novelty 7.0

Element-level leave-one-out analysis yields per-element quality scores and four structural metrics (purity, coverage, compactness, locality) that quantify SVG modularity and enable artifact detection.

citing papers explorer

Showing 2 of 2 citing papers.

TASTE: A Designer-Annotated Multi-Dimensional Preference Dataset for AI-Generated Graphic Design
cs.CV · 2026-05-20 · conditional · none · ref 8 · internal anchor
Structural Evaluation Metrics for SVG Generation via Leave-One-Out Analysis
cs.LG · 2026-04-09 · unverdicted · none · ref 23 · internal anchor
