AtelierEval: Agentic Evaluation of Humans & LLMs as Text-to-Image Prompters

2026 · cs.AI · arXiv 2605.22645

1 Pith paper cites this work. Polarity classification is still indexing.

abstract

Text-to-image (T2I) systems increasingly rely on upstream prompters, either humans or multimodal large language models (MLLMs), to translate user intent into detailed prompts. Yet current benchmarks fix the prompt and only evaluate T2I models, leaving the prompting proficiency of this upstream component entirely unmeasured. We introduce AtelierEval, the first unified benchmark that quantifies prompting proficiency across 360 expert-crafted tasks. Grounded in a cognitive view, it spans three task categories and instantiates tasks using a taxonomy of real-world challenges, with a dual interface for both humans and MLLMs. To enable scalable and reliable evaluation, we propose AtelierJudge, a skill-based, memory-augmented agentic evaluator. It produces subjective and objective scores for prompt-image pairs, achieving a Spearman correlation of 0.79 with human experts, approaching human performance. Extensive experiments benchmark 8 MLLMs against 48 human users across 4 T2I backends, validate AtelierEval as a robust diagnostic tool, and reveal the superiority of mimicry over planning, advocating for an image-augmented direction for future prompters. Our work is released to support future research.
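The abstract reports agreement between AtelierJudge and human experts as a Spearman correlation. As a minimal sketch of what that statistic measures (not the authors' evaluation code), the snippet below ranks two score lists and computes the Pearson correlation of the ranks. The function names `average_ranks` and `spearman` and the two score lists are hypothetical and chosen only for illustration; they do not come from the paper.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman's rho: the Pearson correlation of the two rank vectors."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Hypothetical scores for five prompt-image pairs (illustrative only).
judge_scores = [4.5, 3.0, 2.0, 5.0, 1.5]
human_scores = [4.0, 3.5, 1.0, 5.0, 2.0]

rho = spearman(judge_scores, human_scores)
print(round(rho, 2))  # 0.9
```

Because the statistic depends only on rank order, an evaluator can score on a different scale from the human raters and still correlate strongly, provided it orders the prompt-image pairs similarly.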

representative citing papers

Distributionally Robust Set Representation Learning Under Inference-Time Element Corruption

cs.LG · 2026-05-28 · unverdicted · novelty 6.0

SW-DRSO optimizes a tractable surrogate of worst-case expected loss over plausible inference-time corruptions using a barycentric adversary approximated via simplex weights.

citing papers explorer

Showing 1 of 1 citing paper.

Distributionally Robust Set Representation Learning Under Inference-Time Element Corruption

cs.LG · 2026-05-28 · unverdicted · none · ref 8 · internal anchor
