DetailMaster: Can Your Text-to-Image Model Handle Long Prompts?

Daoyuan Chen; Qirui Jiao; Xika Lin; Yaliang Li; Yilun Huang; Ying Shen

arxiv: 2505.16915 · v3 · pith:DAFSZYOInew · submitted 2025-05-22 · 💻 cs.CV · cs.AI

Qirui Jiao, Daoyuan Chen, Yilun Huang, Xika Lin, Ying Shen, Yaliang Li

keywords prompts, long, models, attributes, benchmark, capabilities, character, critical

While recent Text-to-Image (T2I) models show impressive capabilities in synthesizing images from brief descriptions, they struggle with the long, detailed prompts required for professional applications. We present DetailMaster, a comprehensive benchmark for evaluating T2I capabilities on long prompts with complex compositional requirements, accompanied by an automated data construction pipeline and an evaluation workflow. Comprising expert-validated prompts averaging 284.89 tokens, our benchmark introduces four critical evaluation dimensions: Character Attributes, Structured Character Locations, Multi-Dimensional Scene Attributes, and Spatial/Interactive Relationships. Evaluations on various general-purpose and long-prompt-optimized models reveal critical performance limitations, showing that weak encoders struggle to preserve syntactic dependencies within prompts and diffusion models suffer from attribute leakage under detail-intensive conditions. Through a controlled ablation study under varying constraints, we further show that high-fidelity generation requires a synergistic combination of expanded prompt limits and long-prompt training. We open-source our dataset and code to foster progress in long-prompt-driven T2I generation.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Long-Text-to-Image Generation via Compositional Prompt Decomposition
cs.CV 2026-04 unverdicted novelty 7.0

PRISM lets pre-trained text-to-image models handle long prompts by breaking them into compositional parts, predicting noise separately, and merging outputs via energy-based conjunction, matching fine-tuned models whil...