EPIG: Emotion-Based Prompting for Personalised Image Generation

Emna Othmen; Lotfi Ben Romdhane; Mohamed Yassine Landolsi

arxiv: 2606.13247 · v1 · pith:EYK5DJ6Unew · submitted 2026-06-11 · 💻 cs.AI

Pith reviewed 2026-06-27 06:52 UTC · model grok-4.3

classification 💻 cs.AI

keywords emotion-based prompting · valence-arousal · text-to-image diffusion · prompt enrichment · arousal control · personalized generation · training-free method

The pith

EPIG reduces mean arousal error in text-to-image generation by 14 percent relative to naive insertion by enriching prompts with valence-arousal structure.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to show that generic prompts limit emotional expression in diffusion models and that inserting psychologically grounded valence-arousal details into prompts can steer outputs toward more coherent affect without retraining the model. A reader would care because most current image generators produce emotionally flat results even when users want specific moods, and the method stays lightweight enough for personal or low-resource use. Experiments on ten prompts demonstrate statistically significant drops in arousal mismatch versus naive insertion and LLM expansion baselines while valence alignment and CLIPScore remain intact. The gains are largest on prompts that name explicit subjects such as people or animals.

Core claim

EPIG enriches the emotion-related parts of a prompt using valence-arousal representations and role-aware structuring; the resulting emotion-aware prompts then guide the generative process toward more emotionally coherent images, cutting mean arousal error by 14 percent relative to naive insertion and 12 percent relative to LLM-based expansion, with the effect reaching 17 percent on subject-heavy prompts and without harming valence alignment or semantic consistency.
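The percentages in the core claim are relative reductions in a mean error. The sketch below shows one plausible way to compute such a figure, assuming arousal error is the absolute difference between the intended arousal and the arousal estimated from the generated image; the abstract does not define the metric, so that definition, the function names, and the sample numbers are all illustrative assumptions.

```python
from statistics import mean


def mean_abs_error(targets, predictions):
    """Mean absolute difference between intended and estimated arousal."""
    if len(targets) != len(predictions):
        raise ValueError("targets and predictions must have the same length")
    return mean(abs(t - p) for t, p in zip(targets, predictions))


def relative_reduction(baseline_error, method_error):
    """Fractional drop in error relative to a baseline (0.14 means 14 percent)."""
    if baseline_error <= 0:
        raise ValueError("baseline_error must be positive")
    return (baseline_error - method_error) / baseline_error


# Made-up numbers for illustration only; they are not taken from the paper.
targets = [0.8, 0.2, 0.6, 0.9]
baseline_pred = [0.5, 0.5, 0.4, 0.5]
method_pred = [0.6, 0.4, 0.5, 0.6]

baseline_err = mean_abs_error(targets, baseline_pred)
method_err = mean_abs_error(targets, method_pred)
print(f"relative reduction: {relative_reduction(baseline_err, method_err):.1%}")
```

A relative figure of this kind says nothing about the absolute size of the error, which is one reason the raw per-prompt scores matter.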

What carries the argument

Valence-arousal psychological framework translated into structured, role-aware prompt enrichment that modifies only the input text before it reaches the frozen diffusion model.
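To make the idea of role-aware enrichment concrete, here is a deliberately simple sketch. It is not the EPIG procedure: the descriptor words, the sign-based thresholds, the assumed value range of [-1, 1], and the template that separates the subject from the scene are all hypothetical choices made for illustration, since the abstract does not publish the actual templates.

```python
# Hypothetical illustration only: the descriptor words, thresholds, and
# template below are assumptions, not the templates used by EPIG.


def describe_affect(valence, arousal):
    """Map valence and arousal in [-1, 1] to coarse descriptor words."""
    if not (-1.0 <= valence <= 1.0 and -1.0 <= arousal <= 1.0):
        raise ValueError("valence and arousal must lie in [-1, 1]")
    tone = "joyful" if valence >= 0 else "sorrowful"
    energy = "energetic" if arousal >= 0 else "calm"
    return tone, energy


def enrich_prompt(subject, scene, valence, arousal):
    """Attach affect words to two roles: the subject and the surrounding scene."""
    tone, energy = describe_affect(valence, arousal)
    return f"a {tone}, {energy} {subject} {scene}, {energy} atmosphere"


print(enrich_prompt("child", "running on a beach", valence=0.8, arousal=0.7))
```

The only point of the sketch is that the enrichment happens entirely in text, before the prompt reaches the frozen generator, so any steering effect has to come from how the model interprets the added words.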

If this is right

The method scales to any prompt that names a concrete subject without requiring model changes.
Arousal control improves most when the prompt already contains a person, child, or animal.
Valence alignment and overall semantic content stay within the range of standard CLIPScore values.
The approach remains usable in settings where training or fine-tuning is unavailable.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Similar structured enrichment could be tested on video or 3-D generators that also rely on text prompts.
The same valence-arousal insertion pattern might reduce the need for post-generation editing in creative tools.
If arousal control generalizes across different diffusion backbones, prompt-level methods could become a standard first step before model-level alignment.

Load-bearing premise

Translating valence and arousal values into ordinary prompt text will reliably steer a diffusion model toward images whose perceived emotional intensity matches the intended values.

What would settle it

A controlled test in which human raters or an independent arousal estimator assign arousal scores to EPIG-generated images that show no statistically significant reduction in error compared with the two baselines on the same prompt set.
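One way such a controlled comparison could be analysed is a paired test over per-prompt errors. The sketch below uses an exact sign-flip permutation test, which is feasible at ten prompts because only 2^10 sign patterns exist. The choice of test and the function name are assumptions for illustration; the abstract does not say which test the authors used.

```python
from itertools import product


def paired_sign_flip_pvalue(errors_method, errors_baseline):
    """Exact one-sided paired permutation test.

    Null hypothesis: the sign of each per-prompt difference is arbitrary.
    Returns the share of sign patterns whose summed difference is at least
    as large as the observed one (a small value favours the method).
    """
    if len(errors_method) != len(errors_baseline):
        raise ValueError("both error lists must cover the same prompts")
    # Positive difference means the method had lower error on that prompt.
    diffs = [b - m for m, b in zip(errors_method, errors_baseline)]
    observed = sum(diffs)
    count = 0
    total = 0
    for signs in product((1, -1), repeat=len(diffs)):
        total += 1
        # Small tolerance guards against floating-point ties.
        if sum(s * d for s, d in zip(signs, diffs)) >= observed - 1e-12:
            count += 1
    return count / total
```

With ten prompts the smallest attainable one-sided p-value is 1/1024, so the test can reach conventional thresholds, but only if the method wins consistently across prompts.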

Original abstract

Text-to-image diffusion models have achieved impressive results in synthesizing high-quality images from natural language prompts. However, commonly used prompting strategies remain relatively generic, limiting the model's ability to accurately express emotional intent and nuanced affective attributes. This work proposes EPIG, a method that enhances emotional expressiveness at the prompt level prior to image generation. Grounded in psychologically informed emotion representations (valence-arousal) and leveraging structured, role-aware prompt enrichment, EPIG enriches emotion-related components of prompts without modifying or retraining the image generation backbone. The resulting emotion-aware prompts guide the generative process toward more emotionally coherent visual outputs, with particular effectiveness in controlling arousal. EPIG is lightweight, training-free, and well suited for resource-constrained and personalized image generation scenarios. Experimental results on a benchmark of 10 diverse prompts show that EPIG reduces mean arousal error compared to strong baselines, including naive insertion and LLM-based prompt expansion, with reductions of 14% and 12%, respectively. These improvements are statistically significant. EPIG also preserves valence alignment and semantic consistency, as measured by CLIPScore and supported by ablation studies. The effect is more pronounced on prompts containing explicit subjects such as humans, children, or animals, where the reduction reaches 17%, highlighting the subject-sensitive behavior of the proposed method.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

private letter to a colleague

EPIG is a simple training-free prompt tweak using valence-arousal that gives modest arousal gains on a 10-prompt test but needs more evidence to judge.

The main thing here is that EPIG enriches prompts with structured valence-arousal descriptors and role-aware elements to steer diffusion models toward better emotional outputs, especially arousal, without any training or model changes.

It does a few things cleanly. The method stays lightweight and external to the generator, which suits personalized or low-resource use. They test against straightforward baselines like direct insertion and LLM expansion, report 14% and 12% drops in mean arousal error, claim statistical significance, run ablations, and show valence and CLIPScore stay intact. The larger effect on prompts with people, children, or animals is noted and makes sense given the subject focus.

The evaluation is the clear weak point. A benchmark of ten prompts is small, and the abstract gives no error bars, raw data, or full details on how the significance tests were run or how prompts were chosen. That makes the central claim harder to assess without the methods section. The assumption that valence-arousal text reliably translates into visual emotional control is used but not deeply tested beyond the reported metrics.

This paper is for people doing prompt engineering on existing image models who want a quick way to add emotional nuance. A reader focused on practical tweaks would get the method and the comparative numbers.

I would flag it for a reading group as a maybe. I would not cite it in my own work soon. It deserves peer review so the stats, dataset, and reproducibility can be checked properly.

Referee Report

3 major / 2 minor

Summary. The paper proposes EPIG, a training-free prompt enrichment technique that incorporates psychologically grounded valence-arousal descriptors via structured, role-aware text additions to improve emotional coherence (especially arousal control) in text-to-image diffusion outputs. On a benchmark of 10 diverse prompts, it reports mean arousal error reductions of 14% versus naive insertion and 12% versus LLM-based expansion (both statistically significant), while preserving valence alignment and CLIPScore; effects are larger (17%) on prompts with explicit human/animal subjects, supported by ablations.

Significance. If the empirical results hold under scrutiny, EPIG demonstrates that lightweight, external prompt-level interventions grounded in affective psychology can measurably improve control over emotional attributes without retraining or modifying the generative backbone. This would be useful for personalized and resource-constrained scenarios, with the subject-sensitive behavior and ablation support adding practical value.

major comments (3)

[Experimental Results] Experimental Results (abstract and main evaluation): The central claim of statistically significant arousal error reductions (14% and 12%) is presented without error bars, exact p-values, the statistical test employed, raw per-prompt scores, or dataset construction details (selection criteria, exclusion rules, or prompt sources). With only 10 prompts, these omissions make independent verification of significance and generalizability impossible.
[Methods] Methods (prompt enrichment procedure): The translation from valence-arousal values into structured prompt text is described at a high level but lacks concrete examples of the enrichment templates, role-aware components, or how arousal/valence targets are chosen for each prompt. This step is load-bearing for the claimed steering effect yet cannot be reproduced or stress-tested from the given description.
[Ablation studies] Ablation studies: The paper states that ablations support the results, but provides no quantitative breakdown of which components (e.g., valence vs. arousal descriptors, role awareness) were removed and their individual impact on the reported error reductions or CLIPScore.
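The error bars requested in the first comment could be produced in several ways. As a minimal sketch, assuming per-prompt arousal errors are available, a percentile bootstrap over prompts gives an interval for the mean; the function name, defaults, and sample values are illustrative and do not describe the authors' procedure.

```python
import random
from statistics import mean


def bootstrap_ci(values, n_resamples=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of `values`."""
    if not values:
        raise ValueError("values must not be empty")
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    n = len(values)
    # Resample prompts with replacement and record the mean of each resample.
    means = sorted(mean(rng.choices(values, k=n)) for _ in range(n_resamples))
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Made-up per-prompt errors for illustration; not data from the paper.
per_prompt_errors = [0.21, 0.35, 0.18, 0.40, 0.27, 0.31, 0.22, 0.38, 0.25, 0.30]
print(bootstrap_ci(per_prompt_errors, n_resamples=2_000))
```

With only ten prompts the interval is likely to be wide, which is the practical content of the referee's concern about generalizability.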

minor comments (2)

The abstract and text repeatedly use 'statistically significant' without defining the threshold or test; this should be clarified for precision.
No mention of how CLIPScore was computed (model variant, reference text) or whether valence alignment was measured via the same psychological scales used for prompting.
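For context on the second comment, the commonly used reference-free CLIPScore is a rescaled, clipped cosine similarity between an image embedding and a text embedding, with a weight of 2.5. The sketch below computes that quantity from embeddings that are assumed to be already available; whether the paper uses this variant, which CLIP model produced the embeddings, and whether the text is the original or the enriched prompt are exactly the open questions the comment raises.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    if len(u) != len(v):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_u * norm_v)


def clip_score(image_embedding, text_embedding, weight=2.5):
    """Reference-free CLIPScore: weight * max(cosine similarity, 0)."""
    return weight * max(cosine_similarity(image_embedding, text_embedding), 0.0)
```

Scoring against the enriched prompt rather than the original one would change what "semantic consistency" means, so the reference text needs to be stated alongside the score.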

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their thorough review and constructive comments on our manuscript. We address each of the major comments below, indicating the revisions we will make to improve the paper's clarity, reproducibility, and completeness.

Referee: [Experimental Results] Experimental Results (abstract and main evaluation): The central claim of statistically significant arousal error reductions (14% and 12%) is presented without error bars, exact p-values, the statistical test employed, raw per-prompt scores, or dataset construction details (selection criteria, exclusion rules, or prompt sources). With only 10 prompts, these omissions make independent verification of significance and generalizability impossible.

Authors: We agree that these details are necessary for verification and reproducibility. In the revised manuscript we will add error bars to all reported metrics, specify the statistical test used along with exact p-values, include the full set of raw per-prompt scores in an appendix, and provide complete information on prompt sources, selection criteria, and any exclusion rules.

revision: yes

Referee: [Methods] Methods (prompt enrichment procedure): The translation from valence-arousal values into structured prompt text is described at a high level but lacks concrete examples of the enrichment templates, role-aware components, or how arousal/valence targets are chosen for each prompt. This step is load-bearing for the claimed steering effect yet cannot be reproduced or stress-tested from the given description.

Authors: We acknowledge that concrete examples are required for reproducibility. The revised manuscript will include explicit examples of the enrichment templates, showing the role-aware components and the procedure used to derive target valence-arousal values from each original prompt.

revision: yes

Referee: [Ablation studies] Ablation studies: The paper states that ablations support the results, but provides no quantitative breakdown of which components (e.g., valence vs. arousal descriptors, role awareness) were removed and their individual impact on the reported error reductions or CLIPScore.

Authors: We will expand the ablation section with quantitative results, including tables that isolate the contribution of each component (valence descriptors, arousal descriptors, and role awareness) to the observed changes in arousal error, valence alignment, and CLIPScore.

revision: yes

Circularity Check

0 steps flagged

No circularity in derivation chain

The paper describes a training-free prompting enrichment method grounded in valence-arousal psychology and evaluates it empirically on a 10-prompt benchmark against baselines. No equations, derivations, fitted parameters, or self-referential definitions appear in the provided text. Claims rest on external measurements (arousal error, CLIPScore) rather than any reduction to inputs by construction. No load-bearing self-citations or uniqueness theorems are invoked. This is a standard empirical method paper with independent experimental support.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The claim rests on the domain assumption that valence-arousal representations translate effectively into prompt text that diffusion models can interpret for emotional control; no free parameters or invented entities are introduced in the abstract.

axioms (2)

domain assumption: The valence-arousal model from psychology accurately represents emotional intent expressible in natural language prompts. Invoked when the paper states the method is grounded in psychologically informed emotion representations (valence-arousal).
domain assumption: Prompt-level enrichment alone can steer the generative process without model modification or retraining. Core premise of the training-free claim.

pith-pipeline@v0.9.1-grok · 5773 in / 1318 out tokens · 24546 ms · 2026-06-27T06:52:18.589834+00:00 · methodology

