EPIG: Emotion-Based Prompting for Personalised Image Generation
Pith reviewed 2026-06-27 06:52 UTC · model grok-4.3
The pith
By enriching prompts with valence-arousal structure, EPIG reduces mean arousal error in text-to-image generation by 14 percent relative to naive insertion.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
EPIG enriches the emotion-related parts of a prompt using valence-arousal representations and role-aware structuring. The resulting emotion-aware prompts guide the generative process toward more emotionally coherent images, cutting mean arousal error by 14 percent relative to naive insertion and by 12 percent relative to LLM-based expansion. The reduction reaches 17 percent on prompts with explicit subjects, and neither valence alignment nor semantic consistency is harmed.
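As a rough illustration of the arithmetic behind these percentages, the sketch below computes a mean arousal error and the relative reduction between two methods. The abstract does not define the error metric, so the mean absolute difference between target and estimated arousal is an assumption here. The function names and all numbers are hypothetical and are not taken from the paper.

```python
from statistics import mean

def mean_arousal_error(targets, estimates):
    """Mean absolute gap between intended and estimated arousal (assumed metric)."""
    if len(targets) != len(estimates):
        raise ValueError("targets and estimates must have the same length")
    return mean(abs(t - e) for t, e in zip(targets, estimates))

def relative_reduction(baseline_error, method_error):
    """Fractional drop in error relative to the baseline (0.14 means 14 percent)."""
    if baseline_error <= 0:
        raise ValueError("baseline error must be positive")
    return (baseline_error - method_error) / baseline_error

# Made-up arousal values on a 0-1 scale, for illustration only.
targets = [0.8, 0.2, 0.6, 0.9]
baseline_estimates = [0.5, 0.4, 0.4, 0.6]
enriched_estimates = [0.6, 0.3, 0.5, 0.7]

baseline_err = mean_arousal_error(targets, baseline_estimates)
enriched_err = mean_arousal_error(targets, enriched_estimates)
print(f"relative reduction: {relative_reduction(baseline_err, enriched_err):.0%}")
```

A relative reduction depends on the size of the baseline error, so the same percentage can correspond to very different absolute changes.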
What carries the argument
Valence-arousal psychological framework translated into structured, role-aware prompt enrichment that modifies only the input text before it reaches the frozen diffusion model.
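To make the idea concrete, here is a minimal sketch of what role-aware enrichment could look like. The paper's actual templates are not given in the text reviewed here, so everything in this block is an assumption: the three prompt roles, the three-level bucketing, the descriptor tables, and the names `PromptParts`, `bucket`, and `enrich` are all hypothetical.

```python
from dataclasses import dataclass

@dataclass
class PromptParts:
    """A prompt split into roles (hypothetical decomposition)."""
    subject: str  # e.g. "a child"
    action: str   # e.g. "running through a field"
    scene: str    # e.g. "at sunset"

def bucket(value: float) -> str:
    """Map a score in [0, 1] to a coarse level."""
    if not 0.0 <= value <= 1.0:
        raise ValueError("score must lie in [0, 1]")
    if value < 1 / 3:
        return "low"
    if value < 2 / 3:
        return "mid"
    return "high"

# Hypothetical descriptor tables keyed by level; not taken from the paper.
SUBJECT_AROUSAL = {
    "low": "calm, relaxed",
    "mid": "attentive",
    "high": "energetic, wide-eyed",
}
SCENE_VALENCE = {
    "low": "gloomy, muted colours",
    "mid": "neutral lighting",
    "high": "warm, bright colours",
}

def enrich(parts: PromptParts, valence: float, arousal: float) -> str:
    """Attach an arousal cue to the subject and a valence cue to the scene."""
    subject_cue = SUBJECT_AROUSAL[bucket(arousal)]
    scene_cue = SCENE_VALENCE[bucket(valence)]
    return f"{parts.subject}, {subject_cue}, {parts.action}, {parts.scene}, {scene_cue}"

parts = PromptParts("a child", "running through a field", "at sunset")
enriched = enrich(parts, valence=0.9, arousal=0.8)
print(enriched)
# prints: a child, energetic, wide-eyed, running through a field, at sunset, warm, bright colours
```

Only the text changes in this sketch. The enriched string would be passed to the unmodified diffusion model, which matches the training-free setting the paper describes.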
If this is right
- The method scales to any prompt that names a concrete subject without requiring model changes.
- Arousal control improves most when the prompt already contains a person, child, or animal.
- Valence alignment is preserved, and semantic consistency, as measured by CLIPScore, is not degraded.
- The approach remains usable in settings where training or fine-tuning is unavailable.
Where Pith is reading between the lines
- Similar structured enrichment could be tested on video or 3-D generators that also rely on text prompts.
- The same valence-arousal insertion pattern might reduce the need for post-generation editing in creative tools.
- If arousal control generalizes across different diffusion backbones, prompt-level methods could become a standard first step before model-level alignment.
Load-bearing premise
Translating valence and arousal values into ordinary prompt text will reliably steer a diffusion model toward images whose perceived emotional intensity matches the intended values.
What would settle it
A controlled test in which human raters or an independent arousal estimator score EPIG-generated images and find no statistically significant reduction in arousal error relative to the two baselines on the same prompt set.
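One way such a test could be run on a small paired benchmark is an exact sign-flip permutation test on per-prompt errors. This is a sketch under assumptions: the paper does not state which test it used, the function name `paired_sign_flip_test` is hypothetical, and the per-prompt errors below are made up.

```python
from itertools import product

def paired_sign_flip_test(errors_a, errors_b):
    """Exact two-sided sign-flip permutation test on paired per-prompt errors.

    Under the null hypothesis each paired difference is equally likely to
    have either sign, so every sign pattern is enumerated and the share of
    patterns at least as extreme as the observed one is the p-value.
    """
    if len(errors_a) != len(errors_b):
        raise ValueError("both methods need one error per prompt")
    diffs = [a - b for a, b in zip(errors_a, errors_b)]
    observed = abs(sum(diffs))
    extreme = 0
    total = 0
    for signs in product((1, -1), repeat=len(diffs)):
        total += 1
        stat = abs(sum(s * d for s, d in zip(signs, diffs)))
        if stat >= observed - 1e-12:  # small tolerance for float rounding
            extreme += 1
    return extreme / total

# Made-up per-prompt arousal errors for 10 prompts, for illustration only.
baseline_errors = [0.30, 0.25, 0.40, 0.22, 0.35, 0.28, 0.31, 0.27, 0.33, 0.29]
epig_errors = [0.26, 0.24, 0.33, 0.23, 0.30, 0.25, 0.27, 0.26, 0.28, 0.24]
print(f"two-sided p-value: {paired_sign_flip_test(baseline_errors, epig_errors):.4f}")
```

With 10 prompts there are only 1024 sign patterns, so the smallest two-sided p-value this test can return is 2/1024, roughly 0.002. That floor is one reason raw per-prompt scores matter for a benchmark of this size.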
Text-to-image diffusion models have achieved impressive results in synthesizing high-quality images from natural language prompts. However, commonly used prompting strategies remain relatively generic, limiting the model's ability to accurately express emotional intent and nuanced affective attributes. This work proposes EPIG, a method that enhances emotional expressiveness at the prompt level prior to image generation. Grounded in psychologically informed emotion representations (valence-arousal) and leveraging structured, role-aware prompt enrichment, EPIG enriches emotion-related components of prompts without modifying or retraining the image generation backbone. The resulting emotion-aware prompts guide the generative process toward more emotionally coherent visual outputs, with particular effectiveness in controlling arousal. EPIG is lightweight, training-free, and well suited for resource-constrained and personalized image generation scenarios. Experimental results on a benchmark of 10 diverse prompts show that EPIG reduces mean arousal error compared to strong baselines, including naive insertion and LLM-based prompt expansion, with reductions of 14% and 12%, respectively. These improvements are statistically significant. EPIG also preserves valence alignment and semantic consistency, as measured by CLIPScore and supported by ablation studies. The effect is more pronounced on prompts containing explicit subjects such as humans, children, or animals, where the reduction reaches 17%, highlighting the subject-sensitive behavior of the proposed method.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes EPIG, a training-free prompt enrichment technique that incorporates psychologically grounded valence-arousal descriptors via structured, role-aware text additions to improve emotional coherence (especially arousal control) in text-to-image diffusion outputs. On a benchmark of 10 diverse prompts, it reports mean arousal error reductions of 14% versus naive insertion and 12% versus LLM-based expansion (both statistically significant), while preserving valence alignment and CLIPScore; effects are larger (17%) on prompts with explicit human/animal subjects, supported by ablations.
Significance. If the empirical results hold under scrutiny, EPIG demonstrates that lightweight, external prompt-level interventions grounded in affective psychology can measurably improve control over emotional attributes without retraining or modifying the generative backbone. This would be useful for personalized and resource-constrained scenarios, with the subject-sensitive behavior and ablation support adding practical value.
major comments (3)
- [Experimental Results] Experimental Results (abstract and main evaluation): The central claim of statistically significant arousal error reductions (14% and 12%) is presented without error bars, exact p-values, the statistical test employed, raw per-prompt scores, or dataset construction details (selection criteria, exclusion rules, or prompt sources). With only 10 prompts, these omissions make independent verification of significance and generalizability impossible.
- [Methods] Methods (prompt enrichment procedure): The translation from valence-arousal values into structured prompt text is described at a high level but lacks concrete examples of the enrichment templates, role-aware components, or how arousal/valence targets are chosen for each prompt. This step is load-bearing for the claimed steering effect yet cannot be reproduced or stress-tested from the given description.
- [Ablation studies] Ablation studies: The paper states that ablations support the results, but provides no quantitative breakdown of which components (e.g., valence vs. arousal descriptors, role awareness) were removed and their individual impact on the reported error reductions or CLIPScore.
minor comments (2)
- The abstract and text repeatedly use 'statistically significant' without defining the threshold or test; this should be clarified for precision.
- No mention of how CLIPScore was computed (model variant, reference text) or whether valence alignment was measured via the same psychological scales used for prompting.
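The CLIPScore detail matters because the number depends on which embeddings go in. The sketch below uses the common definition of CLIPScore as a rescaled, clipped cosine similarity between an image embedding and a text embedding, with a weight of 2.5. Which CLIP variant and reference text the paper used is unknown, so the embeddings here are placeholder vectors and the function names are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    if norm == 0:
        raise ValueError("zero-length embedding")
    return dot / norm

def clip_score(image_emb, text_emb, w=2.5):
    """CLIPScore as w * max(cos(image, text), 0)."""
    return w * max(cosine(image_emb, text_emb), 0.0)

# Placeholder vectors standing in for real CLIP embeddings.
image_emb = [0.2, 0.7, 0.1]
original_prompt_emb = [0.25, 0.65, 0.05]
enriched_prompt_emb = [0.1, 0.6, 0.4]

print(clip_score(image_emb, original_prompt_emb))
print(clip_score(image_emb, enriched_prompt_emb))
```

Scoring an image against the original prompt and against the enriched prompt can give different values, as the two calls above show. That is why the reference text should be stated.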
Simulated Author's Rebuttal
We thank the referee for their thorough review and constructive comments on our manuscript. We address each of the major comments below, indicating the revisions we will make to improve the paper's clarity, reproducibility, and completeness.
- Referee: [Experimental Results] Experimental Results (abstract and main evaluation): The central claim of statistically significant arousal error reductions (14% and 12%) is presented without error bars, exact p-values, the statistical test employed, raw per-prompt scores, or dataset construction details (selection criteria, exclusion rules, or prompt sources). With only 10 prompts, these omissions make independent verification of significance and generalizability impossible.
Authors: We agree that these details are necessary for verification and reproducibility. In the revised manuscript we will add error bars to all reported metrics, specify the statistical test used along with exact p-values, include the full set of raw per-prompt scores in an appendix, and provide complete information on prompt sources, selection criteria, and any exclusion rules.
Revision: yes
- Referee: [Methods] Methods (prompt enrichment procedure): The translation from valence-arousal values into structured prompt text is described at a high level but lacks concrete examples of the enrichment templates, role-aware components, or how arousal/valence targets are chosen for each prompt. This step is load-bearing for the claimed steering effect yet cannot be reproduced or stress-tested from the given description.
Authors: We acknowledge that concrete examples are required for reproducibility. The revised manuscript will include explicit examples of the enrichment templates, showing the role-aware components and the procedure used to derive target valence-arousal values from each original prompt.
Revision: yes
- Referee: [Ablation studies] Ablation studies: The paper states that ablations support the results, but provides no quantitative breakdown of which components (e.g., valence vs. arousal descriptors, role awareness) were removed and their individual impact on the reported error reductions or CLIPScore.
Authors: We will expand the ablation section with quantitative results, including tables that isolate the contribution of each component (valence descriptors, arousal descriptors, and role awareness) to the observed changes in arousal error, valence alignment, and CLIPScore.
Revision: yes
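A full ablation over the three components named above would need one row per on/off combination. The sketch below enumerates such a grid. It is only an illustration: the component names are taken from the referee comment, while `ablation_configs` and the table layout are hypothetical and do not describe what the authors ran.

```python
from itertools import product

COMPONENTS = ("valence_descriptors", "arousal_descriptors", "role_awareness")

def ablation_configs(components=COMPONENTS):
    """Return every on/off combination of the components as a list of dicts."""
    return [
        dict(zip(components, flags))
        for flags in product((True, False), repeat=len(components))
    ]

configs = ablation_configs()
for config in configs:
    enabled = [name for name, on in config.items() if on]
    print(", ".join(enabled) if enabled else "none (plain prompt)")
```

Three components give eight configurations, from the full method down to the plain prompt. Reporting arousal error, valence alignment, and CLIPScore for each row would isolate the contribution of each component.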
Circularity Check
No circularity in derivation chain
The paper describes a training-free prompting enrichment method grounded in valence-arousal psychology and evaluates it empirically on a 10-prompt benchmark against baselines. No equations, derivations, fitted parameters, or self-referential definitions appear in the provided text. Claims rest on external measurements (arousal error, CLIPScore) rather than any reduction to inputs by construction. No load-bearing self-citations or uniqueness theorems are invoked. This is a standard empirical method paper with independent experimental support.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: The valence-arousal model from psychology accurately represents the emotional intent expressible in natural language prompts.
- domain assumption: Prompt-level enrichment alone can steer the generative process without model modification or retraining.