pith. machine review for the scientific record.

arxiv: 2605.02521 · v2 · submitted 2026-05-04 · 💻 cs.CV

Recognition: unknown

MooD: Perception-Enhanced Efficient Affective Image Editing via Continuous Valence-Arousal Modeling

Authors on Pith: no claims yet

Pith reviewed 2026-05-14 20:53 UTC · model grok-4.3

classification 💻 cs.CV
keywords affective image editing · valence-arousal modeling · continuous emotion control · image editing framework · perception-enhanced guidance · AffectSet dataset

The pith

MooD uses continuous valence-arousal values to guide fine-grained and efficient affective image editing.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces MooD as a framework for affective image editing that takes continuous valence-arousal values directly as instructions to modify images and evoke targeted emotions. Earlier approaches depended on discrete emotion labels, which restricted nuanced control and slowed inference in practical settings. MooD adds a VA-Aware retrieval step to link affective inputs to visual details, then applies visual transfer and perception-enhanced guidance for the edits. The authors also release AffectSet, a new VA-annotated dataset spanning social and natural scenes, to train and test the system. If successful, the approach would make emotion-driven editing more precise, expressive, and fast enough for interactive social applications.
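
Read as a pipeline, that description implies three stages: map the continuous VA input to concrete visual semantics, transfer those semantics onto the source image, and keep the edit perceptually grounded. Below is a minimal Python sketch of that flow, assuming a VA-keyed exemplar bank and treating the editor itself as a black box; Exemplar, BANK, retrieve_exemplar, and edit_image are illustrative names, not the paper's API.

```python
# Minimal sketch of a VA-driven editing flow; names and the toy exemplar bank
# are hypothetical stand-ins, not MooD's actual components or AffectSet data.
import math
from dataclasses import dataclass

@dataclass
class Exemplar:
    valence: float   # continuous value, e.g. in [-1, 1]
    arousal: float   # continuous value, e.g. in [-1, 1]
    semantics: str   # visual semantics attached to this region of the VA plane

BANK = [
    Exemplar(0.8, 0.6, "bright warm palette, open sky, soft lighting"),
    Exemplar(-0.7, 0.7, "harsh contrast, storm clouds, desaturated tones"),
    Exemplar(-0.5, -0.6, "dim light, muted colors, empty space"),
    Exemplar(0.6, -0.4, "pastel colors, gentle bokeh, calm water"),
]

def retrieve_exemplar(valence: float, arousal: float) -> Exemplar:
    """VA-aware retrieval: nearest exemplar in the continuous VA plane."""
    return min(BANK, key=lambda e: math.hypot(e.valence - valence, e.arousal - arousal))

def edit_image(image, valence: float, arousal: float):
    """VA input -> retrieved semantics -> guided edit. A real system would
    condition a diffusion editor on the source image, the exemplar's visual
    features, and perception-enhanced guidance; here we only return the plan."""
    exemplar = retrieve_exemplar(valence, arousal)
    return {"source": image, "target_va": (valence, arousal), "guidance": exemplar.semantics}

if __name__ == "__main__":
    print(edit_image("photo.jpg", valence=0.7, arousal=-0.3))
```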

Core claim

MooD is the first framework that directly leverages continuous Valence-Arousal (VA) values as editing instructions for fine-grained and efficient affective image editing (AIE) in computational social systems, integrating a VA-Aware retrieval strategy with visual transfer and perception-enhanced semantic guidance to achieve controllable editing, while introducing AffectSet to cover diverse scenarios.

What carries the argument

VA-Aware retrieval strategy that connects continuous affective values to concrete visual semantics for the editing process.
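
The ablations vary a retrieval threshold τ (see Figure 7), which suggests retrieval is distance-gated in the VA plane. Here is a hedged sketch of one such rule, retrieving every exemplar within τ of the query and weighting by inverse VA distance; the gating and weighting choices are assumptions for illustration, not the authors' algorithm.

```python
# Illustrative threshold-gated VA retrieval. The threshold tau echoes the
# ablation in Figure 7, but this particular gating and weighting rule is an
# assumption, not the paper's method.
import math

def retrieve_within_tau(bank, valence, arousal, tau=0.35):
    """bank: list of dicts with 'valence', 'arousal', and 'semantics' keys.
    Returns (weight, exemplar) pairs for exemplars within tau of the query,
    weighted by inverse VA distance and normalized to sum to 1."""
    hits = []
    for e in bank:
        d = math.hypot(e["valence"] - valence, e["arousal"] - arousal)
        if d <= tau:
            hits.append((1.0 / (d + 1e-6), e))
    if not hits:  # nothing inside the threshold: fall back to the nearest exemplar
        nearest = min(bank, key=lambda e: math.hypot(e["valence"] - valence,
                                                     e["arousal"] - arousal))
        hits = [(1.0, nearest)]
    total = sum(w for w, _ in hits)
    return [(w / total, e) for w, e in hits]
```

A small τ keeps the retrieved semantics tightly matched to the requested emotion but risks empty retrievals in sparse VA regions; a large τ blends in looser matches, which is presumably the trade-off Figure 7 probes.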

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The continuous VA approach could support real-time sliders for emotion adjustment in photo or video apps.
  • AffectSet may become a standard benchmark for testing other VA-based vision models beyond editing.
  • The retrieval-plus-guidance pattern might transfer to related tasks such as style transfer conditioned on affect.

Load-bearing premise

That the VA-Aware retrieval strategy can reliably bridge continuous affective values to detailed visual semantics without introducing artifacts or losing controllability in diverse scenes.

What would settle it

Human ratings or automated VA-prediction error on a held-out subset of AffectSet natural scenes, comparing MooD edits against discrete-emotion baselines for both affective match accuracy and visible artifacts.
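
A sketch of the automated half of that test: run both editors over held-out (image, target-VA) pairs and score each by how far an independent VA predictor places its edits from the requested values. The editor and VA-predictor callables are placeholders the sketch assumes exist.

```python
# Hedged sketch of the automated half of this test. `editor` and `va_predictor`
# are placeholder callables assumed to exist: an editor maps (image, target VA)
# to an edited image, and the predictor maps an image back to estimated VA.
import math

def mean_va_error(editor, va_predictor, held_out):
    """held_out: iterable of (image, target_valence, target_arousal) triples."""
    errors = []
    for image, tv, ta in held_out:
        edited = editor(image, tv, ta)
        pv, pa = va_predictor(edited)
        errors.append(math.hypot(pv - tv, pa - ta))  # Euclidean error in VA space
    return sum(errors) / len(errors)

def compare_editors(mood_editor, baseline_editor, va_predictor, held_out):
    """Lower is better; a clear gap on AffectSet natural scenes would settle the
    affective-match half of the question (artifacts still need human ratings)."""
    held_out = list(held_out)
    return {
        "mood": mean_va_error(mood_editor, va_predictor, held_out),
        "discrete_baseline": mean_va_error(baseline_editor, va_predictor, held_out),
    }
```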

Figures

Figures reproduced from arXiv: 2605.02521 by Hao Wang, Junxiao Xue, Meicong Si, Shi Chen, Tingqi Hu, Xinyi Yin, Xuecheng Wu, Yiduo Wang, Yunyun Shi.

Figure 1: The comparisons between (a) prompt-based AIE methods (e.g., ControlNet [8]) requiring lengthy descriptive inputs and (b) our MooD framework, which takes pure VA values as direct input for continuous AIE.

Figure 3: The overall illustration of our MooD framework.

Figure 4: Qualitative Comparison with state-of-the-art methods. Our MooD outperforms existing approaches in both affective …

Figure 5: The visualization of smooth affective transitions achieved by our MooD in the VA space. Each row (left to right) reflects …

Figure 6: The qualitative ablation results of Key Components in our MooD, demonstrating that both VA-Aware retrieval strategy …

Figure 7: Effect of τ on our VA-Aware retrieval strategy.
Original abstract

Affective Image Editing (AIE) aims to modify visual content to evoke targeted emotions. Although current approaches achieve impressive editing quality, they often overlook inference efficiency, which limits their applicability in computational social scenarios. Moreover, most methods depend on discrete emotion representations, which hinder the continuous modeling of complex human emotions and constrain expressive capabilities in interactive scenarios. To tackle these gaps, we propose MooD, the first framework that directly leverages continuous Valence-Arousal (VA) values as editing instruction for fine-grained and efficient AIE in computational social systems. Specifically, we first introduce a VA-Aware retrieval strategy to bridge vague affective values and detailed visual semantics. Building upon this, MooD integrates visual transfer and perception-enhanced semantic guidance to achieve controllable AIE. Furthermore, considering that existing VA-annotated datasets mainly focus on social scenarios and largely overlook natural scenes, we therefore construct AffectSet, a comprehensive VA-annotated dataset covering diverse scenarios, to support model optimization and evaluation. Extensive qualitative and quantitative experimental results demonstrate that our MooD achieves superior performance in both affective controllability and visual fidelity while maintaining high efficiency. A series of ablation studies further reveal the crucial factors of our design.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes MooD, the first framework for affective image editing that directly uses continuous valence-arousal (VA) values as editing instructions. It introduces a VA-Aware retrieval strategy to bridge affective values with visual semantics, combines this with visual transfer and perception-enhanced semantic guidance for controllable editing, and constructs the AffectSet dataset covering diverse (including natural) scenes to address limitations in prior VA-annotated data. The authors claim superior affective controllability, visual fidelity, and inference efficiency over existing methods, supported by qualitative/quantitative experiments and ablation studies.

Significance. If the central claims hold, MooD would advance affective image editing by enabling fine-grained continuous control over emotions without discrete categories, while improving efficiency for computational social applications. The AffectSet dataset could serve as a useful resource for VA modeling in underrepresented scenes.

major comments (2)
  1. [VA-Aware retrieval strategy (and associated experiments)] The central claim of superior controllability and fidelity rests on the VA-Aware retrieval strategy successfully mapping continuous VA inputs to detailed semantics without artifacts. The manuscript provides no quantitative retrieval metrics (e.g., interpolation error, semantic consistency, or accuracy across the VA continuum) or generalization tests for out-of-distribution VA values or underrepresented scenes, leaving the bridging assumption unverified and load-bearing for the efficiency/controllability advantages.
  2. [Experimental results] §4 (experimental results): The abstract states that quantitative results demonstrate superiority, yet no specific metrics, baseline comparisons, error bars, or implementation details (e.g., retrieval implementation, training hyperparameters) appear in the provided text. This prevents assessment of whether reported gains in controllability/fidelity are statistically meaningful or artifact-free.
minor comments (2)
  1. [Method] Clarify the exact retrieval mechanism (nearest-neighbor, embedding similarity, etc.) and any learned components in the VA-Aware strategy to allow reproducibility.
  2. [Discussion or Conclusion] Add explicit discussion of limitations, including potential failure modes for extreme VA values or complex natural scenes not well-represented in AffectSet.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and will incorporate revisions to strengthen the presentation of the VA-Aware retrieval strategy and the experimental results.

Point-by-point responses
  1. Referee: [VA-Aware retrieval strategy (and associated experiments)] The central claim of superior controllability and fidelity rests on the VA-Aware retrieval strategy successfully mapping continuous VA inputs to detailed semantics without artifacts. The manuscript provides no quantitative retrieval metrics (e.g., interpolation error, semantic consistency, or accuracy across the VA continuum) or generalization tests for out-of-distribution VA values or underrepresented scenes, leaving the bridging assumption unverified and load-bearing for the efficiency/controllability advantages.

    Authors: We agree that direct quantitative validation of the VA-Aware retrieval would make the central claims more robust. While the end-to-end affective editing results (controllability and fidelity) provide indirect support for the retrieval mapping, we will add explicit quantitative retrieval metrics in the revised manuscript. These will include interpolation error across the continuous VA space, semantic consistency via CLIP-based feature similarity between retrieved and target semantics, retrieval accuracy stratified by VA regions, and generalization tests on out-of-distribution VA values as well as underrepresented natural scenes from AffectSet. These results will be reported in an expanded experimental section or dedicated subsection. revision: yes

  2. Referee: [Experimental results] §4 (experimental results): The abstract states that quantitative results demonstrate superiority, yet no specific metrics, baseline comparisons, error bars, or implementation details (e.g., retrieval implementation, training hyperparameters) appear in the provided text. This prevents assessment of whether reported gains in controllability/fidelity are statistically meaningful or artifact-free.

    Authors: We apologize for any lack of clarity in the text provided to the referee. The full manuscript contains quantitative comparisons, but to address this concern directly we will expand §4 with explicit numerical metrics (affective controllability scores, FID/perceptual similarity for fidelity, and FPS for efficiency), side-by-side baseline comparisons against prior AIE methods, error bars computed over multiple random seeds, and full implementation details including the VA-aware retrieval procedure, training hyperparameters, optimizer settings, and dataset usage. This will enable readers to fully assess statistical significance and reproducibility. revision: yes
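
Following up on the semantic-consistency metric promised in response 1 above, here is a hedged sketch of how such a CLIP-based score could be computed, using the public openai/clip-vit-base-patch32 checkpoint from HuggingFace transformers as an assumed stand-in for whatever encoder the authors adopt; the retrieved semantics are passed as text and compared to the edited image by cosine similarity.

```python
# Hedged sketch: CLIP-based semantic consistency between an edited image and
# the semantics retrieved for the target VA value. The checkpoint choice and
# the cosine-similarity scoring are assumptions, not the paper's protocol.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

MODEL_ID = "openai/clip-vit-base-patch32"
model = CLIPModel.from_pretrained(MODEL_ID).eval()
processor = CLIPProcessor.from_pretrained(MODEL_ID)

def semantic_consistency(edited_image_path: str, retrieved_semantics: str) -> float:
    """Cosine similarity between CLIP embeddings of the edited image and the
    text describing the semantics retrieved for the requested VA value."""
    image = Image.open(edited_image_path).convert("RGB")
    inputs = processor(text=[retrieved_semantics], images=image,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        img = model.get_image_features(pixel_values=inputs["pixel_values"])
        txt = model.get_text_features(input_ids=inputs["input_ids"],
                                      attention_mask=inputs["attention_mask"])
    img = img / img.norm(dim=-1, keepdim=True)
    txt = txt / txt.norm(dim=-1, keepdim=True)
    return float((img @ txt.T).item())
```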

Circularity Check

0 steps flagged

No significant circularity detected; claims rest on independent strategy and dataset

Full rationale

The paper introduces MooD as a new framework leveraging continuous VA values via a VA-Aware retrieval strategy and constructs AffectSet as a new dataset to address prior limitations. No equations, derivations, or fitted parameters are shown that reduce by construction to inputs, and no self-citations are invoked as load-bearing for uniqueness theorems or ansatzes. Performance claims are supported by described experiments rather than self-referential fitting, making the derivation chain self-contained against external benchmarks with no identifiable circular steps.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review yields no explicit free parameters, axioms, or invented entities; the VA-aware retrieval and perception-enhanced guidance are presented as novel components without stated assumptions or fitted constants.

pith-pipeline@v0.9.0 · 5537 in / 1005 out tokens · 32314 ms · 2026-05-14T20:53:35.476495+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

53 extracted references · 53 canonical work pages · 4 internal anchors

  1. [1] X. Wu, H. Sun, Y. Wang, J. Nie, J. Zhang, Y. Wang, J. Xue, and L. He, "Avf-mae++: Scaling affective video facial masked autoencoders via efficient audio-visual self-supervised learning," in Proceedings of the Computer Vision and Pattern Recognition Conference, 2025, pp. 9142–.
  2. [2] X. Wu, H. Sun, J. Xue, J. Nie, X. Kong, R. Zhai, D. Huang, and L. He, "Towards emotion analysis in short-form videos: A large-scale dataset and baseline," in Proceedings of the 2025 International Conference on Multimedia Retrieval, 2025, pp. 1497–1506.
  3. [3] Q. Wei, Y. Zhou, S. Xiang, L. Xiao, and Y. Zhang, "Meas: Multimodal emotion analysis system for short videos on social media platforms," IEEE Transactions on Computational Social Systems, vol. 12, no. 5, pp. 2398–2410, 2025.
  4. [4] Z. Fang, Z. Liu, T. Liu, and C.-C. Hung, "Fine-grained emotion comprehension: Semisupervised multimodal emotion and intensity recognition," IEEE Transactions on Computational Social Systems, vol. 12, no. 3, pp. 1145–1163, 2025.
  5. [5] D. Yang, M. Li, X. Wu, Z. Chen, K. Jiang, K. Liu, P. Zhai, and L. Zhang, "Improving multimodal sentiment analysis via modality optimization and dynamic primary modality selection," arXiv preprint arXiv:2511.06328, 2025.
  6. [6] D. Yang, K. Yang, H. Kuang, Z. Chen, Y. Wang, and L. Zhang, "Towards context-aware emotion recognition debiasing from a causal demystification perspective via de-confounded training," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 46, no. 12, pp. 10663–10680, 2024.
  7. [7] Y. Zhi, J. Li, H. Wang, and J. Chen, "A multimodal sentiment analysis approach based on multiview cross-modal fusion," IEEE Transactions on Computational Social Systems, vol. 13, no. 1, pp. 136–151, 2026.
  8. [8] L. Zhang, A. Rao, and M. Agrawala, "Adding conditional control to text-to-image diffusion models," in 2023 IEEE/CVF International Conference on Computer Vision (ICCV), 2023, pp. 3813–3824.
  9. [9] D. Liu, Y. Jiang, M. Pei, and S. Liu, "Emotional image color transfer via deep learning," Pattern Recognition Letters, vol. 110, pp. 16–22, 2018.
  10. [10] S. Liu and M. Pei, "Texture-aware emotional color transfer between images," IEEE Access, vol. 6, pp. 31375–31386, 2018.
  11. [11] S. Weng, P. Zhang, Z. Chang, X. Wang, S. Li, and B. Shi, "Affective image filter: Reflecting emotions from text to images," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 10810–10819.
  12. [12] T.-J. Fu, X. E. Wang, and W. Y. Wang, "Language-driven artistic style transfer," in European Conference on Computer Vision. Springer, 2022, pp. 717–734.
  13. [13] D. Kang, F. Tian, and S. Seo, "Perceptually inspired real-time artistic style transfer for video stream," Journal of Real-Time Image Processing, vol. 13, no. 3, pp. 581–589, 2017.
  14. [14] R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer, "High-resolution image synthesis with latent diffusion models," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 10684–10695.
  15. [15] D. Podell, Z. English, K. Lacey, A. Blattmann, T. Dockhorn, J. Müller, J. Penna, and R. Rombach, "Sdxl: Improving latent diffusion models for high-resolution image synthesis," 2023.
  16. [16] J. Yang, J. Feng, W. Luo, D. Lischinski, D. Cohen-Or, and H. Huang, "Emoedit: Evoking emotions through image manipulation," in Proceedings of the Computer Vision and Pattern Recognition Conference, 2025, pp. 24690–24699.
  17. [17] J. Zhang and B. Fan, "Emokgedit: Training-free affective injection via visual cue transformation," arXiv preprint arXiv:2601.12326, 2026.
  18. [18] Q. Lin, J. Zhang, Y.-S. Ong, and M. Zhang, "Make me happier: Evoking emotions through image diffusion models," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2025, pp. 16367–16376.
  19. [19] J. Xue, Q. Deng, X. Wu, K. Yao, X. Yin, F. Yu, W. Zhou, Y. Zhong, Y. Liu, and D. Yang, "Towards comprehensive interactive change understanding in remote sensing: A large-scale dataset and dual-granularity enhanced vlm," IEEE Transactions on Geoscience and Remote Sensing.
  20. [20] X. Wu, J. Liu, D. Huang, X. Li, Y. Wang, C. Chen, L. Ma, X. Cao, and J. Xue, "Vic-bench: Benchmarking visual-interleaved chain-of-thought capability in mllms with free-style intermediate state representations," arXiv preprint arXiv:2505.14404, 2025.
  21. [21] Z. Zhang, X. Wu, D. Huang, S. Yan, C. Peng, and X. Cao, "Hkd4vlm: A progressive hybrid knowledge distillation framework for robust multimodal hallucination and factuality detection in vlms," in Proceedings of the 33rd ACM International Conference on Multimedia, 2025, pp. 13881–13887.
  22. [22] Q. Mao, H. Hu, Y. He, D. Gao, H. Chen, and L. Jin, "Emoagent: A multi-agent framework for diverse affective image manipulation," IEEE Transactions on Affective Computing, pp. 1–18, 2026.
  23. [23] J. Luo, X. Gu, J. Wang, and J. Lu, "Towards llm-centric affective visual customization via efficient and precise emotion manipulating." Association for Computing Machinery, 2026, pp. 1696–1704.
  24. [24] J. A. Russell, "A circumplex model of affect," Journal of Personality and Social Psychology, vol. 39, no. 6, p. 1161, 1980.
  25. [25] B. Yang, S. Gu, B. Zhang, T. Zhang, X. Chen, X. Sun, D. Chen, and F. Wen, "Paint by example: Exemplar-based image editing with diffusion models," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2023, pp. 18381–18391.
  26. [26] S. Dang, Y. He, L. Ling, Z. Qian, N. Zhao, and N. Cao, "Emoticrafter: Text-to-emotional-image generation based on valence-arousal model," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2025, pp. 15218–15228.
  27. [27] J. Yang, Q. Huang, T. Ding, D. Lischinski, D. Cohen-Or, and H. Huang, "Emoset: A large-scale visual emotion dataset with rich attributes," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 20383–20394.
  28. [28] S. Bai, K. Chen, X. Liu, J. Wang, W. Ge, S. Song, K. Dang, P. Wang, S. Wang, J. Tang, H. Zhong, Y. Zhu, M. Yang, Z. Li, J. Wan, P. Wang, W. Ding, Z. Fu, Y. Xu, J. Ye, X. Zhang, T. Xie, Z. Cheng, H. Zhang, Z. Yang, H. Xu, and J. Lin, "Qwen2.5-vl technical report," arXiv preprint arXiv:2502.13923, 2025.
  29. [29] J. A. Mikels, B. L. Fredrickson, G. R. Larkin, C. M. Lindberg, S. J. Maglio, and P. A. Reuter-Lorenz, "Emotional category data on images from the international affective picture system," Behavior Research Methods, vol. 37, no. 4, pp. 626–630, 2005.
  30. [30] R. Plutchik, "A general psychoevolutionary theory of emotion," in Theories of Emotion, R. Plutchik and H. Kellerman, Eds. Academic Press, 1980, pp. 3–33.
  31. [31] H. Schlosberg, "Three dimensions of emotion," Psychological Review, vol. 61, no. 2, p. 81, 1954.
  32. [32] L.-C. Ou, M. R. Luo, A. Woodcock, and A. Wright, "A study of colour emotion and colour preference. Part I: Colour emotions for single colours," Color Research & Application, vol. 29, no. 3, pp. 232–240.
  33. [33] J. Machajdik and A. Hanbury, "Affective image classification using features inspired by psychology and art theory," in Proceedings of the 18th ACM International Conference on Multimedia, 2010, pp. 83–92.
  34. [34] X. Wu, D. Yang, D. Huang, X. Yin, Y. Wang, J. Zhang, J. Nie, L. Fu, Y. Liu, J. Xue et al., "emotions: A large-scale dataset and audio-visual fusion network for emotion analysis in short-form videos," arXiv preprint arXiv:2508.06902, 2025.
  35. [35] R. Kosti, J. M. Alvarez, A. Recasens, and A. Lapedriza, "Emotion recognition in context," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 1667–1675.
  36. [36] P. A. Kragel, M. C. Reddan, K. S. LaBar, and T. D. Wager, "Emotion schemas are embedded in the human visual system," Science Advances, vol. 5, no. 7, p. eaaw4358, 2019.
  37. [37] A. Hertz, R. Mokady, J. Tenenbaum, K. Aberman, Y. Pritch, and D. Cohen-Or, "Prompt-to-prompt image editing with cross attention control," 2022.
  38. [38] T. Brooks, A. Holynski, and A. A. Efros, "Instructpix2pix: Learning to follow image editing instructions," in 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023, pp. 18392–18402.
  39. [39] H. Ye, J. Zhang, S. Liu, X. Han, and W. Yang, "Ip-adapter: Text compatible image prompt adapter for text-to-image diffusion models."
  40. [40] S. Sun, J. Jia, H. Wu, Z. Ye, and J. Xing, "Msnet: A deep architecture using multi-sentiment semantics for sentiment-aware image style transfer," in ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2023, pp. 1–5.
  41. [41] J. Yang, Z. Bai, and H. Huang, "Emostyle: Emotion-driven image stylization," 2025.
  42. [42] J. Ye and S. X. Huang, "Moodifier: Mllm-enhanced emotion-driven image editing," 2025.
  43. [43] L. Ji, C. Qi, and Q. Chen, "Instruction-based image editing with planning, reasoning, and generation," in 2025 IEEE/CVF International Conference on Computer Vision (ICCV), 2025, pp. 17506–17515.
  44. [44] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, G. Krueger, and I. Sutskever, "Learning transferable visual models from natural language supervision," in ICML, 2021.
  45. [45] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
  46. [46] C. Meng, Y. He, Y. Song, J. Song, J. Wu, J.-Y. Zhu, and S. Ermon, "Sdedit: Guided image synthesis and editing with stochastic differential equations," arXiv preprint arXiv:2108.01073, 2021.
  47. [47] G. Parmar, K. K. Singh, R. Zhang, Y. Li, J. Lu, and J.-Y. Zhu, "Zero-shot image-to-image translation," 2023.
  48. [48] S. Liu, Y. Han, P. Xing, F. Yin, R. Wang, W. Cheng, J. Liao, Y. Wang, H. Fu, C. Han, G. Li, Y. Peng, Q. Sun, J. Wu, Y. Cai, Z. Ge, R. Ming, L. Xia, X. Zeng, Y. Zhu, B. Jiao, X. Zhang, G. Yu, and D. Jiang, "Step1x-edit: A practical framework for general image editing," arXiv preprint arXiv:2504.17761, 2025.
  49. [49] B. Xia, B. Peng, Y. Zhang, J. Huang, J. Liu, J. Li, H. Tan, S. Wu, C. Wang, Y. Wang, X. Wu, B. Yu, and J. Jia, "Dreamomni2: Multimodal instruction-based editing and generation," 2025.
  50. [50] J. Yang, J. Feng, and H. Huang, "Emogen: Emotional image content generation with text-to-image diffusion models," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 6358–6368.
  51. [51] W. G. Cochran, Sampling Techniques. John Wiley & Sons, 1977.
  52. [52] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Köpf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, and S. Chintala, PyTorch: An Imperative Style, High-Performance Deep Learning Library. Curran Associates Inc., 2019.
  53. [53] I. Loshchilov and F. Hutter, "Decoupled weight decay regularization," arXiv preprint arXiv:1711.05101, 2017.