pith. machine review for the scientific record.

arxiv: 2605.02521 · v2 · submitted 2026-05-04 · 💻 cs.CV

Recognition: unknown

MooD: Perception-Enhanced Efficient Affective Image Editing via Continuous Valence-Arousal Modeling

Authors on Pith: no claims yet

Pith reviewed 2026-05-14 20:53 UTC · model grok-4.3

classification 💻 cs.CV
keywords affective image editing · valence-arousal modeling · continuous emotion control · image editing framework · perception-enhanced guidance · AffectSet dataset

The pith

MooD uses continuous valence-arousal values to guide fine-grained and efficient affective image editing.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces MooD as a framework for affective image editing that takes continuous valence-arousal values directly as instructions to modify images and evoke targeted emotions. Earlier approaches depended on discrete emotion labels, which restricted nuanced control and slowed inference in practical settings. MooD adds a VA-Aware retrieval step to link affective inputs to visual details, then applies visual transfer and perception-enhanced guidance for the edits. The authors also release AffectSet, a new VA-annotated dataset spanning social and natural scenes, to train and test the system. If successful, the approach would make emotion-driven editing more precise, expressive, and fast enough for interactive social applications.
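
Read as a pipeline, that description implies three stages: map the continuous VA input to concrete visual semantics, transfer those semantics onto the source image, and keep the edit perceptually grounded. Below is a minimal Python sketch of that flow, assuming a VA-keyed exemplar bank and treating the editor itself as a black box; Exemplar, BANK, retrieve_exemplar, and edit_image are illustrative names, not the paper's API.

```python
# Minimal sketch of a VA-driven editing flow; names and the toy exemplar bank
# are hypothetical stand-ins, not MooD's actual components or AffectSet data.
import math
from dataclasses import dataclass

@dataclass
class Exemplar:
    valence: float   # continuous value, e.g. in [-1, 1]
    arousal: float   # continuous value, e.g. in [-1, 1]
    semantics: str   # visual semantics attached to this region of the VA plane

BANK = [
    Exemplar(0.8, 0.6, "bright warm palette, open sky, soft lighting"),
    Exemplar(-0.7, 0.7, "harsh contrast, storm clouds, desaturated tones"),
    Exemplar(-0.5, -0.6, "dim light, muted colors, empty space"),
    Exemplar(0.6, -0.4, "pastel colors, gentle bokeh, calm water"),
]

def retrieve_exemplar(valence: float, arousal: float) -> Exemplar:
    """VA-aware retrieval: nearest exemplar in the continuous VA plane."""
    return min(BANK, key=lambda e: math.hypot(e.valence - valence, e.arousal - arousal))

def edit_image(image, valence: float, arousal: float):
    """VA input -> retrieved semantics -> guided edit. A real system would
    condition a diffusion editor on the source image, the exemplar's visual
    features, and perception-enhanced guidance; here we only return the plan."""
    exemplar = retrieve_exemplar(valence, arousal)
    return {"source": image, "target_va": (valence, arousal), "guidance": exemplar.semantics}

if __name__ == "__main__":
    print(edit_image("photo.jpg", valence=0.7, arousal=-0.3))
```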

Core claim

MooD is the first framework that directly leverages continuous Valence-Arousal (VA) values as editing instructions for fine-grained and efficient affective image editing (AIE) in computational social systems, integrating a VA-Aware retrieval strategy with visual transfer and perception-enhanced semantic guidance to achieve controllable editing, while introducing AffectSet to cover diverse scenarios.

What carries the argument

VA-Aware retrieval strategy that connects continuous affective values to concrete visual semantics for the editing process.
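
The ablations vary a retrieval threshold τ (see Figure 7), which suggests retrieval is distance-gated in the VA plane. Here is a hedged sketch of one such rule, retrieving every exemplar within τ of the query and weighting by inverse VA distance; the gating and weighting choices are assumptions for illustration, not the authors' algorithm.

```python
# Illustrative threshold-gated VA retrieval. The threshold tau echoes the
# ablation in Figure 7, but this particular gating and weighting rule is an
# assumption, not the paper's method.
import math

def retrieve_within_tau(bank, valence, arousal, tau=0.35):
    """bank: list of dicts with 'valence', 'arousal', and 'semantics' keys.
    Returns (weight, exemplar) pairs for exemplars within tau of the query,
    weighted by inverse VA distance and normalized to sum to 1."""
    hits = []
    for e in bank:
        d = math.hypot(e["valence"] - valence, e["arousal"] - arousal)
        if d <= tau:
            hits.append((1.0 / (d + 1e-6), e))
    if not hits:  # nothing inside the threshold: fall back to the nearest exemplar
        nearest = min(bank, key=lambda e: math.hypot(e["valence"] - valence,
                                                     e["arousal"] - arousal))
        hits = [(1.0, nearest)]
    total = sum(w for w, _ in hits)
    return [(w / total, e) for w, e in hits]
```

A small τ keeps the retrieved semantics tightly matched to the requested emotion but risks empty retrievals in sparse VA regions; a large τ blends in looser matches, which is presumably the trade-off Figure 7 probes.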

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The continuous VA approach could support real-time sliders for emotion adjustment in photo or video apps.
  • AffectSet may become a standard benchmark for testing other VA-based vision models beyond editing.
  • The retrieval-plus-guidance pattern might transfer to related tasks such as style transfer conditioned on affect.

Load-bearing premise

That the VA-Aware retrieval strategy can reliably bridge continuous affective values to detailed visual semantics without introducing artifacts or losing controllability in diverse scenes.

What would settle it

Human ratings or automated VA-prediction error on a held-out subset of AffectSet natural scenes, comparing MooD edits against discrete-emotion baselines for both affective match accuracy and visible artifacts.
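
A sketch of the automated half of that test: run both editors over held-out (image, target-VA) pairs and score each by how far an independent VA predictor places its edits from the requested values. The editor and VA-predictor callables are placeholders the sketch assumes exist.

```python
# Hedged sketch of the automated half of this test. `editor` and `va_predictor`
# are placeholder callables assumed to exist: an editor maps (image, target VA)
# to an edited image, and the predictor maps an image back to estimated VA.
import math

def mean_va_error(editor, va_predictor, held_out):
    """held_out: iterable of (image, target_valence, target_arousal) triples."""
    errors = []
    for image, tv, ta in held_out:
        edited = editor(image, tv, ta)
        pv, pa = va_predictor(edited)
        errors.append(math.hypot(pv - tv, pa - ta))  # Euclidean error in VA space
    return sum(errors) / len(errors)

def compare_editors(mood_editor, baseline_editor, va_predictor, held_out):
    """Lower is better; a clear gap on AffectSet natural scenes would settle the
    affective-match half of the question (artifacts still need human ratings)."""
    held_out = list(held_out)
    return {
        "mood": mean_va_error(mood_editor, va_predictor, held_out),
        "discrete_baseline": mean_va_error(baseline_editor, va_predictor, held_out),
    }
```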

Figures

Figures reproduced from arXiv: 2605.02521 by Hao Wang, Junxiao Xue, Meicong Si, Shi Chen, Tingqi Hu, Xinyi Yin, Xuecheng Wu, Yiduo Wang, Yunyun Shi.

Figure 1: The comparisons between (a) prompt-based AIE methods (e.g., ControlNet [8]) requiring lengthy descriptive inputs and (b) our MooD framework, which takes pure VA values as direct input for continuous AIE.

Figure 3: The overall illustration of our MooD framework.

Figure 4: Qualitative Comparison with state-of-the-art methods. Our MooD outperforms existing approaches in both affective …

Figure 5: The visualization of smooth affective transitions achieved by our MooD in the VA space. Each row (left to right) reflects …

Figure 6: The qualitative ablation results of Key Components in our MooD, demonstrating that both VA-Aware retrieval strategy …

Figure 7: Effect of τ on our VA-Aware retrieval strategy.
Original abstract

Affective Image Editing (AIE) aims to modify visual content to evoke targeted emotions. Although current approaches achieve impressive editing quality, they often overlook inference efficiency, which limits their applicability in computational social scenarios. Moreover, most methods depend on discrete emotion representations, which hinder the continuous modeling of complex human emotions and constrain expressive capabilities in interactive scenarios. To tackle these gaps, we propose MooD, the first framework that directly leverages continuous Valence-Arousal (VA) values as editing instruction for fine-grained and efficient AIE in computational social systems. Specifically, we first introduce a VA-Aware retrieval strategy to bridge vague affective values and detailed visual semantics. Building upon this, MooD integrates visual transfer and perception-enhanced semantic guidance to achieve controllable AIE. Furthermore, considering that existing VA-annotated datasets mainly focus on social scenarios and largely overlook natural scenes, we therefore construct AffectSet, a comprehensive VA-annotated dataset covering diverse scenarios, to support model optimization and evaluation. Extensive qualitative and quantitative experimental results demonstrate that our MooD achieves superior performance in both affective controllability and visual fidelity while maintaining high efficiency. A series of ablation studies further reveal the crucial factors of our design.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes MooD, the first framework for affective image editing that directly uses continuous valence-arousal (VA) values as editing instructions. It introduces a VA-Aware retrieval strategy to bridge affective values with visual semantics, combines this with visual transfer and perception-enhanced semantic guidance for controllable editing, and constructs the AffectSet dataset covering diverse (including natural) scenes to address limitations in prior VA-annotated data. The authors claim superior affective controllability, visual fidelity, and inference efficiency over existing methods, supported by qualitative/quantitative experiments and ablation studies.

Significance. If the central claims hold, MooD would advance affective image editing by enabling fine-grained continuous control over emotions without discrete categories, while improving efficiency for computational social applications. The AffectSet dataset could serve as a useful resource for VA modeling in underrepresented scenes.

major comments (2)
  1. [VA-Aware retrieval strategy (and associated experiments)] The central claim of superior controllability and fidelity rests on the VA-Aware retrieval strategy successfully mapping continuous VA inputs to detailed semantics without artifacts. The manuscript provides no quantitative retrieval metrics (e.g., interpolation error, semantic consistency, or accuracy across the VA continuum) or generalization tests for out-of-distribution VA values or underrepresented scenes, leaving the bridging assumption unverified and load-bearing for the efficiency/controllability advantages.
  2. [Experimental results] §4 (experimental results): The abstract states that quantitative results demonstrate superiority, yet no specific metrics, baseline comparisons, error bars, or implementation details (e.g., retrieval implementation, training hyperparameters) appear in the provided text. This prevents assessment of whether reported gains in controllability/fidelity are statistically meaningful or artifact-free.
minor comments (2)
  1. [Method] Clarify the exact retrieval mechanism (nearest-neighbor, embedding similarity, etc.) and any learned components in the VA-Aware strategy to allow reproducibility.
  2. [Discussion or Conclusion] Add explicit discussion of limitations, including potential failure modes for extreme VA values or complex natural scenes not well-represented in AffectSet.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and will incorporate revisions to strengthen the presentation of the VA-Aware retrieval strategy and the experimental results.

Point-by-point responses
  1. Referee: [VA-Aware retrieval strategy (and associated experiments)] The central claim of superior controllability and fidelity rests on the VA-Aware retrieval strategy successfully mapping continuous VA inputs to detailed semantics without artifacts. The manuscript provides no quantitative retrieval metrics (e.g., interpolation error, semantic consistency, or accuracy across the VA continuum) or generalization tests for out-of-distribution VA values or underrepresented scenes, leaving the bridging assumption unverified and load-bearing for the efficiency/controllability advantages.

    Authors: We agree that direct quantitative validation of the VA-Aware retrieval would make the central claims more robust. While the end-to-end affective editing results (controllability and fidelity) provide indirect support for the retrieval mapping, we will add explicit quantitative retrieval metrics in the revised manuscript. These will include interpolation error across the continuous VA space, semantic consistency via CLIP-based feature similarity between retrieved and target semantics, retrieval accuracy stratified by VA regions, and generalization tests on out-of-distribution VA values as well as underrepresented natural scenes from AffectSet. These results will be reported in an expanded experimental section or dedicated subsection. revision: yes

  2. Referee: [Experimental results] §4 (experimental results): The abstract states that quantitative results demonstrate superiority, yet no specific metrics, baseline comparisons, error bars, or implementation details (e.g., retrieval implementation, training hyperparameters) appear in the provided text. This prevents assessment of whether reported gains in controllability/fidelity are statistically meaningful or artifact-free.

    Authors: We apologize for any lack of clarity in the text provided to the referee. The full manuscript contains quantitative comparisons, but to address this concern directly we will expand §4 with explicit numerical metrics (affective controllability scores, FID/perceptual similarity for fidelity, and FPS for efficiency), side-by-side baseline comparisons against prior AIE methods, error bars computed over multiple random seeds, and full implementation details including the VA-aware retrieval procedure, training hyperparameters, optimizer settings, and dataset usage. This will enable readers to fully assess statistical significance and reproducibility. revision: yes
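
Following up on the semantic-consistency metric promised in response 1 above, here is a hedged sketch of how such a CLIP-based score could be computed, using the public openai/clip-vit-base-patch32 checkpoint from HuggingFace transformers as an assumed stand-in for whatever encoder the authors adopt; the retrieved semantics are passed as text and compared to the edited image by cosine similarity.

```python
# Hedged sketch: CLIP-based semantic consistency between an edited image and
# the semantics retrieved for the target VA value. The checkpoint choice and
# the cosine-similarity scoring are assumptions, not the paper's protocol.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

MODEL_ID = "openai/clip-vit-base-patch32"
model = CLIPModel.from_pretrained(MODEL_ID).eval()
processor = CLIPProcessor.from_pretrained(MODEL_ID)

def semantic_consistency(edited_image_path: str, retrieved_semantics: str) -> float:
    """Cosine similarity between CLIP embeddings of the edited image and the
    text describing the semantics retrieved for the requested VA value."""
    image = Image.open(edited_image_path).convert("RGB")
    inputs = processor(text=[retrieved_semantics], images=image,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        img = model.get_image_features(pixel_values=inputs["pixel_values"])
        txt = model.get_text_features(input_ids=inputs["input_ids"],
                                      attention_mask=inputs["attention_mask"])
    img = img / img.norm(dim=-1, keepdim=True)
    txt = txt / txt.norm(dim=-1, keepdim=True)
    return float((img @ txt.T).item())
```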

Circularity Check

0 steps flagged

No significant circularity detected; claims rest on independent strategy and dataset

Full rationale

The paper introduces MooD as a new framework leveraging continuous VA values via a VA-Aware retrieval strategy and constructs AffectSet as a new dataset to address prior limitations. No equations, derivations, or fitted parameters are shown that reduce by construction to inputs, and no self-citations are invoked as load-bearing for uniqueness theorems or ansatzes. Performance claims are supported by described experiments rather than self-referential fitting, making the derivation chain self-contained against external benchmarks with no identifiable circular steps.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review yields no explicit free parameters, axioms, or invented entities; the VA-aware retrieval and perception-enhanced guidance are presented as novel components without stated assumptions or fitted constants.

pith-pipeline@v0.9.0 · 5537 in / 1005 out tokens · 32314 ms · 2026-05-14T20:53:35.476495+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

53 extracted references · 53 canonical work pages · 4 internal anchors

  1. [1] X. Wu, H. Sun, Y. Wang, J. Nie, J. Zhang, Y. Wang, J. Xue, and L. He, "Avf-mae++: Scaling affective video facial masked autoencoders via efficient audio-visual self-supervised learning," in Proceedings of the Computer Vision and Pattern Recognition Conference, 2025, pp. 9142–.
  2. [2] X. Wu, H. Sun, J. Xue, J. Nie, X. Kong, R. Zhai, D. Huang, and L. He, "Towards emotion analysis in short-form videos: A large-scale dataset and baseline," in Proceedings of the 2025 International Conference on Multimedia Retrieval, 2025, pp. 1497–1506.
  3. [3] Q. Wei, Y. Zhou, S. Xiang, L. Xiao, and Y. Zhang, "Meas: Multimodal emotion analysis system for short videos on social media platforms," IEEE Transactions on Computational Social Systems, vol. 12, no. 5, pp. 2398–2410, 2025.
  4. [4] Z. Fang, Z. Liu, T. Liu, and C.-C. Hung, "Fine-grained emotion comprehension: Semisupervised multimodal emotion and intensity recognition," IEEE Transactions on Computational Social Systems, vol. 12, no. 3, pp. 1145–1163, 2025.
  5. [5] D. Yang, M. Li, X. Wu, Z. Chen, K. Jiang, K. Liu, P. Zhai, and L. Zhang, "Improving multimodal sentiment analysis via modality optimization and dynamic primary modality selection," arXiv preprint arXiv:2511.06328, 2025.
  6. [6] D. Yang, K. Yang, H. Kuang, Z. Chen, Y. Wang, and L. Zhang, "Towards context-aware emotion recognition debiasing from a causal demystification perspective via de-confounded training," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 46, no. 12, pp. 10663–10680, 2024.
  7. [7] Y. Zhi, J. Li, H. Wang, and J. Chen, "A multimodal sentiment analysis approach based on multiview cross-modal fusion," IEEE Transactions on Computational Social Systems, vol. 13, no. 1, pp. 136–151, 2026.
  8. [8] L. Zhang, A. Rao, and M. Agrawala, "Adding conditional control to text-to-image diffusion models," in 2023 IEEE/CVF International Conference on Computer Vision (ICCV), 2023, pp. 3813–3824.
  9. [9] D. Liu, Y. Jiang, M. Pei, and S. Liu, "Emotional image color transfer via deep learning," Pattern Recognition Letters, vol. 110, pp. 16–22, 2018.
  10. [10] S. Liu and M. Pei, "Texture-aware emotional color transfer between images," IEEE Access, vol. 6, pp. 31375–31386, 2018.
  11. [11] S. Weng, P. Zhang, Z. Chang, X. Wang, S. Li, and B. Shi, "Affective image filter: Reflecting emotions from text to images," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 10810–10819.
  12. [12] T.-J. Fu, X. E. Wang, and W. Y. Wang, "Language-driven artistic style transfer," in European Conference on Computer Vision. Springer, 2022, pp. 717–734.
  13. [13] D. Kang, F. Tian, and S. Seo, "Perceptually inspired real-time artistic style transfer for video stream," Journal of Real-Time Image Processing, vol. 13, no. 3, pp. 581–589, 2017.
  14. [14] R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer, "High-resolution image synthesis with latent diffusion models," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 10684–10695.
  15. [15] D. Podell, Z. English, K. Lacey, A. Blattmann, T. Dockhorn, J. Müller, J. Penna, and R. Rombach, "Sdxl: Improving latent diffusion models for high-resolution image synthesis," 2023.
  16. [16] J. Yang, J. Feng, W. Luo, D. Lischinski, D. Cohen-Or, and H. Huang, "Emoedit: Evoking emotions through image manipulation," in Proceedings of the Computer Vision and Pattern Recognition Conference, 2025, pp. 24690–24699.
  17. [17] J. Zhang and B. Fan, "Emokgedit: Training-free affective injection via visual cue transformation," arXiv preprint arXiv:2601.12326, 2026.
  18. [18] Q. Lin, J. Zhang, Y.-S. Ong, and M. Zhang, "Make me happier: Evoking emotions through image diffusion models," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2025, pp. 16367–16376.
  19. [19] J. Xue, Q. Deng, X. Wu, K. Yao, X. Yin, F. Yu, W. Zhou, Y. Zhong, Y. Liu, and D. Yang, "Towards comprehensive interactive change understanding in remote sensing: A large-scale dataset and dual-granularity enhanced vlm," IEEE Transactions on Geoscience and Remote Sensing.
  20. [20] X. Wu, J. Liu, D. Huang, X. Li, Y. Wang, C. Chen, L. Ma, X. Cao, and J. Xue, "Vic-bench: Benchmarking visual-interleaved chain-of-thought capability in mllms with free-style intermediate state representations," arXiv preprint arXiv:2505.14404, 2025.
  21. [21] Z. Zhang, X. Wu, D. Huang, S. Yan, C. Peng, and X. Cao, "Hkd4vlm: A progressive hybrid knowledge distillation framework for robust multimodal hallucination and factuality detection in vlms," in Proceedings of the 33rd ACM International Conference on Multimedia, 2025, pp. 13881–13887.
  22. [22] Q. Mao, H. Hu, Y. He, D. Gao, H. Chen, and L. Jin, "Emoagent: A multi-agent framework for diverse affective image manipulation," IEEE Transactions on Affective Computing, pp. 1–18, 2026.
  23. [23] J. Luo, X. Gu, J. Wang, and J. Lu, "Towards llm-centric affective visual customization via efficient and precise emotion manipulating." Association for Computing Machinery, 2026, pp. 1696–1704.
  24. [24] J. A. Russell, "A circumplex model of affect," Journal of Personality and Social Psychology, vol. 39, no. 6, p. 1161, 1980.
  25. [25] B. Yang, S. Gu, B. Zhang, T. Zhang, X. Chen, X. Sun, D. Chen, and F. Wen, "Paint by example: Exemplar-based image editing with diffusion models," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2023, pp. 18381–18391.
  26. [26] S. Dang, Y. He, L. Ling, Z. Qian, N. Zhao, and N. Cao, "Emoticrafter: Text-to-emotional-image generation based on valence-arousal model," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2025, pp. 15218–15228.
  27. [27] J. Yang, Q. Huang, T. Ding, D. Lischinski, D. Cohen-Or, and H. Huang, "Emoset: A large-scale visual emotion dataset with rich attributes," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 20383–20394.
  28. [28] S. Bai, K. Chen, X. Liu, J. Wang, W. Ge, S. Song, K. Dang, P. Wang, S. Wang, J. Tang, H. Zhong, Y. Zhu, M. Yang, Z. Li, J. Wan, P. Wang, W. Ding, Z. Fu, Y. Xu, J. Ye, X. Zhang, T. Xie, Z. Cheng, H. Zhang, Z. Yang, H. Xu, and J. Lin, "Qwen2.5-vl technical report," arXiv preprint arXiv:2502.13923, 2025.
  29. [29] J. A. Mikels, B. L. Fredrickson, G. R. Larkin, C. M. Lindberg, S. J. Maglio, and P. A. Reuter-Lorenz, "Emotional category data on images from the international affective picture system," Behavior Research Methods, vol. 37, no. 4, pp. 626–630, 2005.
  30. [30] R. Plutchik, "A general psychoevolutionary theory of emotion," in Theories of Emotion, R. Plutchik and H. Kellerman, Eds. Academic Press, 1980, pp. 3–33.
  31. [31] H. Schlosberg, "Three dimensions of emotion," Psychological Review, vol. 61, no. 2, p. 81, 1954.
  32. [32] L.-C. Ou, M. R. Luo, A. Woodcock, and A. Wright, "A study of colour emotion and colour preference. Part I: Colour emotions for single colours," Color Research & Application, vol. 29, no. 3, pp. 232–240.
  33. [33] J. Machajdik and A. Hanbury, "Affective image classification using features inspired by psychology and art theory," in Proceedings of the 18th ACM International Conference on Multimedia, 2010, pp. 83–92.
  34. [34] X. Wu, D. Yang, D. Huang, X. Yin, Y. Wang, J. Zhang, J. Nie, L. Fu, Y. Liu, J. Xue et al., "emotions: A large-scale dataset and audio-visual fusion network for emotion analysis in short-form videos," arXiv preprint arXiv:2508.06902, 2025.
  35. [35] R. Kosti, J. M. Alvarez, A. Recasens, and A. Lapedriza, "Emotion recognition in context," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 1667–1675.
  36. [36] P. A. Kragel, M. C. Reddan, K. S. LaBar, and T. D. Wager, "Emotion schemas are embedded in the human visual system," Science Advances, vol. 5, no. 7, p. eaaw4358, 2019.
  37. [37] A. Hertz, R. Mokady, J. Tenenbaum, K. Aberman, Y. Pritch, and D. Cohen-Or, "Prompt-to-prompt image editing with cross attention control," 2022.
  38. [38] T. Brooks, A. Holynski, and A. A. Efros, "Instructpix2pix: Learning to follow image editing instructions," in 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023, pp. 18392–18402.
  39. [39] H. Ye, J. Zhang, S. Liu, X. Han, and W. Yang, "Ip-adapter: Text compatible image prompt adapter for text-to-image diffusion models."
  40. [40] S. Sun, J. Jia, H. Wu, Z. Ye, and J. Xing, "Msnet: A deep architecture using multi-sentiment semantics for sentiment-aware image style transfer," in ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2023, pp. 1–5.
  41. [41] J. Yang, Z. Bai, and H. Huang, "Emostyle: Emotion-driven image stylization," 2025.
  42. [42] J. Ye and S. X. Huang, "Moodifier: Mllm-enhanced emotion-driven image editing," 2025.
  43. [43] L. Ji, C. Qi, and Q. Chen, "Instruction-based image editing with planning, reasoning, and generation," in 2025 IEEE/CVF International Conference on Computer Vision (ICCV), 2025, pp. 17506–17515.
  44. [44] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, G. Krueger, and I. Sutskever, "Learning transferable visual models from natural language supervision," in ICML, 2021.
  45. [45] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
  46. [46] C. Meng, Y. He, Y. Song, J. Song, J. Wu, J.-Y. Zhu, and S. Ermon, "Sdedit: Guided image synthesis and editing with stochastic differential equations," arXiv preprint arXiv:2108.01073, 2021.
  47. [47] G. Parmar, K. K. Singh, R. Zhang, Y. Li, J. Lu, and J.-Y. Zhu, "Zero-shot image-to-image translation," 2023.
  48. [48] S. Liu, Y. Han, P. Xing, F. Yin, R. Wang, W. Cheng, J. Liao, Y. Wang, H. Fu, C. Han, G. Li, Y. Peng, Q. Sun, J. Wu, Y. Cai, Z. Ge, R. Ming, L. Xia, X. Zeng, Y. Zhu, B. Jiao, X. Zhang, G. Yu, and D. Jiang, "Step1x-edit: A practical framework for general image editing," arXiv preprint arXiv:2504.17761, 2025.
  49. [49] B. Xia, B. Peng, Y. Zhang, J. Huang, J. Liu, J. Li, H. Tan, S. Wu, C. Wang, Y. Wang, X. Wu, B. Yu, and J. Jia, "Dreamomni2: Multimodal instruction-based editing and generation," 2025.
  50. [50] J. Yang, J. Feng, and H. Huang, "Emogen: Emotional image content generation with text-to-image diffusion models," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 6358–6368.
  51. [51] W. G. Cochran, Sampling Techniques. John Wiley & Sons, 1977.
  52. [52] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Köpf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, and S. Chintala, PyTorch: An Imperative Style, High-Performance Deep Learning Library. Curran Associates Inc., 2019.
  53. [53] I. Loshchilov and F. Hutter, "Decoupled weight decay regularization," arXiv preprint arXiv:1711.05101, 2017.