AIM-Bench: Benchmarking and Improving Affective Image Manipulation via Fine-Grained Hierarchical Control
Pith reviewed 2026-05-10 15:30 UTC · model grok-4.3
The pith
A new benchmark for affective image manipulation exposes a positivity bias in current editing models and shows that fine-tuning on a balanced 40k-sample dataset mitigates it, yielding a 9.15% relative gain in overall performance.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Current image editing models exhibit a strong positivity bias when asked to perform affective manipulations because their training distributions favor pleasant scenes. The authors correct this by constructing a 40,000-sample instruction-tuning dataset through an inverse repainting process that produces high-fidelity ground-truth images paired with divergent-emotion inputs and exact instructions, then demonstrate that fine-tuning on these triples raises composite performance by 9.15% relative to the untuned baseline.
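The 9.15% figure is a relative gain over the untuned baseline's composite score rather than an absolute point difference; a minimal worked example of that arithmetic, using hypothetical scores that are not taken from the paper:

```python
# Worked example of the relative-improvement arithmetic. The baseline and
# fine-tuned composite scores below are hypothetical placeholders; only the
# 9.15% figure itself comes from the paper's abstract.

def relative_improvement(baseline: float, finetuned: float) -> float:
    """Relative gain of the fine-tuned model over the untuned baseline."""
    return (finetuned - baseline) / baseline

baseline_score = 60.00    # hypothetical composite score of the untuned baseline
finetuned_score = 65.49   # hypothetical score after fine-tuning on AIM-40k

print(f"{relative_improvement(baseline_score, finetuned_score):.2%}")  # -> 9.15%
```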
What carries the argument
The inverse repainting strategy that enhances raw affective images into high-fidelity ground truths and then synthesizes opposing-emotion input images together with paired precise instructions.
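A minimal sketch of that pipeline as the abstract describes it: establish a high-fidelity ground truth first, then synthesize the divergent-emotion input and the instruction that maps one to the other. The model handles and method names below are hypothetical placeholders, not the authors' implementation.

```python
from dataclasses import dataclass
from typing import Any

@dataclass
class TrainingTriple:
    input_image: Any     # synthesized image with a divergent emotion
    instruction: str     # precise edit instruction mapping input -> target
    target_image: Any    # high-fidelity ground-truth image

def inverse_repaint(raw_image: Any, source_emotion: str, target_emotion: str,
                    redraw_model: Any, edit_model: Any,
                    instruction_writer: Any) -> TrainingTriple:
    """Sketch of the data engine: the ground truth is fixed first, the input second.

    The three model handles are hypothetical interfaces standing in for whatever
    generative redrawing, editing, and captioning models the authors actually use.
    """
    # 1) Enhance the raw affective image into a clean, high-fidelity ground truth
    #    while preserving its original emotion (the "generative redrawing" step).
    ground_truth = redraw_model.redraw(raw_image, preserve_emotion=source_emotion)

    # 2) Synthesize an input image whose emotion diverges from the ground truth,
    #    e.g. shifting a contentment scene toward fear.
    divergent_input = edit_model.shift_emotion(ground_truth, to_emotion=target_emotion)

    # 3) Write the precise instruction that maps the divergent input back to the
    #    ground truth, giving an exact (input, instruction, target) supervision triple.
    instruction = instruction_writer.describe_edit(divergent_input, ground_truth)

    return TrainingTriple(divergent_input, instruction, ground_truth)
```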
Load-bearing premise
The 800 human-curated samples together with the composite rule-based and model-based metrics accurately reflect human perception of emotional changes and instruction following.
What would settle it
A new human study that rates the same model outputs and finds either no reduction in positivity bias or no overall score gain after fine-tuning on AIM-40k.
Original abstract
Affective Image Manipulation (AIM) aims to evoke specific emotions through targeted editing. Current image editing benchmarks primarily focus on object-level modifications in general scenarios, lacking the fine-grained granularity to capture affective dimensions. To bridge this gap, we introduce the first benchmark designed for AIM termed AIM-Bench. This benchmark is built upon a dual-path affective modeling scheme that integrates the Mikels emotion taxonomy with the Valence-Arousal-Dominance framework, enabling high-level semantic and fine-grained continuous manipulation. Through a hierarchical human-in-the-loop workflow, we finally curate 800 high-quality samples covering 8 emotional categories and 5 editing types. To effectively assess performance, we also design a composite evaluation suite combining rule-based and model-based metrics to holistically assess instruction consistency, aesthetics, and emotional expressiveness. Extensive evaluations reveal that current editing models face significant challenges, most notably a prevalent positivity bias, which stemming from inherent imbalances in training data distribution. To tackle this, we propose a scalable data engine utilizing an inverse repainting strategy to construct AIM-40k, a balanced instruction-tuning dataset comprising 40k samples. Concretely, we enhance raw affective images via generative redrawing to establish high-fidelity ground truths, and synthesize input images with divergent emotions and paired precise instructions. Fine-tuning a baseline model on AIM-40k yields a 9.15% relative improvement in overall performance, demonstrating the effectiveness of our AIM-40k. Our data and related code will be made open soon.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces AIM-Bench, the first benchmark for Affective Image Manipulation (AIM), built on a dual-path scheme integrating Mikels' emotion taxonomy with the Valence-Arousal-Dominance (VAD) framework. It curates 800 samples via a hierarchical human-in-the-loop workflow across 8 emotional categories and 5 editing types, and proposes a composite evaluation suite of rule-based and model-based metrics for instruction consistency, aesthetics, and emotional expressiveness. The work identifies a positivity bias in existing models stemming from training-data imbalances, constructs AIM-40k (40k balanced samples) via an inverse repainting data engine, and reports that fine-tuning a baseline on AIM-40k yields a 9.15% relative improvement on AIM-Bench.
Significance. If the metrics and ground-truth curation are validated, the benchmark and dataset would fill a clear gap in fine-grained affective control for image editing and provide a concrete path to mitigate data-distribution biases. The scalable data engine, explicit identification of the positivity bias, and commitment to open-sourcing data and code are concrete strengths that would increase the work's utility to the community.
Major comments (3)
- [Composite evaluation suite] (Abstract and results section): the headline 9.15% relative improvement is measured exclusively with the authors' composite rule-based plus model-based metrics, yet no correlation analysis, human preference study, or calibration against human affective judgments is reported. This directly undermines confidence that the delta reflects genuine gains in emotional expressiveness and instruction consistency rather than metric artifacts.
- [AIM-Bench curation workflow] (Abstract and §3): the claim that the hierarchical human-in-the-loop process produces 800 high-fidelity, bias-free ground truths is asserted without inter-annotator agreement statistics, sensitivity analysis, or validation that the synthetic ground truths preserve affective fidelity without artifacts.
- [AIM-40k fine-tuning results] The 9.15% relative improvement is presented without statistical significance testing, confidence intervals, or an ablation isolating the contribution of the inverse-repainting strategy from other factors, while the closed loop between the authors' data engine and the evaluation benchmark introduces a moderate circularity risk.
Minor comments (2)
- [Abstract] Abstract contains a grammatical error ('which stemming from') that should be corrected for readability.
- [Dual-path modeling] The exact definitions and weighting of the dual-path affective modeling scheme (Mikels + VAD) should be stated explicitly with equations or pseudocode in the early sections to aid reproducibility.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback, which identifies key areas where additional rigor will strengthen the manuscript. We address each major comment point by point below, providing clarifications where appropriate and committing to specific revisions that directly respond to the concerns raised.
Point-by-point responses
- Referee: [Composite evaluation suite] (Abstract and results section): the headline 9.15% relative improvement is measured exclusively with the authors' composite rule-based plus model-based metrics, yet no correlation analysis, human preference study, or calibration against human affective judgments is reported. This directly undermines confidence that the delta reflects genuine gains in emotional expressiveness and instruction consistency rather than metric artifacts.
Authors: We agree that direct validation of the composite metrics against human affective judgments is essential to substantiate that the reported improvements reflect genuine gains rather than artifacts. The metrics combine established rule-based measures (for instruction consistency and aesthetics) with model-based predictors for emotional dimensions, informed by prior affective computing literature. However, the original submission did not include a dedicated human calibration study. In the revised manuscript, we will add a human preference study involving multiple annotators who rate edited outputs on emotional expressiveness, instruction adherence, and overall quality. We will report correlation analyses (e.g., Spearman rank correlations) between the composite metric scores and human ratings, along with inter-rater reliability measures, to calibrate and validate the metrics. revision: yes
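For reference, such a metric-versus-human calibration is computationally straightforward; a minimal sketch using SciPy, with placeholder arrays standing in for composite metric scores and mean human ratings (no values from the paper):

```python
import numpy as np
from scipy.stats import spearmanr

# Hypothetical data: one composite metric score and one mean human rating per
# edited output. These arrays are placeholders, not results from the paper.
metric_scores = np.array([0.62, 0.71, 0.55, 0.80, 0.44, 0.67, 0.73, 0.58])
human_ratings = np.array([3.1, 3.8, 2.9, 4.2, 2.5, 3.5, 3.9, 3.0])

# Spearman rank correlation between automatic scores and human judgments.
rho, p_value = spearmanr(metric_scores, human_ratings)
print(f"Spearman rho = {rho:.3f}, p = {p_value:.3g}")
```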
- Referee: [AIM-Bench curation workflow] (Abstract and §3): the claim that the hierarchical human-in-the-loop process produces 800 high-fidelity, bias-free ground truths is asserted without inter-annotator agreement statistics, sensitivity analysis, or validation that the synthetic ground truths preserve affective fidelity without artifacts.
Authors: The hierarchical human-in-the-loop workflow was structured with multiple stages of expert review to promote high fidelity and reduce bias. We acknowledge that the original manuscript did not report quantitative validation statistics for this process. To address this, the revision will include inter-annotator agreement statistics (e.g., Fleiss' kappa) computed across annotators for both emotional category assignments and editing type labels. We will also add a sensitivity analysis on key workflow parameters and a validation experiment comparing the curated ground truths against independent human affective judgments to confirm preservation of emotional fidelity and absence of artifacts. revision: yes
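The promised agreement statistic is likewise standard; a minimal sketch using statsmodels, with an illustrative label matrix rather than the benchmark's actual annotations:

```python
import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# Hypothetical annotation matrix: rows are samples, columns are annotators,
# entries are emotion-category indices (0-7 for the 8 Mikels categories).
# These labels are illustrative, not the benchmark's actual annotations.
labels = np.array([
    [0, 0, 1],
    [3, 3, 3],
    [5, 5, 4],
    [7, 7, 7],
    [2, 1, 2],
])

# Convert per-rater labels into a samples x categories count table, then
# compute Fleiss' kappa over it.
table, _ = aggregate_raters(labels)
kappa = fleiss_kappa(table, method="fleiss")
print(f"Fleiss' kappa = {kappa:.3f}")
```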
- Referee: [AIM-40k fine-tuning results] The 9.15% relative improvement is presented without statistical significance testing, confidence intervals, or an ablation isolating the contribution of the inverse-repainting strategy from other factors, while the closed loop between the authors' data engine and the evaluation benchmark introduces a moderate circularity risk.
Authors: We concur that statistical rigor and isolation of contributions are necessary for robust claims. In the revision, we will report statistical significance testing (e.g., paired t-tests or Wilcoxon signed-rank tests with p-values) on the performance deltas, along with 95% confidence intervals around the 9.15% relative improvement. We will also include ablation studies that isolate the inverse-repainting strategy by comparing it against training on the original unbalanced data and alternative balancing approaches. On the circularity concern, we clarify that AIM-Bench curation was performed independently via a dedicated human-in-the-loop process on a distinct set of real images, while the inverse-repainting data engine was used solely to synthesize the separate AIM-40k training set from a different image pool. There is no sample overlap between the benchmark and the training data generation. The revision will expand the methods section to explicitly document this separation and provide supporting details to eliminate any perceived circularity. revision: yes
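For concreteness, a minimal sketch of the promised paired test and confidence interval, using simulated per-sample scores (not the paper's numbers) and assuming the composite metric yields one score per benchmark item:

```python
import numpy as np
from scipy.stats import wilcoxon

rng = np.random.default_rng(0)

# Hypothetical per-sample composite scores for the baseline and the fine-tuned
# model on the same 800 benchmark items; placeholders, not the paper's data.
baseline = rng.uniform(0.4, 0.7, size=800)
finetuned = baseline + rng.normal(0.05, 0.05, size=800)

# Paired Wilcoxon signed-rank test on the per-sample score differences.
stat, p_value = wilcoxon(finetuned, baseline)

# Percentile bootstrap CI for the relative improvement of the mean score.
boot = []
for _ in range(2000):
    idx = rng.integers(0, len(baseline), len(baseline))
    boot.append(finetuned[idx].mean() / baseline[idx].mean() - 1.0)
ci_low, ci_high = np.percentile(boot, [2.5, 97.5])

print(f"Wilcoxon p = {p_value:.3g}; relative gain 95% CI = [{ci_low:.2%}, {ci_high:.2%}]")
```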
Circularity Check
No significant circularity; the empirical improvement is measured on an independent, human-curated benchmark.
Full rationale
The central claim is an observed 9.15% relative gain after fine-tuning on AIM-40k, evaluated on the separately curated 800-sample AIM-Bench using a composite metric suite. The benchmark curation is described as a hierarchical human-in-the-loop process, the training-data synthesis uses an inverse repainting engine, and the metrics combine rule-based and model-based components; on the paper's own account, none of these steps collapses the reported delta into a definitional identity or a self-citation chain. The result remains an external measurement rather than a tautology.
Axiom & Free-Parameter Ledger
Axioms (1)
- Domain assumption: the Mikels emotion taxonomy combined with the Valence-Arousal-Dominance framework provides an adequate dual-path model for fine-grained affective image editing.
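To make the assumption concrete, a minimal sketch of what a dual-path affective target could look like: a discrete Mikels category for high-level semantic control plus a continuous Valence-Arousal-Dominance vector for fine-grained control. The eight category names follow Mikels et al.; the VAD anchor values, ranges, and data structure are illustrative assumptions, not the paper's formulation.

```python
from dataclasses import dataclass

# The eight Mikels emotion categories (four positive, four negative).
MIKELS_CATEGORIES = [
    "amusement", "awe", "contentment", "excitement",   # positive
    "anger", "disgust", "fear", "sadness",             # negative
]

@dataclass
class AffectiveTarget:
    """Dual-path target: high-level category plus fine-grained VAD coordinates."""
    category: str       # one of MIKELS_CATEGORIES
    valence: float      # pleasantness, assumed range [-1, 1]
    arousal: float      # activation,   assumed range [-1, 1]
    dominance: float    # control,      assumed range [-1, 1]

# Illustrative (assumed) VAD anchors; the paper may use different values or scales.
VAD_ANCHORS = {
    "contentment": (0.8, -0.4, 0.3),
    "excitement":  (0.7,  0.8, 0.4),
    "fear":        (-0.7, 0.7, -0.6),
    "sadness":     (-0.7, -0.5, -0.4),
}

def make_target(category: str) -> AffectiveTarget:
    v, a, d = VAD_ANCHORS[category]
    return AffectiveTarget(category, v, a, d)

print(make_target("fear"))
```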