AIM-Bench: Benchmarking and Improving Affective Image Manipulation via Fine-Grained Hierarchical Control
Pith reviewed 2026-05-10 15:30 UTC · model grok-4.3
The pith
A new benchmark for affective image manipulation exposes a positivity bias in current editing models and shows that fine-tuning on a balanced 40k-sample dataset mitigates it, yielding a 9.15% relative gain in overall performance.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Current image editing models exhibit a strong positivity bias when asked to perform affective manipulations because their training distributions favor pleasant scenes. The authors correct this by constructing a 40,000-sample instruction-tuning dataset through an inverse repainting process that produces high-fidelity ground-truth images paired with divergent-emotion inputs and exact instructions, then demonstrate that fine-tuning on these triples raises composite performance by 9.15% relative to the untuned baseline.
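The 9.15% figure is a relative gain over the untuned baseline's composite score rather than an absolute point difference; a minimal worked example of that arithmetic, using hypothetical scores that are not taken from the paper:

```python
# Worked example of the relative-improvement arithmetic. The baseline and
# fine-tuned composite scores below are hypothetical placeholders; only the
# 9.15% figure itself comes from the paper's abstract.

def relative_improvement(baseline: float, finetuned: float) -> float:
    """Relative gain of the fine-tuned model over the untuned baseline."""
    return (finetuned - baseline) / baseline

baseline_score = 60.00    # hypothetical composite score of the untuned baseline
finetuned_score = 65.49   # hypothetical score after fine-tuning on AIM-40k

print(f"{relative_improvement(baseline_score, finetuned_score):.2%}")  # -> 9.15%
```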
What carries the argument
The inverse repainting strategy that enhances raw affective images into high-fidelity ground truths and then synthesizes opposing-emotion input images together with paired precise instructions.
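A minimal sketch of that pipeline as the abstract describes it: establish a high-fidelity ground truth first, then synthesize the divergent-emotion input and the instruction that maps one to the other. The model handles and method names below are hypothetical placeholders, not the authors' implementation.

```python
from dataclasses import dataclass
from typing import Any

@dataclass
class TrainingTriple:
    input_image: Any     # synthesized image with a divergent emotion
    instruction: str     # precise edit instruction mapping input -> target
    target_image: Any    # high-fidelity ground-truth image

def inverse_repaint(raw_image: Any, source_emotion: str, target_emotion: str,
                    redraw_model: Any, edit_model: Any,
                    instruction_writer: Any) -> TrainingTriple:
    """Sketch of the data engine: the ground truth is fixed first, the input second.

    The three model handles are hypothetical interfaces standing in for whatever
    generative redrawing, editing, and captioning models the authors actually use.
    """
    # 1) Enhance the raw affective image into a clean, high-fidelity ground truth
    #    while preserving its original emotion (the "generative redrawing" step).
    ground_truth = redraw_model.redraw(raw_image, preserve_emotion=source_emotion)

    # 2) Synthesize an input image whose emotion diverges from the ground truth,
    #    e.g. shifting a contentment scene toward fear.
    divergent_input = edit_model.shift_emotion(ground_truth, to_emotion=target_emotion)

    # 3) Write the precise instruction that maps the divergent input back to the
    #    ground truth, giving an exact (input, instruction, target) supervision triple.
    instruction = instruction_writer.describe_edit(divergent_input, ground_truth)

    return TrainingTriple(divergent_input, instruction, ground_truth)
```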
Load-bearing premise
The 800 human-curated samples together with the composite rule-based and model-based metrics accurately reflect human perception of emotional changes and instruction following.
What would settle it
A new human study that rates the same model outputs and finds either no reduction in positivity bias or no overall score gain after fine-tuning on AIM-40k.
Original abstract
Affective Image Manipulation (AIM) aims to evoke specific emotions through targeted editing. Current image editing benchmarks primarily focus on object-level modifications in general scenarios, lacking the fine-grained granularity to capture affective dimensions. To bridge this gap, we introduce the first benchmark designed for AIM termed AIM-Bench. This benchmark is built upon a dual-path affective modeling scheme that integrates the Mikels emotion taxonomy with the Valence-Arousal-Dominance framework, enabling high-level semantic and fine-grained continuous manipulation. Through a hierarchical human-in-the-loop workflow, we finally curate 800 high-quality samples covering 8 emotional categories and 5 editing types. To effectively assess performance, we also design a composite evaluation suite combining rule-based and model-based metrics to holistically assess instruction consistency, aesthetics, and emotional expressiveness. Extensive evaluations reveal that current editing models face significant challenges, most notably a prevalent positivity bias, which stemming from inherent imbalances in training data distribution. To tackle this, we propose a scalable data engine utilizing an inverse repainting strategy to construct AIM-40k, a balanced instruction-tuning dataset comprising 40k samples. Concretely, we enhance raw affective images via generative redrawing to establish high-fidelity ground truths, and synthesize input images with divergent emotions and paired precise instructions. Fine-tuning a baseline model on AIM-40k yields a 9.15% relative improvement in overall performance, demonstrating the effectiveness of our AIM-40k. Our data and related code will be made open soon.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces AIM-Bench, the first benchmark for Affective Image Manipulation (AIM), built on a dual-path scheme integrating Mikels' emotion taxonomy with the Valence-Arousal-Dominance (VAD) framework. It curates 800 samples via a hierarchical human-in-the-loop workflow across 8 emotional categories and 5 editing types, and proposes a composite evaluation suite of rule-based and model-based metrics for instruction consistency, aesthetics, and emotional expressiveness. The work identifies a positivity bias in existing models stemming from training-data imbalances, constructs AIM-40k (40k balanced samples) via an inverse repainting data engine, and reports that fine-tuning a baseline on AIM-40k yields a 9.15% relative improvement on AIM-Bench.
Significance. If the metrics and ground-truth curation are validated, the benchmark and dataset would fill a clear gap in fine-grained affective control for image editing and provide a concrete path to mitigate data-distribution biases. The scalable data engine, explicit identification of the positivity bias, and commitment to open-sourcing data and code are concrete strengths that would increase the work's utility to the community.
Major comments (3)
- [Composite evaluation suite] (Abstract and results section): the headline 9.15% relative improvement is measured exclusively with the authors' composite rule-based plus model-based metrics, yet no correlation analysis, human preference study, or calibration against human affective judgments is reported. This directly undermines confidence that the delta reflects genuine gains in emotional expressiveness and instruction consistency rather than metric artifacts.
- [AIM-Bench curation workflow] (Abstract and §3): the claim that the hierarchical human-in-the-loop process produces 800 high-fidelity, bias-free ground truths is asserted without inter-annotator agreement statistics, sensitivity analysis, or validation that the synthetic ground truths preserve affective fidelity without artifacts.
- [AIM-40k fine-tuning results] The 9.15% relative improvement is presented without statistical significance testing, confidence intervals, or an ablation isolating the contribution of the inverse-repainting strategy from other factors, while the closed loop between the authors' data engine and the evaluation benchmark introduces a moderate circularity risk.
Minor comments (2)
- [Abstract] Abstract contains a grammatical error ('which stemming from') that should be corrected for readability.
- [Dual-path modeling] The exact definitions and weighting of the dual-path affective modeling scheme (Mikels + VAD) should be stated explicitly with equations or pseudocode in the early sections to aid reproducibility.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback, which identifies key areas where additional rigor will strengthen the manuscript. We address each major comment point by point below, providing clarifications where appropriate and committing to specific revisions that directly respond to the concerns raised.
Point-by-point responses
- Referee: [Composite evaluation suite] (Abstract and results section): the headline 9.15% relative improvement is measured exclusively with the authors' composite rule-based plus model-based metrics, yet no correlation analysis, human preference study, or calibration against human affective judgments is reported. This directly undermines confidence that the delta reflects genuine gains in emotional expressiveness and instruction consistency rather than metric artifacts.
Authors: We agree that direct validation of the composite metrics against human affective judgments is essential to substantiate that the reported improvements reflect genuine gains rather than artifacts. The metrics combine established rule-based measures (for instruction consistency and aesthetics) with model-based predictors for emotional dimensions, informed by prior affective computing literature. However, the original submission did not include a dedicated human calibration study. In the revised manuscript, we will add a human preference study involving multiple annotators who rate edited outputs on emotional expressiveness, instruction adherence, and overall quality. We will report correlation analyses (e.g., Spearman rank correlations) between the composite metric scores and human ratings, along with inter-rater reliability measures, to calibrate and validate the metrics. revision: yes
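For reference, such a metric-versus-human calibration is computationally straightforward; a minimal sketch using SciPy, with placeholder arrays standing in for composite metric scores and mean human ratings (no values from the paper):

```python
import numpy as np
from scipy.stats import spearmanr

# Hypothetical data: one composite metric score and one mean human rating per
# edited output. These arrays are placeholders, not results from the paper.
metric_scores = np.array([0.62, 0.71, 0.55, 0.80, 0.44, 0.67, 0.73, 0.58])
human_ratings = np.array([3.1, 3.8, 2.9, 4.2, 2.5, 3.5, 3.9, 3.0])

# Spearman rank correlation between automatic scores and human judgments.
rho, p_value = spearmanr(metric_scores, human_ratings)
print(f"Spearman rho = {rho:.3f}, p = {p_value:.3g}")
```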
- Referee: [AIM-Bench curation workflow] (Abstract and §3): the claim that the hierarchical human-in-the-loop process produces 800 high-fidelity, bias-free ground truths is asserted without inter-annotator agreement statistics, sensitivity analysis, or validation that the synthetic ground truths preserve affective fidelity without artifacts.
Authors: The hierarchical human-in-the-loop workflow was structured with multiple stages of expert review to promote high fidelity and reduce bias. We acknowledge that the original manuscript did not report quantitative validation statistics for this process. To address this, the revision will include inter-annotator agreement statistics (e.g., Fleiss' kappa) computed across annotators for both emotional category assignments and editing type labels. We will also add a sensitivity analysis on key workflow parameters and a validation experiment comparing the curated ground truths against independent human affective judgments to confirm preservation of emotional fidelity and absence of artifacts. revision: yes
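The promised agreement statistic is likewise standard; a minimal sketch using statsmodels, with an illustrative label matrix rather than the benchmark's actual annotations:

```python
import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# Hypothetical annotation matrix: rows are samples, columns are annotators,
# entries are emotion-category indices (0-7 for the 8 Mikels categories).
# These labels are illustrative, not the benchmark's actual annotations.
labels = np.array([
    [0, 0, 1],
    [3, 3, 3],
    [5, 5, 4],
    [7, 7, 7],
    [2, 1, 2],
])

# Convert per-rater labels into a samples x categories count table, then
# compute Fleiss' kappa over it.
table, _ = aggregate_raters(labels)
kappa = fleiss_kappa(table, method="fleiss")
print(f"Fleiss' kappa = {kappa:.3f}")
```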
- Referee: [AIM-40k fine-tuning results] The 9.15% relative improvement is presented without statistical significance testing, confidence intervals, or an ablation isolating the contribution of the inverse-repainting strategy from other factors, while the closed loop between the authors' data engine and the evaluation benchmark introduces a moderate circularity risk.
Authors: We concur that statistical rigor and isolation of contributions are necessary for robust claims. In the revision, we will report statistical significance testing (e.g., paired t-tests or Wilcoxon signed-rank tests with p-values) on the performance deltas, along with 95% confidence intervals around the 9.15% relative improvement. We will also include ablation studies that isolate the inverse-repainting strategy by comparing it against training on the original unbalanced data and alternative balancing approaches. On the circularity concern, we clarify that AIM-Bench curation was performed independently via a dedicated human-in-the-loop process on a distinct set of real images, while the inverse-repainting data engine was used solely to synthesize the separate AIM-40k training set from a different image pool. There is no sample overlap between the benchmark and the training data generation. The revision will expand the methods section to explicitly document this separation and provide supporting details to eliminate any perceived circularity. revision: yes
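For concreteness, a minimal sketch of the promised paired test and confidence interval, using simulated per-sample scores (not the paper's numbers) and assuming the composite metric yields one score per benchmark item:

```python
import numpy as np
from scipy.stats import wilcoxon

rng = np.random.default_rng(0)

# Hypothetical per-sample composite scores for the baseline and the fine-tuned
# model on the same 800 benchmark items; placeholders, not the paper's data.
baseline = rng.uniform(0.4, 0.7, size=800)
finetuned = baseline + rng.normal(0.05, 0.05, size=800)

# Paired Wilcoxon signed-rank test on the per-sample score differences.
stat, p_value = wilcoxon(finetuned, baseline)

# Percentile bootstrap CI for the relative improvement of the mean score.
boot = []
for _ in range(2000):
    idx = rng.integers(0, len(baseline), len(baseline))
    boot.append(finetuned[idx].mean() / baseline[idx].mean() - 1.0)
ci_low, ci_high = np.percentile(boot, [2.5, 97.5])

print(f"Wilcoxon p = {p_value:.3g}; relative gain 95% CI = [{ci_low:.2%}, {ci_high:.2%}]")
```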
Circularity Check
No significant circularity; the empirical improvement is measured on an independent, human-curated benchmark.
Full rationale
The central claim is an observed 9.15% relative gain after fine-tuning on AIM-40k, evaluated on the separately curated 800-sample AIM-Bench using a composite metric suite. The benchmark curation is described as a hierarchical human-in-the-loop process, the training-data synthesis uses an inverse repainting engine, and the metrics combine rule-based and model-based components; on the paper's own account, none of these steps collapses the reported delta into a definitional identity or a self-citation chain. The result remains an external measurement rather than a tautology.
Axiom & Free-Parameter Ledger
Axioms (1)
- Domain assumption: the Mikels emotion taxonomy combined with the Valence-Arousal-Dominance framework provides an adequate dual-path model for fine-grained affective image editing.
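To make the assumption concrete, a minimal sketch of what a dual-path affective target could look like: a discrete Mikels category for high-level semantic control plus a continuous Valence-Arousal-Dominance vector for fine-grained control. The eight category names follow Mikels et al.; the VAD anchor values, ranges, and data structure are illustrative assumptions, not the paper's formulation.

```python
from dataclasses import dataclass

# The eight Mikels emotion categories (four positive, four negative).
MIKELS_CATEGORIES = [
    "amusement", "awe", "contentment", "excitement",   # positive
    "anger", "disgust", "fear", "sadness",             # negative
]

@dataclass
class AffectiveTarget:
    """Dual-path target: high-level category plus fine-grained VAD coordinates."""
    category: str       # one of MIKELS_CATEGORIES
    valence: float      # pleasantness, assumed range [-1, 1]
    arousal: float      # activation,   assumed range [-1, 1]
    dominance: float    # control,      assumed range [-1, 1]

# Illustrative (assumed) VAD anchors; the paper may use different values or scales.
VAD_ANCHORS = {
    "contentment": (0.8, -0.4, 0.3),
    "excitement":  (0.7,  0.8, 0.4),
    "fear":        (-0.7, 0.7, -0.6),
    "sadness":     (-0.7, -0.5, -0.4),
}

def make_target(category: str) -> AffectiveTarget:
    v, a, d = VAD_ANCHORS[category]
    return AffectiveTarget(category, v, a, d)

print(make_target("fear"))
```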