arxiv: 2507.23372 · v2 · pith:2RARSBWWnew · submitted 2025-07-31 · 💻 cs.CV

UniEmo: Unifying Emotional Understanding and Generation with Learnable Expert Queries

Pith reviewed 2026-05-25 08:11 UTC · model grok-4.3

classification 💻 cs.CV
keywords emotional understanding · emotional generation · unified framework · learnable expert queries · diffusion model · hierarchical chain · data filtering · joint training

The pith

UniEmo unifies emotional understanding and generation by extracting multi-scale features with learnable expert queries that guide diffusion and receive dual feedback.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that emotional understanding and generation are complementary tasks that can be integrated into one framework rather than handled separately. It introduces a hierarchical emotional understanding chain that uses learnable expert queries to progressively pull out multi-scale emotional features from images. These features then serve as input to guide a diffusion model toward creating emotion-evoking images, with added correlation coefficients and condition losses to improve alignment. Joint training lets the generation side supply implicit feedback to the understanding side, while a data filtering step on high-quality generated images supplies explicit feedback that further strengthens understanding. A reader would care because the mutual reinforcement could raise accuracy on both tasks without maintaining two independent systems.

Core claim

The central claim is that a unified framework extracts multi-scale emotional features via a hierarchical chain of learnable expert queries, fuses those queries and representations to steer a diffusion model, and uses joint training plus a data filtering algorithm to create generation-driven dual feedback loops that improve the understanding component in return.

What carries the argument

Hierarchical emotional understanding chain with learnable expert queries that progressively extracts multi-scale emotional features to serve as a shared foundation for both tasks.
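
To make the idea of query-based extraction concrete, here is a minimal sketch in plain Python. It is not the paper's architecture: the helper names (`cross_attend`, `init_queries`), the sizes, and the way the second stage reuses the first stage's output are all assumptions chosen for illustration. A real model would use trainable parameters, learned projections, and a deep-learning framework.

```python
import math
import random

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cross_attend(queries, tokens):
    """Each query pools the tokens using scaled dot-product attention weights."""
    dim = len(tokens[0])
    scale = math.sqrt(dim)
    pooled = []
    for q in queries:
        weights = softmax([dot(q, t) / scale for t in tokens])
        pooled.append([sum(w * t[d] for w, t in zip(weights, tokens))
                       for d in range(dim)])
    return pooled

def init_queries(n, dim, rng):
    # Hypothetical stand-in: in a trained model these vectors would be
    # learnable parameters updated by backpropagation.
    return [[rng.gauss(0.0, 0.02) for _ in range(dim)] for _ in range(n)]

rng = random.Random(0)
dim = 8
# Toy "image": 16 patch tokens of width 8 (random values, for shape only).
patch_tokens = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(16)]

scene_queries = init_queries(4, dim, rng)   # coarse, scene-level experts
object_queries = init_queries(4, dim, rng)  # finer, object-level experts

# Stage 1: scene queries read the image tokens.
scene_feats = cross_attend(scene_queries, patch_tokens)
# Stage 2 (assumed design): object queries read the image tokens together
# with the scene summary, one plausible way to make the chain progressive.
object_feats = cross_attend(object_queries, patch_tokens + scene_feats)
```

Each output vector is a convex combination of the tokens it attends over, so the queries act as learned pooling heads rather than fixed feature taps.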

If this is right

  • Fusing the expert queries and emotional representations with an emotional correlation coefficient and emotional condition loss improves diversity and fidelity of generated emotion-evoking images.
  • Joint training lets the generation component supply implicit feedback that strengthens the understanding component.
  • A data filtering algorithm applied to high-quality images from the trained model supplies explicit feedback that further raises understanding capacity.
  • The resulting model outperforms prior separate state-of-the-art methods on both emotional understanding and emotional generation benchmarks.
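
The fusion step in the first bullet can be illustrated with a small sketch. The abstract does not give the formula for the emotional correlation coefficient or the emotional condition loss, so everything below is an assumption: the coefficients are taken as a softmax over cosine similarities to an emotion vector, and the loss is a cosine distance. The function names `fuse` and `condition_loss` are hypothetical.

```python
import math

def cosine(a, b):
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return num / den  # assumes neither vector is all zeros

def mean_pool(tokens):
    return [sum(col) / len(tokens) for col in zip(*tokens)]

def fuse(scene_tokens, object_tokens, emotion_vec):
    """Weight the scene and object summaries by how well each matches the emotion."""
    scene, obj = mean_pool(scene_tokens), mean_pool(object_tokens)
    # Assumed form of the coefficients: softmax over cosine similarity.
    e_s = math.exp(cosine(scene, emotion_vec))
    e_o = math.exp(cosine(obj, emotion_vec))
    alpha_scene, alpha_object = e_s / (e_s + e_o), e_o / (e_s + e_o)
    condition = [alpha_scene * s + alpha_object * o for s, o in zip(scene, obj)]
    return condition, alpha_scene, alpha_object

def condition_loss(condition, emotion_vec):
    # Assumed stand-in for the condition loss: pull the fused condition
    # toward the emotion representation.
    return 1.0 - cosine(condition, emotion_vec)

# Toy inputs: the scene branch points roughly along the emotion vector.
scene_tokens = [[1.0, 0.0], [0.8, 0.2]]
object_tokens = [[0.0, 1.0], [0.2, 0.8]]
emotion_vec = [1.0, 0.1]

condition, alpha_scene, alpha_object = fuse(scene_tokens, object_tokens, emotion_vec)
loss = condition_loss(condition, emotion_vec)
```

In this toy case the scene branch receives the larger coefficient because it is closer to the emotion vector; the fused `condition` is what would be handed to the diffusion model as guidance.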

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same query-based unification pattern could be tested on other paired vision tasks such as scene understanding paired with conditional image synthesis.
  • Iterating the dual feedback loop multiple times might produce further gains without additional labeled data.
  • The approach could lower the need for large separate training sets by letting generated images supplement understanding data.
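
A rough sketch of how generated images might be screened before they supplement understanding data is given below. The paper's actual filtering algorithm is not described on this page, so the criteria here (a confidence threshold for quality, a cosine-similarity cap for diversity), the thresholds, and the sample layout are all assumptions.

```python
import math

def cosine(a, b):
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return num / den  # assumes neither vector is all zeros

def filter_generated(samples, target, min_conf=0.8, max_sim=0.9):
    """samples: dicts with 'probs' (emotion -> probability) and 'feat' (vector).

    min_conf and max_sim are hypothetical thresholds, not values from the paper.
    """
    # Quality: keep images the understanding model confidently assigns
    # to the intended emotion.
    confident = [s for s in samples if s["probs"].get(target, 0.0) >= min_conf]
    confident.sort(key=lambda s: s["probs"][target], reverse=True)
    kept = []
    for s in confident:
        # Diversity: skip near-duplicates of images already kept.
        if all(cosine(s["feat"], k["feat"]) < max_sim for k in kept):
            kept.append(s)
    return kept

samples = [
    {"id": "a", "probs": {"amusement": 0.95}, "feat": [1.0, 0.0]},
    {"id": "b", "probs": {"amusement": 0.90}, "feat": [0.99, 0.05]},  # near-duplicate of a
    {"id": "c", "probs": {"amusement": 0.85}, "feat": [0.0, 1.0]},
    {"id": "d", "probs": {"amusement": 0.50}, "feat": [0.5, 0.5]},    # low confidence
]
kept = filter_generated(samples, "amusement")
print([s["id"] for s in kept])  # ['a', 'c']
```

The greedy pass favors the most confident image in each cluster of similar images, which is one simple way to trade quality against diversity.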

Load-bearing premise

The assumption that multi-scale emotional features extracted by the learnable expert queries form a foundational step that benefits both understanding and generation tasks.

What would settle it

An experiment that replaces the learnable expert queries and hierarchical chain with standard fixed features and measures whether both understanding accuracy and generation quality drop or stay the same under joint training.
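
One way to score such an experiment is sketched below. The metric names, the `higher_is_better` map, and the tolerance are hypothetical; no results from the paper are assumed.

```python
def metric_deltas(full, ablated):
    """Change in each metric when expert queries are swapped for fixed features."""
    return {name: ablated[name] - full[name] for name in full}

def premise_supported(deltas, higher_is_better, tol=0.0):
    """True only if every metric got worse by more than tol after the swap."""
    for name, delta in deltas.items():
        # Flip the sign for metrics where lower values are better (e.g. a distance).
        signed = delta if higher_is_better[name] else -delta
        if not signed < -tol:
            return False
    return True
```

If `premise_supported` returns `False` because the metrics stay flat, the gains would be better attributed to joint training than to the query mechanism.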

Figures

Figures reproduced from arXiv: 2507.23372 by Lingsen Zhang, Liqiang Nie, Rui Shao, Tao Tan, Yijie Zhu, Zitong Yu.

Figure 1: The pipelines for existing methods focused on Emotional Understanding and Emotional Image Content Generation are …
Figure 2: Overview of our UniEmo framework, which leverages two sets of learnable expert queries (scene tokens and object tokens) to capture hierarchical emotional representations (step 1 mentioned in the introduction). These tokens and emotional representations are fused with emotional correlation coefficients α_Scene and α_Object, supervised by the emotional condition loss L_cond to guide the diffusion model in emoti…
Figure 3: The overall pipeline for calculating the emotional …
Figure 5: Qualitative comparison with the state-of-the-art emotional generation methods. To facilitate a more comprehensive …
Figure 6: Ablation study on expert query in visual emotion understanding task across various backbone architectures
Figure 9: Visualization of the attention maps for expert queries. It demonstrates that our expert queries can effectively attend to …
Figure 10: (a) Emotion transfer, where we combine each of the learned emotion representations with several neutral semantics. …

Abstract

Emotional understanding and generation are often treated as separate tasks, yet they are inherently complementary and can mutually enhance each other. In this paper, we propose UniEmo, a unified framework that seamlessly integrates these two tasks. The key challenge lies in the abstract nature of emotions, necessitating the extraction of visual representations beneficial for both tasks. To address this, we propose a hierarchical emotional understanding chain with learnable expert queries that progressively extracts multi-scale emotional features, thereby serving as a foundational step for unification. Simultaneously, we fuse these expert queries and emotional representations to guide the diffusion model in generating emotion-evoking images. To enhance the diversity and fidelity of the generated emotional images, we further introduce the emotional correlation coefficient and emotional condition loss into the fusion process. This step facilitates fusion and alignment for emotional generation guided by the understanding. In turn, we demonstrate that joint training allows the generation component to provide implicit feedback to the understanding part. Furthermore, we propose a novel data filtering algorithm to select high-quality and diverse emotional images generated by the well-trained model, which are explicitly fed back into the understanding part. Together, these generation-driven dual feedback processes enhance the model's understanding capacity. Extensive experiments show that UniEmo significantly outperforms state-of-the-art methods in both emotional understanding and generation tasks. The code for the proposed method is available at https://github.com/JiuTian-VL/UniEmo.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes UniEmo, a unified framework for emotional understanding and generation tasks. It introduces a hierarchical emotional understanding chain using learnable expert queries to extract multi-scale emotional features, which are fused with emotional representations to condition a diffusion model for generation. An emotional correlation coefficient and emotional condition loss are added to improve generation quality, while dual feedback loops (implicit from generation and explicit via a novel data filtering algorithm) enable mutual enhancement between the two tasks. The manuscript reports that extensive experiments on standard benchmarks demonstrate significant outperformance over prior state-of-the-art methods on both understanding and generation, with code released at https://github.com/JiuTian-VL/UniEmo.

Significance. If the empirical results and mutual-enhancement claims hold, the work could meaningfully advance affective computing by providing a concrete mechanism for joint training of understanding and generation with shared representations and feedback. The public code release is a clear strength that supports reproducibility and extension by the community.

major comments (2)
  1. [Method section] The central unification claim rests on the hierarchical chain with learnable expert queries serving as a foundational step for both tasks (Method section). However, no ablation is reported that isolates the contribution of these queries versus a standard multi-scale feature extractor, making it unclear whether the reported gains on understanding benchmarks are attributable to this component or to joint training alone.
  2. [Experiments section] The dual feedback processes (implicit generation feedback and explicit data filtering) are presented as enhancing understanding capacity (Experiments section). The manuscript should report the performance delta on understanding metrics when the generation-driven feedback is disabled, as this directly tests the mutual-enhancement argument that underpins the unification.
minor comments (2)
  1. [Abstract] The abstract asserts outperformance without any numerical results, dataset names, or baseline references; while the full experimental section supplies these, the abstract should be updated for consistency with journal standards.
  2. [Method section] Notation for the emotional correlation coefficient and emotional condition loss should be introduced with explicit equations in the method section to improve clarity for readers.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which help clarify the contributions of our proposed components. We address each major comment below and will revise the manuscript to include the requested ablations.

  1. Referee: [Method section] The central unification claim rests on the hierarchical chain with learnable expert queries serving as a foundational step for both tasks (Method section). However, no ablation is reported that isolates the contribution of these queries versus a standard multi-scale feature extractor, making it unclear whether the reported gains on understanding benchmarks are attributable to this component or to joint training alone.

    Authors: We agree that an ablation isolating the learnable expert queries would strengthen the paper. The queries are designed to enable task-specific progressive extraction of multi-scale emotional features within the hierarchical chain. In the revised manuscript, we will add an ablation comparing our approach against a standard multi-scale feature extractor (e.g., a feature pyramid network with non-learnable queries) while keeping joint training fixed, to quantify the specific contribution of the learnable queries. revision: yes

  2. Referee: [Experiments section] The dual feedback processes (implicit generation feedback and explicit data filtering) are presented as enhancing understanding capacity (Experiments section). The manuscript should report the performance delta on understanding metrics when the generation-driven feedback is disabled, as this directly tests the mutual-enhancement argument that underpins the unification.

    Authors: We concur that this ablation is necessary to directly test the mutual-enhancement claim. We will add results in the revised Experiments section showing understanding benchmark performance when the generation-driven feedback (both implicit and explicit) is disabled, thereby reporting the performance delta attributable to these processes. revision: yes

Circularity Check

0 steps flagged

No significant circularity identified

The manuscript introduces a new neural architecture (hierarchical emotional understanding chain with learnable expert queries, emotional correlation coefficient, dual feedback loops, and data filtering) and reports empirical gains on standard benchmarks. No equations, derivations, or first-principles claims are present that reduce by construction to fitted parameters, self-citations, or renamed inputs. The unification argument rests on the described training procedure and quantitative comparisons rather than any self-referential reduction.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 2 invented entities

The central claim rests on the domain assumption that understanding and generation are complementary, plus several new introduced components whose effectiveness is asserted rather than derived from prior results.

free parameters (2)
  • learnable expert queries
    Parameters optimized during training to extract multi-scale emotional features; their number and initialization are not specified in the abstract.
  • emotional correlation coefficient
    Coefficient introduced to control fusion for diversity and fidelity; its exact formulation and fitting procedure are not detailed.
axioms (1)
  • domain assumption: Emotional understanding and generation are inherently complementary and can mutually enhance each other.
    Explicitly stated in the first sentence of the abstract as the motivation for unification.
invented entities (2)
  • learnable expert queries (no independent evidence)
    purpose: Progressively extract multi-scale emotional features for both tasks
    New mechanism proposed in the framework; no independent evidence outside the paper is provided in the abstract.
  • emotional correlation coefficient (no independent evidence)
    purpose: Enhance diversity and fidelity during fusion for generation
    New fusion coefficient introduced; no external validation mentioned.

Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Navigating the Emotion Tree: Hierarchical Hyperbolic RAG for Multimodal Emotion Recognition

    cs.LG · 2026-05 · unverdicted · novelty 6.0

    HyperEmo-RAG uses hierarchical hyperbolic embeddings and graph-based evidence injection to outperform prior methods in multimodal emotion recognition.

  2. ConsisVLA-4D: Advancing Spatiotemporal Consistency in Efficient 3D-Perception and 4D-Reasoning for Robotic Manipulation

    cs.RO · 2026-05 · unverdicted · novelty 6.0

    ConsisVLA-4D adds cross-view semantic alignment, cross-object geometric fusion, and cross-scene dynamic reasoning to VLA models, delivering 21.6% and 41.5% gains plus 2.3x and 2.4x speedups on LIBERO and real-world tasks.

  3. AffectAgent: Collaborative Multi-Agent Reasoning for Retrieval-Augmented Multimodal Emotion Recognition

    cs.CV · 2026-04 · unverdicted · novelty 5.0

    AffectAgent deploys a query planner, evidence filter, and emotion generator as collaborative agents trained via MAPPO with shared reward, plus MB-MoE and RAAF modules, to achieve superior multimodal emotion recognitio...

Reference graph

Works this paper leans on

82 extracted references · 82 canonical work pages · cited by 3 Pith papers · 11 internal anchors

  1. [1]

    Video ecommerce++: Toward large scale online video advertising,

    Z.-Q. Cheng, X. Wu, Y . Liu, and X.-S. Hua, “Video ecommerce++: Toward large scale online video advertising,” IEEE transactions on multimedia, vol. 19, no. 6, pp. 1170–1183, 2017

  2. [2]

    Video2shop: Exact matching clothes in videos to online shopping images,

    ——, “Video2shop: Exact matching clothes in videos to online shopping images,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 4048–4056

  3. [3]

    Detecting and grounding multi-modal media manipulation and beyond,

    R. Shao, T. Wu, J. Wu, L. Nie, and Z. Liu, “Detecting and grounding multi-modal media manipulation and beyond,” IEEE Transactions on Pattern Analysis and Machine Intelligence , 2024

  4. [4]

    Detecting and grounding multi-modal media manipulation,

    R. Shao, T. Wu, and Z. Liu, “Detecting and grounding multi-modal media manipulation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , 2023, pp. 6904–6913

  5. [5]

    Multi-adversarial discriminative deep domain generalization for face presentation attack detection,

    R. Shao, X. Lan, J. Li, and P. C. Yuen, “Multi-adversarial discriminative deep domain generalization for face presentation attack detection,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2019, pp. 10 023–10 031

  6. [6]

    Emotion recognition, emotion expres- sion, and cultural display rules: Implications for counseling

    A. Hutchison and L. Gerstein, “Emotion recognition, emotion expres- sion, and cultural display rules: Implications for counseling.” Journal of Asia Pacific Counseling, vol. 7, no. 1, 2017

  7. [7]

    Stimuli-aware visual emotion analysis,

    J. Yang, J. Li, X. Wang, Y . Ding, and X. Gao, “Stimuli-aware visual emotion analysis,” IEEE Transactions on Image Processing, vol. 30, pp. 7432–7445, 2021

  8. [8]

    Puma: Layer-pruned language model for efficient unified multimodal retrieval with modality-adaptive learning,

    Y . Lyu, R. Shao, G. Chen, Y . Zhu, W. Guan, and L. Nie, “Puma: Layer-pruned language model for efficient unified multimodal retrieval with modality-adaptive learning,” in Proceedings of the 33nd ACM International Conference on Multimedia , 2025

  9. [9]

    Analyzing emotional semantics of abstract art using low-level image features,

    H. Zhang, E. Augilius, T. Honkela, J. Laaksonen, H. Gamper, and H. Alene, “Analyzing emotional semantics of abstract art using low-level image features,” in Advances in Intelligent Data Analysis X , J. Gama, E. Bradley, and J. Hollm ´en, Eds. Berlin, Heidelberg: Springer Berlin Heidelberg, 2011, pp. 413–423

  10. [10]

    Learning deep features for image emotion classification,

    M. Chen, L. Zhang, and J. P. Allebach, “Learning deep features for image emotion classification,” in 2015 IEEE International Conference on Image Processing (ICIP) , 2015, pp. 4491–4495

  11. [11]

    Dependency exploitation: A unified cnn-rnn approach for visual emotion recognition,

    X. Zhu, L. Li, W. Zhang, T. Rao, M. Xu, Q. Huang, and D. Xu, “Dependency exploitation: A unified cnn-rnn approach for visual emotion recognition,” in International Joint Conference on Artificial Intelligence, 2017. [Online]. Available: https://api.semanticscholar.org/ CorpusID:4963251

  12. [12]

    Exploring discriminative representations for image emotion recognition with cnns,

    W. Zhang, X. He, and W. Lu, “Exploring discriminative representations for image emotion recognition with cnns,” IEEE Transactions on Mul- timedia, vol. 22, no. 2, pp. 515–523, 2020

  13. [13]

    Emotion recognition based on con- volutional neural network (cnn),

    B. G. K. Reddy, P. Yashwanthsaai, A. R. Raja, A. Jagarlamudi, N. Leeladhar, and T. T. Kumar, “Emotion recognition based on con- volutional neural network (cnn),” in 2021 International Conference on Advancements in Electrical, Electronics, Communication, Computing and Automation (ICAECA) , 2021, pp. 1–5

  14. [14]

    Solver: Scene-object interrelated visual emotion reasoning network,

    J. Yang, X. Gao, L. Li, X. Wang, and J. Ding, “Solver: Scene-object interrelated visual emotion reasoning network,” IEEE Transactions on Image Processing, vol. 30, pp. 8686–8701, 2021

  15. [15]

    Feallm: Advancing facial emotion analysis in multimodal large lan- guage models with emotional synergy and reasoning,

    Z. Hu, K. Yuan, X. Liu, Z. Yu, Y . Zong, J. Shi, H. Yue, and J. Yang, “Feallm: Advancing facial emotion analysis in multimodal large lan- guage models with emotional synergy and reasoning,” arXiv preprint arXiv:2505.13419, 2025

  16. [16]

    Emogen: Emotional image content generation with text-to-image diffusion models,

    J. Yang, J. Feng, and H. Huang, “Emogen: Emotional image content generation with text-to-image diffusion models,” in 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) . Los Alamitos, CA, USA: IEEE Computer Society, jun 2024, pp. 6358–6368. [Online]. Available: https://doi.ieeecomputersociety.org/10. 1109/CVPR52733.2024.00608

  17. [17]

    Current state of text sentiment analysis from opinion to emotion mining,

    A. Yadollahi, A. G. Shahraki, and O. R. Zaiane, “Current state of text sentiment analysis from opinion to emotion mining,” ACM Computing Surveys (CSUR), vol. 50, no. 2, pp. 1–33, 2017

  18. [18]

    Conceptualizing emotion in healthcare interpreting: A normative approach to interpreters’ emotion work,

    E. Hsieh and B. Nicodemus, “Conceptualizing emotion in healthcare interpreting: A normative approach to interpreters’ emotion work,” Patient Education and Counseling, vol. 98, no. 12, pp. 1474–1481, 2015

  19. [19]

    Learning to prompt for vision-language emotion recognition,

    H. Xie, H. Chung, H.-H. Shuai, and W.-H. Cheng, “Learning to prompt for vision-language emotion recognition,” in 2023 11th International Conference on Affective Computing and Intelligent Interaction Work- shops and Demos (ACIIW) . IEEE, 2023, pp. 1–4

  20. [20]

    Denoising diffusion probabilistic models,

    J. Ho, A. Jain, and P. Abbeel, “Denoising diffusion probabilistic models,” Advances in Neural Information Processing Systems , vol. 33, pp. 6840– 6851, 2020

  21. [21]

    High- resolution image synthesis with latent diffusion models,

    R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer, “High- resolution image synthesis with latent diffusion models,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recogni- tion, 2022, pp. 10 684–10 695

  22. [22]

    Diffusion models beat gans on image synthesis,

    P. Dhariwal and A. Nichol, “Diffusion models beat gans on image synthesis,” Advances in Neural Information Processing Systems, vol. 34, pp. 8780–8794, 2021

  23. [23]

    Adding conditional control to text-to-image diffusion models,

    L. Zhang, A. Rao, and M. Agrawala, “Adding conditional control to text-to-image diffusion models,” in Proceedings of the IEEE/CVF International Conference on Computer Vision , 2023, pp. 3836–3847

  24. [24]

    Dreambooth: Fine tuning text-to-image diffusion models for subject- driven generation,

    N. Ruiz, Y . Li, V . Jampani, Y . Pritch, M. Rubinstein, and K. Aberman, “Dreambooth: Fine tuning text-to-image diffusion models for subject- driven generation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , 2023, pp. 22 500–22 510

  25. [25]

    Show-o: One Single Transformer to Unify Multimodal Understanding and Generation

    J. Xie, W. Mao, Z. Bai, D. J. Zhang, W. Wang, K. Q. Lin, Y . Gu, Z. Chen, Z. Yang, and M. Z. Shou, “Show-o: One single transformer to unify multimodal understanding and generation,” arXiv preprint arXiv:2408.12528, 2024

  26. [26]

    Transfusion: Predict the Next Token and Diffuse Images with One Multi-Modal Model

    C. Zhou, L. Yu, A. Babu, K. Tirumala, M. Yasunaga, L. Shamis, J. Kahn, X. Ma, L. Zettlemoyer, and O. Levy, “Transfusion: Predict the next token and diffuse images with one multi-modal model,” arXiv preprint arXiv:2408.11039, 2024

  27. [27]

    VILA-U: a Unified Foundation Model Integrating Visual Understanding and Generation

    Y . Wu, Z. Zhang, J. Chen, H. Tang, D. Li, Y . Fang, L. Zhu, E. Xie, H. Yin, L. Yi et al. , “Vila-u: a unified foundation model integrating visual understanding and generation,” arXiv preprint arXiv:2409.04429, 2024

  28. [28]

    Emu3: Next-Token Prediction is All You Need

    X. Wang, X. Zhang, Z. Luo, Q. Sun, Y . Cui, J. Wang, F. Zhang, Y . Wang, Z. Li, Q. Yu et al., “Emu3: Next-token prediction is all you need,” arXiv preprint arXiv:2409.18869, 2024

  29. [29]

    ID3: Identity-Preserving-yet-Diversified Diffusion Models for Synthetic Face Recognition,

    S. Li, J. Xu, J. Wu, M. Xiong, A. Deng, J. Ji, Y . Huang, W. Feng, S. Ding, and B. Hooi, “ID3: Identity-Preserving-yet-Diversified Diffusion Models for Synthetic Face Recognition,” arXiv preprint arXiv:2409.17576, 2024

  30. [30]

    Diffusion feedback helps clip see better,

    W. Wang, Q. Sun, F. Zhang, Y . Tang, J. Liu, and X. Wang, “Diffusion feedback helps clip see better,” arXiv preprint arXiv:2407.20171, 2024

  31. [31]

    Visual objects in context,

    M. Bar, “Visual objects in context,” Nature Reviews Neuroscience, vol. 5, no. 8, pp. 617–629, 2004

  32. [32]

    Emotion experience and its varieties,

    N. H. Frijda, “Emotion experience and its varieties,” Emotion Review, vol. 1, no. 3, pp. 264–271, 2009

  33. [33]

    Affective image content analysis: Two decades review and new perspectives,

    S. Zhao, X. Yao, J. Yang, G. Jia, G. Ding, T.-S. Chua, B. W. Schuller, and K. Keutzer, “Affective image content analysis: Two decades review and new perspectives,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 44, no. 10, pp. 6729–6751, 2022

  34. [34]

    Fuzzy similarity-based emotional classification of color images,

    J. Lee and E. Park, “Fuzzy similarity-based emotional classification of color images,” IEEE Transactions on Multimedia , vol. 13, no. 5, pp. 1031–1039, 2011

  35. [35]

    Large-scale visual sentiment ontology and detectors using adjective noun pairs,

    D. Borth, R. Ji, T. Chen, T. Breuel, and S.-F. Chang, “Large-scale visual sentiment ontology and detectors using adjective noun pairs,” in Proceedings of the 21st ACM International Conference on Multimedia , 2013, pp. 223–232

  36. [36]

    Affective image classification using features inspired by psychology and art theory,

    J. Machajdik and A. Hanbury, “Affective image classification using features inspired by psychology and art theory,” in Proceedings of the 18th ACM International Conference on Multimedia , ser. MM ’10. 13 New York, NY , USA: Association for Computing Machinery, 2010, p. 83–92. [Online]. Available: https://doi.org/10.1145/1873951.1873965

  37. [37]

    Mdan: Multi-level dependent attention network for visual emotion analysis,

    L. Xu, Z. Wang, B. Wu, and S. Lui, “Mdan: Multi-level dependent attention network for visual emotion analysis,” in 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022, pp. 9469–9478

  38. [38]

    Learning multi-level deep representations for image emotion classification,

    T. Rao, X. Li, and M. Xu, “Learning multi-level deep representations for image emotion classification,” Neural Processing Letters , vol. 51, pp. 2043–2061, 2020

  39. [39]

    Object semantics sentiment correlation analysis enhanced image sentiment classification,

    J. Zhang, M. Chen, H. Sun, D. Li, and Z. Wang, “Object semantics sentiment correlation analysis enhanced image sentiment classification,” Know.-Based Syst. , vol. 191, no. C, Mar. 2020. [Online]. Available: https://doi.org/10.1016/j.knosys.2019.105245

  40. [40]

    Sentibank: large-scale ontology and classifiers for detecting sentiment and emotions in visual content,

    D. Borth, T. Chen, R. Ji, and S.-F. Chang, “Sentibank: large-scale ontology and classifiers for detecting sentiment and emotions in visual content,” in Proceedings of the 21st ACM International Conference on Multimedia, 2013, pp. 459–460

  41. [41]

    Progressive visual content understanding network for image emotion classification,

    J. Pan and S. Wang, “Progressive visual content understanding network for image emotion classification,” in Proceedings of the 31st ACM International Conference on Multimedia , ser. MM ’23. New York, NY , USA: Association for Computing Machinery, 2023, p. 6034–6044. [Online]. Available: https://doi.org/10.1145/3581783.3612186

  42. [42]

    GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models

    A. Nichol, P. Dhariwal, A. Ramesh, P. Shyam, P. Mishkin, B. McGrew, I. Sutskever, and M. Chen, “Glide: Towards photorealistic image gen- eration and editing with text-guided diffusion models,” arXiv preprint arXiv:2112.10741, 2021

  43. [43]

    Hierarchical Text-Conditional Image Generation with CLIP Latents

    A. Ramesh, P. Dhariwal, A. Nichol, C. Chu, and M. Chen, “Hierarchical text-conditional image generation with clip latents,” arXiv preprint arXiv:2204.06125, vol. 1, no. 2, p. 3, 2022

  44. [44]

    eDiff-I: Text-to-Image Diffusion Models with an Ensemble of Expert Denoisers

    Y . Balaji, S. Nah, X. Huang, A. Vahdat, J. Song, Q. Zhang, K. Kreis, M. Aittala, T. Aila, S. Laine, B. Catanzaro, T. Karras, and M.-Y . Liu, “ediff-i: Text-to-image diffusion models with an ensemble of expert denoisers,” 2023. [Online]. Available: https://arxiv.org/abs/2211.01324

  45. [45]

    Photorealistic text-to-image diffusion models with deep language understanding,

    C. Saharia, W. Chan, S. Saxena, L. Li, J. Whang, E. L. Denton, K. Ghasemipour, R. Gontijo Lopes, B. Karagol Ayan, T. Salimans et al., “Photorealistic text-to-image diffusion models with deep language understanding,” Advances in Neural Information Processing Systems , vol. 35, pp. 36 479–36 494, 2022

  46. [46]

    Text to image generation with semantic-spatial aware gan,

    W. Liao, K. Hu, M. Y . Yang, and B. Rosenhahn, “Text to image generation with semantic-spatial aware gan,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , 2022, pp. 18 187–18 196

  47. [47]

    An Image is Worth One Word: Personalizing Text-to-Image Generation using Textual Inversion

    R. Gal, Y . Alaluf, Y . Atzmon, O. Patashnik, A. H. Bermano, G. Chechik, and D. Cohen-Or, “An image is worth one word: Person- alizing text-to-image generation using textual inversion,” arXiv preprint arXiv:2208.01618, 2022

  48. [48]

    Multi- concept customization of text-to-image diffusion,

    N. Kumari, B. Zhang, R. Zhang, E. Shechtman, and J.-Y . Zhu, “Multi- concept customization of text-to-image diffusion,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , 2023, pp. 1931–1941

  49. [49]

    Learning transferable visual models from natural language supervision,

    A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark et al., “Learning transferable visual models from natural language supervision,” in International Conference on Machine Learning , 2021, pp. 8748–8763

  50. [50]

    Elite: Encoding visual concepts into textual embeddings for customized text-to-image generation,

    Y . Wei, Y . Zhang, Z. Ji, J. Bai, L. Zhang, and W. Zuo, “Elite: Encoding visual concepts into textual embeddings for customized text-to-image generation,” arXiv preprint arXiv:2302.13848 , 2023

  51. [51]

    Emoedit: Evoking emotions through image manipulation,

    J. Yang, J. Feng, W. Luo, D. Lischinski, D. Cohen-Or, and H. Huang, “Emoedit: Evoking emotions through image manipulation,” arXiv preprint arXiv:2405.12661, 2024

  52. [52]

    Emotionprompt: Leveraging psychology for large language models enhancement via emotional stimulus (nd). retrieved 1 august 2023

    C. Li et al., “Emotionprompt: Leveraging psychology for large language models enhancement via emotional stimulus (nd). retrieved 1 august 2023.”

  53. [53]

    Emoticrafter: Text-to-emotional-image generation based on valence-arousal model

    Y. He, S. Dang, L. Ling, Z. Qian, N. Zhao, and N. Cao, “Emoticrafter: Text-to-emotional-image generation based on valence-arousal model,” arXiv preprint arXiv:2501.05710, 2025

  54. [54]

    Falcon: Resolving visual redundancy and fragmentation in high-resolution multimodal large language models via visual registers

    R. Zhang, R. Shao, G. Chen, K. Zhou, W. Guan, and L. Nie, “Falcon: Resolving visual redundancy and fragmentation in high-resolution multimodal large language models via visual registers,” arXiv preprint arXiv:2501.16297, 2025

  55. [55]

    mplug-owl2: Revolutionizing multi-modal large language model with modality collaboration

    Q. Ye, H. Xu, J. Ye, M. Yan, A. Hu, H. Liu, Q. Qian, J. Zhang, and F. Huang, “mplug-owl2: Revolutionizing multi-modal large language model with modality collaboration,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 13 040–13 051

  56. [56]

    Qwen Technical Report

    J. Bai, S. Bai, Y. Chu, Z. Cui, K. Dang, X. Deng, Y. Fan, W. Ge, Y. Han, F. Huang et al., “Qwen technical report,” arXiv preprint arXiv:2309.16609, 2023

  57. [57]

    Lion-fs: Fast & slow video-language thinker as online video assistant

    W. Li, B. Hu, R. Shao, L. Shen, and L. Nie, “Lion-fs: Fast & slow video-language thinker as online video assistant,” in Proceedings of the Computer Vision and Pattern Recognition Conference, 2025, pp. 3240–3251

  58. [58]

    Lion: Empowering multimodal large language model with dual-level visual knowledge

    G. Chen, L. Shen, R. Shao, X. Deng, and L. Nie, “Lion: Empowering multimodal large language model with dual-level visual knowledge,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 26 540–26 550

  59. [59]

    End-to-end object detection with transformers

    N. Carion, F. Massa, G. Synnaeve, N. Usunier, A. Kirillov, and S. Zagoruyko, “End-to-end object detection with transformers,” in European conference on computer vision. Springer, 2020, pp. 213–229

  60. [60]

    Object-centric learning with slot attention

    F. Locatello, D. Weissenborn, T. Unterthiner, A. Mahendran, G. Heigold, J. Uszkoreit, A. Dosovitskiy, and T. Kipf, “Object-centric learning with slot attention,” arXiv preprint, 2020

  61. [61]

    Exploring plain vision transformer backbones for object detection

    Y. Li, H. Xu, Z. Wang, L. Zhang, and J. Sun, “Exploring plain vision transformer backbones for object detection,” in ECCV, 2022

  62. [62]

    Dino: Detr with improved denoising anchor boxes for end-to-end object detection

    S. Zhang, Z. Wang, X. Wang, and J. Sun, “Dino: Detr with improved denoising anchor boxes for end-to-end object detection,” in ICLR, 2023

  63. [63]

    Extract free dense labels from clip

    C. Zhou, C. C. Loy, and B. Dai, “Extract free dense labels from clip,” in European conference on computer vision. Springer, 2022, pp. 696–712

  64. [64]

    Denseclip: Language-guided dense prediction with context-aware prompting

    Y. Rao, W. Zhao, G. Chen, Y. Tang, Z. Zhu, G. Huang, J. Zhou, and J. Lu, “Denseclip: Language-guided dense prediction with context-aware prompting,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2022, pp. 18 082–18 091

  65. [65]

    An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

    D. Alexey, “An image is worth 16x16 words: Transformers for image recognition at scale,” arXiv preprint arXiv:2010.11929, 2020

  66. [66]

    Emoset: A large-scale visual emotion dataset with rich attributes

    J. Yang, Q. Huang, T. Ding, D. Lischinski, D. Cohen-Or, and H. Huang, “Emoset: A large-scale visual emotion dataset with rich attributes,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 20 383–20 394

  67. [67]

    Building a large scale dataset for image emotion recognition: The fine print and the benchmark

    Q. You, J. Luo, H. Jin, and J. Yang, “Building a large scale dataset for image emotion recognition: The fine print and the benchmark,” in Proceedings of the AAAI conference on artificial intelligence, vol. 30, 2016

  68. [68]

    Where do emotions come from? predicting the emotion stimuli map

    K.-C. Peng, A. Sadovnik, A. Gallagher, and T. Chen, “Where do emotions come from? predicting the emotion stimuli map,” in 2016 IEEE international conference on image processing (ICIP). IEEE, 2016, pp. 614–618

  69. [69]

    Robust image sentiment analysis using progressively trained and domain transferred deep networks

    Q. You, J. Luo, H. Jin, and J. Yang, “Robust image sentiment analysis using progressively trained and domain transferred deep networks,” in Proceedings of the AAAI conference on Artificial Intelligence, vol. 29, no. 1, 2015

  70. [70]

    Emovit: Revolutionizing emotion insights with visual instruction tuning

    H. Xie, C.-J. Peng, Y.-W. Tseng, H.-J. Chen, C.-F. Hsu, H.-H. Shuai, and W.-H. Cheng, “Emovit: Revolutionizing emotion insights with visual instruction tuning,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 26 596–26 605

  71. [71]

    Gans trained by a two time-scale update rule converge to a local nash equilibrium

    M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter, “Gans trained by a two time-scale update rule converge to a local nash equilibrium,” Advances in Neural Information Processing Systems, vol. 30, 2017

  72. [72]

    Learning emotional prompt features with multiple views for visual emotion analysis

    Q. Xu, Y. Wei, S. Yuan, J. Wu, L. Wang, and C. Wu, “Learning emotional prompt features with multiple views for visual emotion analysis,” Information Fusion, vol. 108, p. 102366, 2024

  73. [73]

    Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models

    J. Li, D. Li, S. Savarese, and S. Hoi, “Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models,” in International conference on machine learning. PMLR, 2023, pp. 19 730–19 742

  74. [74]

    Instructblip: towards general-purpose vision-language models with instruction tuning

    W. Dai, J. Li, D. Li, A. Tiong, J. Zhao, W. Wang, B. Li, P. Fung, and S. Hoi, “Instructblip: towards general-purpose vision-language models with instruction tuning,” arXiv preprint, 2023

  75. [75]

    Flamingo: a visual language model for few-shot learning

    J.-B. Alayrac, J. Donahue, P. Luc, A. Miech, I. Barr, Y. Hasson, K. Lenc, A. Mensch, K. Millican, M. Reynolds et al., “Flamingo: a visual language model for few-shot learning,” Advances in neural information processing systems, vol. 35, pp. 23 716–23 736, 2022

  76. [76]

    Visual instruction tuning

    H. Liu, C. Li, Q. Wu, and Y. J. Lee, “Visual instruction tuning,” Advances in neural information processing systems, vol. 36, 2024

  77. [77]

    Learning to prompt for vision-language models

    K. Zhou, J. Yang, C. C. Loy, and Z. Liu, “Learning to prompt for vision-language models,” International Journal of Computer Vision, vol. 130, no. 9, pp. 2337–2348, 2022

  78. [78]

    Conditional prompt learning for vision-language models

    K. Zhou, J. Yang, C. C. Loy, and Z. Liu, “Conditional prompt learning for vision-language models,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2022, pp. 16 816–16 825

  79. [79]

    Simemotion: A simple knowledgeable prompt tuning method for image emotion classification

    S. Deng, G. Shi, L. Wu, L. Xing, W. Hu, H. Zhang, and Y. Xiang, “Simemotion: A simple knowledgeable prompt tuning method for image emotion classification,” in International Conference on Database Systems for Advanced Applications. Springer, 2022, pp. 222–229

  80. [80]

    Learning to compose diversified prompts for image emotion classification

    S. Deng, L. Wu, G. Shi, L. Xing, M. Jian, Y. Xiang, and R. Dong, “Learning to compose diversified prompts for image emotion classification,” Computational Visual Media, pp. 1–15, 2024

Showing first 80 references.