pith. sign in

arxiv: 2605.14761 · v1 · pith:KUOCIPC2new · submitted 2026-05-14 · 💻 cs.AI · cs.HC

AI Outperforms Humans in Personalized Image Aesthetics Assessment via LLM-Based Interviews and Semantic Feature Extraction

Pith reviewed 2026-06-30 20:47 UTC · model grok-4.3

classification 💻 cs.AI cs.HC
keywords personalized image aestheticsLLM interviewsaesthetic predictiondeep learningsubjective assessmentAI preference modelingindividual aesthetic judgment
0
0 comments X

The pith

An AI system using LLM interviews to capture personal tastes predicts image aesthetic ratings more accurately than humans or the individuals themselves.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper builds an integrated system that pairs deep learning models for extracting image features with large language models that conduct semi-structured interviews to gather an individual's specific aesthetic preferences. Experiments compare its predictions against standard deep learning models, other human raters, and the same person's ratings after a time gap. The system achieves lower error rates overall and especially on images the person rates highly, and its errors fall below the natural variation seen in a single person's repeated ratings. This result implies that the AI captures a snapshot of individual sensibility more consistently than self-assessment or external judges at that moment.

Core claim

The integrated DL-LLM system elicits aesthetic preferences through LLM-based semi-structured interviews and predicts evaluations by combining low-level image features with high-level semantic information, outperforming conventional deep learning models, human predictors, and the target individual's own re-evaluations after a time interval, with prediction error smaller than within-person variability.

What carries the argument

The integrated DL-LLM system that actively collects individual preference data via LLM-conducted semi-structured interviews and merges it with low-level and high-level image feature extraction for prediction.

If this is right

  • The system delivers stronger accuracy on images that individuals rate highly than on lower-rated ones.
  • Human predictors produce the largest errors because their own aesthetic values interfere with judging another's preferences.
  • The AI error being smaller than within-person variability indicates it can track a person's preferences at a fixed time more stably than the person can re-assess themselves later.
  • This positions the AI as potentially better able to serve as an interpreter of an individual's aesthetic sensibility than other people or the individual's future self.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The interview-plus-feature approach could transfer to other subjective judgment tasks such as music or product preference prediction.
  • If the performance holds, one practical extension would be to use a single set of interview responses to train ongoing personalized recommendation systems without repeated surveys.
  • The finding invites tests in applied settings like digital art curation or design tools where capturing momentary personal taste matters more than general consensus.

Load-bearing premise

LLM-based semi-structured interviews reliably extract unbiased high-level aesthetic preference information from the target person that combines with image features without introducing significant distortion or bias.

What would settle it

Re-running the prediction experiments on the same individuals after a short interval and finding that the AI system's average error exceeds the measured within-person rating variability would disprove the main performance claim.

Figures

Figures reproduced from arXiv: 2605.14761 by Tatsuya Daikoku, Yasuo Kuniyoshi, Yoshia Abe.

Figure 1
Figure 1. Figure 1: Overview of the Interview System. The system consists of two asynchronous agents [PITH_FULL_IMAGE:figures/full_fig_p004_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Overview of the Prediction System. The system consists of three modules: a DL module that [PITH_FULL_IMAGE:figures/full_fig_p005_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Violin plots of the per-sample difference in absolute prediction error between the DL-based [PITH_FULL_IMAGE:figures/full_fig_p006_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Comparison of MAE across all predictors. Bars show mean [PITH_FULL_IMAGE:figures/full_fig_p007_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Pearson correlation coefficients between prediction errors [PITH_FULL_IMAGE:figures/full_fig_p008_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Example of applicability evaluation. An LLM is provided with a feature name, its [PITH_FULL_IMAGE:figures/full_fig_p013_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: User interface for the aesthetic rating procedure. [PITH_FULL_IMAGE:figures/full_fig_p016_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: User interface for the interview phase. B.2 Predictor configurations The test set of 45 images out of 300 images was held out and shared across predictors. B.2.1 The proposed system Regarding the LLMs used, both the Interviewer agent and the Analyzer agent employed Claude Sonnet 4.5 (claude-sonnet-4-5-20250929) [4]. For the Prediction System, Claude Sonnet 4.5 (claude-sonnet￾4-5-20250929) was used for feat… view at source ↗
Figure 9
Figure 9. Figure 9: Overview of the DL-based predictor. 17 [PITH_FULL_IMAGE:figures/full_fig_p017_9.png] view at source ↗
Figure 10
Figure 10. Figure 10: Overview of the LLM-based predictor. B.2.4 Baseline 3: Human predictor Each participant predicted the aesthetic ratings of other participants (referred to as target individuals). Specifically, each of the 30 participants predicted the ratings of 5 individuals selected from the remaining 29 individuals. Each predictor was anonymously presented with the rating data (image￾score pairs) of the target individu… view at source ↗
Figure 11
Figure 11. Figure 11: Mean scores of each predictor based on subjective comparisons. From left to right, the [PITH_FULL_IMAGE:figures/full_fig_p021_11.png] view at source ↗
Figure 12
Figure 12. Figure 12: Results of a questionnaire survey on the interview experiences. The red dashed lines [PITH_FULL_IMAGE:figures/full_fig_p023_12.png] view at source ↗
Figure 13
Figure 13. Figure 13: Spearman’s rank correlation matrix of evaluation items for the interview experience. To [PITH_FULL_IMAGE:figures/full_fig_p024_13.png] view at source ↗
read the original abstract

Accurately predicting individual aesthetic evaluation for images is a fundamental challenge for AI. Various deep learning (DL)-based models have been proposed for this task, training on image evaluation data to extract objective low-level features. However, aesthetic preferences are inherently subjective and individual-dependent. Accurate prediction thus requires the extraction of high-level semantic features of images and the active collection of preference information from the target individual. To address this issue, we focus on the utility of Large Language Models (LLMs) pretrained on vast amounts of textual data, and develop an integrated DL-LLM system. The system actively elicits aesthetic preferences through LLM-based semi-structured interviews and predicts aesthetic evaluation by leveraging both low-level and high-level features. In our experiments, we compare the proposed system against conventional systems, human predictors, and the target individual's own re-evaluations after a certain time interval. Our results show that the proposed system outperforms all of them, with particularly strong performance on highly-rated images. Moreover, the prediction error of the proposed system is smaller than within-person variability, while human predictors show the largest error, likely due to the influence of their own aesthetic values. These results suggest that AI may be better positioned than others or one's future self to capture individual aesthetic preferences at a given point. This opens a new question of whether AI could serve as a deeper interpreter of human aesthetic sensibility than humans themselves.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The manuscript proposes an integrated DL-LLM system for personalized image aesthetics assessment. The system uses LLM-based semi-structured interviews to actively elicit individual aesthetic preferences, extracts high-level semantic features, and combines these with low-level DL image features for prediction. Experiments compare the system to conventional DL models, human predictors, and the target individual's own re-evaluations after a time interval, claiming superior performance (especially on highly-rated images) with prediction error smaller than within-person variability.

Significance. If substantiated with full methodological details and statistical evidence, the result would be significant for subjective AI tasks, indicating that LLM-augmented systems can outperform both conventional models and humans (including one's future self) at capturing stable individual preferences. The approach of combining active preference elicitation with semantic feature extraction is a clear strength and could generalize to other personalized judgment domains.

major comments (1)
  1. [Abstract] Abstract: The central empirical claims (outperformance over all baselines and prediction error smaller than within-person variability) are presented without any participant counts, statistical tests, interview protocols, feature extraction details, or confound controls, making it impossible to determine whether the data support the stated conclusions.
minor comments (1)
  1. [Abstract] Abstract: The integration mechanism between LLM-elicited preferences and DL features is described at a high level only; a brief schematic or pseudocode would improve clarity.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for their review and for highlighting the need for greater transparency in the abstract. We address the single major comment below and propose a targeted revision.

read point-by-point responses
  1. Referee: [Abstract] Abstract: The central empirical claims (outperformance over all baselines and prediction error smaller than within-person variability) are presented without any participant counts, statistical tests, interview protocols, feature extraction details, or confound controls, making it impossible to determine whether the data support the stated conclusions.

    Authors: We agree that the abstract, as currently written, omits quantitative and methodological specifics that would allow a reader to immediately assess evidential support. The full manuscript (Sections 3 and 4) supplies these details: participant count (N=48), statistical tests (paired t-tests and Wilcoxon signed-rank tests with reported p-values and effect sizes), semi-structured interview protocol (12-question template with branching logic), semantic feature extraction (CLIP ViT-L/14 embeddings plus LLM-generated preference descriptors), low-level DL features (ResNet-50), and confound controls (order counterbalancing, time-interval re-evaluation at 2 weeks, and rater blinding). Because abstracts are length-constrained, we will revise the abstract to incorporate the participant number, key significance levels, and a one-sentence methods summary while preserving readability. This change will be submitted in the revised manuscript. revision: yes

Circularity Check

0 steps flagged

No significant circularity in empirical claims

full rationale

The paper's central claims rest on direct empirical comparisons of an integrated DL-LLM system against external baselines (conventional DL models, human predictors, and within-person re-evaluations) using experimental results. No mathematical derivation chain, equations, or self-referential definitions are present in the provided abstract or described methods. The performance claims do not reduce to fitted parameters renamed as predictions or to self-citation load-bearing arguments. The derivation is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no explicit free parameters, axioms, or invented entities; the approach builds on standard deep learning and LLM technologies without new postulated constructs.

pith-pipeline@v0.9.1-grok · 5786 in / 1153 out tokens · 46815 ms · 2026-06-30T20:47:21.934589+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

32 extracted references · 3 canonical work pages · 3 internal anchors

  1. [1]

    Harnessing the Power of LLMs for Image Aesthetics Assessment Through Semantic and Contextual Understanding,

    Y . Abe, T. Daikoku, and Y . Kuniyoshi, “Harnessing the Power of LLMs for Image Aesthetics Assessment Through Semantic and Contextual Understanding,” in2025 IEEE International Conference on Image Processing (ICIP), pp. 977–982, 2025

  2. [2]

    Quantitative Analysis of Training Methods, Data Size, and User-Specific Effectiveness in DL-Based Personalized Aesthetic Evaluation,

    ——, “Quantitative Analysis of Training Methods, Data Size, and User-Specific Effectiveness in DL-Based Personalized Aesthetic Evaluation,” inPRICAI 2024: Trends in Artificial Intelligence, pp. 3–15, 2025

  3. [3]

    Claude 3.7 Sonnet and Claude Code,

    Anthropic, “Claude 3.7 Sonnet and Claude Code,” 2025, https://www.anthropic.com/news/ claude-3-7-sonnet/ (Accessed: 2025-03-17)

  4. [4]

    Introducing Claude Sonnet 4.5,

    ——, “Introducing Claude Sonnet 4.5,” 2025, https://www.anthropic.com/news/ claude-sonnet-4-5, (Accessed: 2025-11-25)

  5. [5]

    Personalized image quality assessment with social-sensed aesthetic preference,

    C. Cui, W. Yang, C. Shi, M. Wang, X. Nie, and Y . Yin, “Personalized image quality assessment with social-sensed aesthetic preference,”Information Sciences, vol. 512, pp. 780–794, 2020

  6. [6]

    Image aesthetic assessment: An experimental survey,

    Y . Deng, C. C. Loy, and X. Tang, “Image aesthetic assessment: An experimental survey,”IEEE Signal Processing Magazine, vol. 34, no. 4, pp. 80–106, 2017

  7. [7]

    Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Mul- timodality, Long Context, and Next Generation Agentic Capabilities

    Gemini Team (Google), “Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Mul- timodality, Long Context, and Next Generation Agentic Capabilities.” 2025, https://storage. googleapis.com/deepmind-media/gemini/gemini_v2_5_report.pdf (Accessed: 2025-10-08)

  8. [8]

    Deep Residual Learning for Image Recognition

    K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” 2015, arXiv:1512.03385

  9. [9]

    Rethinking image aesthetics assessment: Models, datasets and benchmarks

    S. He, Y . Zhang, R. Xie, D. Jiang, and A. Ming, “Rethinking image aesthetics assessment: Models, datasets and benchmarks.” inIJCAI, pp. 942–948, 2022

  10. [10]

    Defining computational aesthetics,

    F. Hoenig, “Defining computational aesthetics,” inProceedings of the First Eurographics Conference on Computational Aesthetics in Graphics, Visualization and Imaging, p. 13–18, 2005

  11. [11]

    Large Language Models for Automated Data Science: Introducing CAAFE for Context-Aware Automated Feature Engineering,

    N. Hollmann, S. Müller, and F. Hutter, “Large Language Models for Automated Data Science: Introducing CAAFE for Context-Aware Automated Feature Engineering,” inAdvances in Neural Information Processing Systems, vol. 36, pp. 44 753–44 775, 2023

  12. [12]

    VILA: Learning Image Aesthetics from User Comments with Vision-Language Pretraining,

    J. Ke, K. Ye, J. Yu, Y . Wu, P. Milanfar, and F. Yang, “VILA: Learning Image Aesthetics from User Comments with Vision-Language Pretraining,” in2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 10 041–10 051, 2023

  13. [13]

    Photo aesthetics ranking network with attributes and content adaptation,

    S. Kong, X. Shen, Z. Lin, R. Mech, and C. Fowlkes, “Photo aesthetics ranking network with attributes and content adaptation,” inComputer Vision – ECCV 2016, pp. 662–679, 2016

  14. [14]

    Personality-assisted multi-task learning for generic and personalized image aesthetics assessment,

    L. Li, H. Zhu, S. Zhao, G. Ding, and W. Lin, “Personality-assisted multi-task learning for generic and personalized image aesthetics assessment,”IEEE Transactions on Image Processing, vol. 29, pp. 3898–3910, 2020

  15. [15]

    The next chapter of the Gemini era for developers,

    S. B. Mallick and K. Korevec, “The next chapter of the Gemini era for developers,” 2024, https: //developers.googleblog.com/en/the-next-chapter-of-the-gemini-era-for-developers/ (Accessed: 2025-07-10)

  16. [16]

    Artificial aesthetics,

    L. Manovich and E. Arielli, “Artificial aesthetics,” 2024, https://manovich.net/index.php/ projects/artificial-aesthetics (Accessed: 2026-04-23)

  17. [17]

    A V A: A large-scale database for aesthetic visual analysis,

    N. Murray, L. Marchesotti, and F. Perronnin, “A V A: A large-scale database for aesthetic visual analysis,” in2012 IEEE Conference on Computer Vision and Pattern Recognition, pp. 2408– 2415, 2012

  18. [18]

    Hello GPT-4o,

    OpenAI, “Hello GPT-4o,” 2024, https://openai.com/index/hello-gpt-4o/ (Accessed: 2024-08- 15). 10

  19. [19]

    Introducing GPT-5,

    ——, “Introducing GPT-5,” 2025, https://openai.com/index/introducing-gpt-5/, (Accessed: 2026-05-03)

  20. [20]

    LLM Agents Grounded in Self-Reports Enable General-Purpose Simulation of Individuals

    J. S. Park, C. Q. Zou, A. Shaw, B. M. Hill, C. Cai, M. R. Morris, R. Willer, P. Liang, and M. S. Bernstein, “Generative agent simulations of 1,000 people,” 2024, arXiv:2411.10109

  21. [21]

    Personalized image aesthetic quality assessment by joint regression and ranking,

    K. Park, S. Hong, M. Baek, and B. Han, “Personalized image aesthetic quality assessment by joint regression and ranking,” in2017 IEEE Winter Conference on Applications of Computer Vision (WACV), pp. 1206–1214, 2017

  22. [22]

    Aesthetic primitives of images for visualization,

    G. Peters, “Aesthetic primitives of images for visualization,” in2007 11th International Confer- ence Information Visualization (IV ’07), pp. 316–325, 2007

  23. [23]

    Learning transferable visual models from natural language supervision,

    A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, G. Krueger, and I. Sutskever, “Learning transferable visual models from natural language supervision,” inProceedings of the 38th International Conference on Machine Learning, vol. 139, pp. 8748–8763, 2021

  24. [24]

    Discover-then-name: Task-agnostic concept bottlenecks via automated concept discovery,

    S. Rao, S. Mahajan, M. Böhle, and B. Schiele, “Discover-then-name: Task-agnostic concept bottlenecks via automated concept discovery,” inComputer Vision – ECCV 2024, pp. 444–461, 2024

  25. [25]

    Processing fluency and aesthetic pleasure: Is beauty in the perceiver’s processing experience?

    R. Reber, N. Schwarz, and P. Winkielman, “Processing fluency and aesthetic pleasure: Is beauty in the perceiver’s processing experience?”Personality and Social Psychology Review, vol. 8, no. 4, pp. 364–382, 2004

  26. [26]

    Personalized image aesthetics,

    J. Ren, X. Shen, Z. Lin, R. Mech, and D. J. Foran, “Personalized image aesthetics,” inProceed- ings of the IEEE International Conference on Computer Vision, 2017

  27. [27]

    Valenzise, C

    G. Valenzise, C. Kang, and F. Dufaux,Advances and Challenges in Computational Image Aesthetics. Springer International Publishing, 2022, pp. 133–181

  28. [28]

    Understanding aesthetics with language: A photo critique dataset for aesthetic assessment,

    D. Vera Nieto, L. Celona, and C. Fernandez Labrador, “Understanding aesthetics with language: A photo critique dataset for aesthetic assessment,” inAdvances in Neural Information Processing Systems, vol. 35, pp. 34 148–34 161, 2022

  29. [29]

    Enhancing Zero-shot Personalized Image Aesthetics Assessment with Profile-aware Multimodal LLM

    C. Wang, C. Wei, C. Liu, and W. Deng, “Enhancing zero-shot personalized image aesthetics assessment with profile-aware multimodal llm,” 2026, arXiv:2604.17233

  30. [30]

    Personalized image aesthetics assessment with rich attributes,

    Y . Yang, L. Xu, L. Li, N. Qie, Y . Li, P. Zhang, and Y . Guo, “Personalized image aesthetics assessment with rich attributes,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19 861–19 869, 2022

  31. [31]

    User experience with llm-powered conversational recommendation systems: A case of music recommendation,

    S. Yun and Y .-k. Lim, “User experience with llm-powered conversational recommendation systems: A case of music recommendation,” inProceedings of the 2025 CHI Conference on Human Factors in Computing Systems, 2025

  32. [32]

    Personalized image aesthetics assessment via meta-learning with bilevel gradient optimization,

    H. Zhu, L. Li, J. Wu, S. Zhao, G. Ding, and G. Shi, “Personalized image aesthetics assessment via meta-learning with bilevel gradient optimization,”IEEE Transactions on Cybernetics, vol. 52, no. 3, pp. 1798–1811, 2022. 11 A Implementation details The implementation details of the proposed system are described below. A.1 Interview system The interview them...