AI Outperforms Humans in Personalized Image Aesthetics Assessment via LLM-Based Interviews and Semantic Feature Extraction
Pith reviewed 2026-06-30 20:47 UTC · model grok-4.3
The pith
An AI system using LLM interviews to capture personal tastes predicts image aesthetic ratings more accurately than humans or the individuals themselves.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The integrated DL-LLM system elicits aesthetic preferences through LLM-based semi-structured interviews and predicts evaluations by combining low-level image features with high-level semantic information, outperforming conventional deep learning models, human predictors, and the target individual's own re-evaluations after a time interval, with prediction error smaller than within-person variability.
What carries the argument
The integrated DL-LLM system that actively collects individual preference data via LLM-conducted semi-structured interviews and merges it with low-level and high-level image feature extraction for prediction.
If this is right
- The system delivers stronger accuracy on images that individuals rate highly than on lower-rated ones.
- Human predictors produce the largest errors because their own aesthetic values interfere with judging another's preferences.
- The AI error being smaller than within-person variability indicates it can track a person's preferences at a fixed time more stably than the person can re-assess themselves later.
- This positions the AI as potentially better able to serve as an interpreter of an individual's aesthetic sensibility than other people or the individual's future self.
Where Pith is reading between the lines
- The interview-plus-feature approach could transfer to other subjective judgment tasks such as music or product preference prediction.
- If the performance holds, one practical extension would be to use a single set of interview responses to train ongoing personalized recommendation systems without repeated surveys.
- The finding invites tests in applied settings like digital art curation or design tools where capturing momentary personal taste matters more than general consensus.
Load-bearing premise
LLM-based semi-structured interviews reliably extract unbiased high-level aesthetic preference information from the target person that combines with image features without introducing significant distortion or bias.
What would settle it
Re-running the prediction experiments on the same individuals after a short interval and finding that the AI system's average error exceeds the measured within-person rating variability would disprove the main performance claim.
Figures
read the original abstract
Accurately predicting individual aesthetic evaluation for images is a fundamental challenge for AI. Various deep learning (DL)-based models have been proposed for this task, training on image evaluation data to extract objective low-level features. However, aesthetic preferences are inherently subjective and individual-dependent. Accurate prediction thus requires the extraction of high-level semantic features of images and the active collection of preference information from the target individual. To address this issue, we focus on the utility of Large Language Models (LLMs) pretrained on vast amounts of textual data, and develop an integrated DL-LLM system. The system actively elicits aesthetic preferences through LLM-based semi-structured interviews and predicts aesthetic evaluation by leveraging both low-level and high-level features. In our experiments, we compare the proposed system against conventional systems, human predictors, and the target individual's own re-evaluations after a certain time interval. Our results show that the proposed system outperforms all of them, with particularly strong performance on highly-rated images. Moreover, the prediction error of the proposed system is smaller than within-person variability, while human predictors show the largest error, likely due to the influence of their own aesthetic values. These results suggest that AI may be better positioned than others or one's future self to capture individual aesthetic preferences at a given point. This opens a new question of whether AI could serve as a deeper interpreter of human aesthetic sensibility than humans themselves.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes an integrated DL-LLM system for personalized image aesthetics assessment. The system uses LLM-based semi-structured interviews to actively elicit individual aesthetic preferences, extracts high-level semantic features, and combines these with low-level DL image features for prediction. Experiments compare the system to conventional DL models, human predictors, and the target individual's own re-evaluations after a time interval, claiming superior performance (especially on highly-rated images) with prediction error smaller than within-person variability.
Significance. If substantiated with full methodological details and statistical evidence, the result would be significant for subjective AI tasks, indicating that LLM-augmented systems can outperform both conventional models and humans (including one's future self) at capturing stable individual preferences. The approach of combining active preference elicitation with semantic feature extraction is a clear strength and could generalize to other personalized judgment domains.
major comments (1)
- [Abstract] Abstract: The central empirical claims (outperformance over all baselines and prediction error smaller than within-person variability) are presented without any participant counts, statistical tests, interview protocols, feature extraction details, or confound controls, making it impossible to determine whether the data support the stated conclusions.
minor comments (1)
- [Abstract] Abstract: The integration mechanism between LLM-elicited preferences and DL features is described at a high level only; a brief schematic or pseudocode would improve clarity.
Simulated Author's Rebuttal
We thank the referee for their review and for highlighting the need for greater transparency in the abstract. We address the single major comment below and propose a targeted revision.
read point-by-point responses
-
Referee: [Abstract] Abstract: The central empirical claims (outperformance over all baselines and prediction error smaller than within-person variability) are presented without any participant counts, statistical tests, interview protocols, feature extraction details, or confound controls, making it impossible to determine whether the data support the stated conclusions.
Authors: We agree that the abstract, as currently written, omits quantitative and methodological specifics that would allow a reader to immediately assess evidential support. The full manuscript (Sections 3 and 4) supplies these details: participant count (N=48), statistical tests (paired t-tests and Wilcoxon signed-rank tests with reported p-values and effect sizes), semi-structured interview protocol (12-question template with branching logic), semantic feature extraction (CLIP ViT-L/14 embeddings plus LLM-generated preference descriptors), low-level DL features (ResNet-50), and confound controls (order counterbalancing, time-interval re-evaluation at 2 weeks, and rater blinding). Because abstracts are length-constrained, we will revise the abstract to incorporate the participant number, key significance levels, and a one-sentence methods summary while preserving readability. This change will be submitted in the revised manuscript. revision: yes
Circularity Check
No significant circularity in empirical claims
full rationale
The paper's central claims rest on direct empirical comparisons of an integrated DL-LLM system against external baselines (conventional DL models, human predictors, and within-person re-evaluations) using experimental results. No mathematical derivation chain, equations, or self-referential definitions are present in the provided abstract or described methods. The performance claims do not reduce to fitted parameters renamed as predictions or to self-citation load-bearing arguments. The derivation is self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
Harnessing the Power of LLMs for Image Aesthetics Assessment Through Semantic and Contextual Understanding,
Y . Abe, T. Daikoku, and Y . Kuniyoshi, “Harnessing the Power of LLMs for Image Aesthetics Assessment Through Semantic and Contextual Understanding,” in2025 IEEE International Conference on Image Processing (ICIP), pp. 977–982, 2025
2025
-
[2]
Quantitative Analysis of Training Methods, Data Size, and User-Specific Effectiveness in DL-Based Personalized Aesthetic Evaluation,
——, “Quantitative Analysis of Training Methods, Data Size, and User-Specific Effectiveness in DL-Based Personalized Aesthetic Evaluation,” inPRICAI 2024: Trends in Artificial Intelligence, pp. 3–15, 2025
2024
-
[3]
Claude 3.7 Sonnet and Claude Code,
Anthropic, “Claude 3.7 Sonnet and Claude Code,” 2025, https://www.anthropic.com/news/ claude-3-7-sonnet/ (Accessed: 2025-03-17)
2025
-
[4]
Introducing Claude Sonnet 4.5,
——, “Introducing Claude Sonnet 4.5,” 2025, https://www.anthropic.com/news/ claude-sonnet-4-5, (Accessed: 2025-11-25)
2025
-
[5]
Personalized image quality assessment with social-sensed aesthetic preference,
C. Cui, W. Yang, C. Shi, M. Wang, X. Nie, and Y . Yin, “Personalized image quality assessment with social-sensed aesthetic preference,”Information Sciences, vol. 512, pp. 780–794, 2020
2020
-
[6]
Image aesthetic assessment: An experimental survey,
Y . Deng, C. C. Loy, and X. Tang, “Image aesthetic assessment: An experimental survey,”IEEE Signal Processing Magazine, vol. 34, no. 4, pp. 80–106, 2017
2017
-
[7]
Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Mul- timodality, Long Context, and Next Generation Agentic Capabilities
Gemini Team (Google), “Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Mul- timodality, Long Context, and Next Generation Agentic Capabilities.” 2025, https://storage. googleapis.com/deepmind-media/gemini/gemini_v2_5_report.pdf (Accessed: 2025-10-08)
2025
-
[8]
Deep Residual Learning for Image Recognition
K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” 2015, arXiv:1512.03385
work page internal anchor Pith review Pith/arXiv arXiv 2015
-
[9]
Rethinking image aesthetics assessment: Models, datasets and benchmarks
S. He, Y . Zhang, R. Xie, D. Jiang, and A. Ming, “Rethinking image aesthetics assessment: Models, datasets and benchmarks.” inIJCAI, pp. 942–948, 2022
2022
-
[10]
Defining computational aesthetics,
F. Hoenig, “Defining computational aesthetics,” inProceedings of the First Eurographics Conference on Computational Aesthetics in Graphics, Visualization and Imaging, p. 13–18, 2005
2005
-
[11]
Large Language Models for Automated Data Science: Introducing CAAFE for Context-Aware Automated Feature Engineering,
N. Hollmann, S. Müller, and F. Hutter, “Large Language Models for Automated Data Science: Introducing CAAFE for Context-Aware Automated Feature Engineering,” inAdvances in Neural Information Processing Systems, vol. 36, pp. 44 753–44 775, 2023
2023
-
[12]
VILA: Learning Image Aesthetics from User Comments with Vision-Language Pretraining,
J. Ke, K. Ye, J. Yu, Y . Wu, P. Milanfar, and F. Yang, “VILA: Learning Image Aesthetics from User Comments with Vision-Language Pretraining,” in2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 10 041–10 051, 2023
2023
-
[13]
Photo aesthetics ranking network with attributes and content adaptation,
S. Kong, X. Shen, Z. Lin, R. Mech, and C. Fowlkes, “Photo aesthetics ranking network with attributes and content adaptation,” inComputer Vision – ECCV 2016, pp. 662–679, 2016
2016
-
[14]
Personality-assisted multi-task learning for generic and personalized image aesthetics assessment,
L. Li, H. Zhu, S. Zhao, G. Ding, and W. Lin, “Personality-assisted multi-task learning for generic and personalized image aesthetics assessment,”IEEE Transactions on Image Processing, vol. 29, pp. 3898–3910, 2020
2020
-
[15]
The next chapter of the Gemini era for developers,
S. B. Mallick and K. Korevec, “The next chapter of the Gemini era for developers,” 2024, https: //developers.googleblog.com/en/the-next-chapter-of-the-gemini-era-for-developers/ (Accessed: 2025-07-10)
2024
-
[16]
Artificial aesthetics,
L. Manovich and E. Arielli, “Artificial aesthetics,” 2024, https://manovich.net/index.php/ projects/artificial-aesthetics (Accessed: 2026-04-23)
2024
-
[17]
A V A: A large-scale database for aesthetic visual analysis,
N. Murray, L. Marchesotti, and F. Perronnin, “A V A: A large-scale database for aesthetic visual analysis,” in2012 IEEE Conference on Computer Vision and Pattern Recognition, pp. 2408– 2415, 2012
2012
-
[18]
Hello GPT-4o,
OpenAI, “Hello GPT-4o,” 2024, https://openai.com/index/hello-gpt-4o/ (Accessed: 2024-08- 15). 10
2024
-
[19]
Introducing GPT-5,
——, “Introducing GPT-5,” 2025, https://openai.com/index/introducing-gpt-5/, (Accessed: 2026-05-03)
2025
-
[20]
LLM Agents Grounded in Self-Reports Enable General-Purpose Simulation of Individuals
J. S. Park, C. Q. Zou, A. Shaw, B. M. Hill, C. Cai, M. R. Morris, R. Willer, P. Liang, and M. S. Bernstein, “Generative agent simulations of 1,000 people,” 2024, arXiv:2411.10109
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[21]
Personalized image aesthetic quality assessment by joint regression and ranking,
K. Park, S. Hong, M. Baek, and B. Han, “Personalized image aesthetic quality assessment by joint regression and ranking,” in2017 IEEE Winter Conference on Applications of Computer Vision (WACV), pp. 1206–1214, 2017
2017
-
[22]
Aesthetic primitives of images for visualization,
G. Peters, “Aesthetic primitives of images for visualization,” in2007 11th International Confer- ence Information Visualization (IV ’07), pp. 316–325, 2007
2007
-
[23]
Learning transferable visual models from natural language supervision,
A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, G. Krueger, and I. Sutskever, “Learning transferable visual models from natural language supervision,” inProceedings of the 38th International Conference on Machine Learning, vol. 139, pp. 8748–8763, 2021
2021
-
[24]
Discover-then-name: Task-agnostic concept bottlenecks via automated concept discovery,
S. Rao, S. Mahajan, M. Böhle, and B. Schiele, “Discover-then-name: Task-agnostic concept bottlenecks via automated concept discovery,” inComputer Vision – ECCV 2024, pp. 444–461, 2024
2024
-
[25]
Processing fluency and aesthetic pleasure: Is beauty in the perceiver’s processing experience?
R. Reber, N. Schwarz, and P. Winkielman, “Processing fluency and aesthetic pleasure: Is beauty in the perceiver’s processing experience?”Personality and Social Psychology Review, vol. 8, no. 4, pp. 364–382, 2004
2004
-
[26]
Personalized image aesthetics,
J. Ren, X. Shen, Z. Lin, R. Mech, and D. J. Foran, “Personalized image aesthetics,” inProceed- ings of the IEEE International Conference on Computer Vision, 2017
2017
-
[27]
Valenzise, C
G. Valenzise, C. Kang, and F. Dufaux,Advances and Challenges in Computational Image Aesthetics. Springer International Publishing, 2022, pp. 133–181
2022
-
[28]
Understanding aesthetics with language: A photo critique dataset for aesthetic assessment,
D. Vera Nieto, L. Celona, and C. Fernandez Labrador, “Understanding aesthetics with language: A photo critique dataset for aesthetic assessment,” inAdvances in Neural Information Processing Systems, vol. 35, pp. 34 148–34 161, 2022
2022
-
[29]
Enhancing Zero-shot Personalized Image Aesthetics Assessment with Profile-aware Multimodal LLM
C. Wang, C. Wei, C. Liu, and W. Deng, “Enhancing zero-shot personalized image aesthetics assessment with profile-aware multimodal llm,” 2026, arXiv:2604.17233
work page internal anchor Pith review Pith/arXiv arXiv 2026
-
[30]
Personalized image aesthetics assessment with rich attributes,
Y . Yang, L. Xu, L. Li, N. Qie, Y . Li, P. Zhang, and Y . Guo, “Personalized image aesthetics assessment with rich attributes,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19 861–19 869, 2022
2022
-
[31]
User experience with llm-powered conversational recommendation systems: A case of music recommendation,
S. Yun and Y .-k. Lim, “User experience with llm-powered conversational recommendation systems: A case of music recommendation,” inProceedings of the 2025 CHI Conference on Human Factors in Computing Systems, 2025
2025
-
[32]
Personalized image aesthetics assessment via meta-learning with bilevel gradient optimization,
H. Zhu, L. Li, J. Wu, S. Zhao, G. Ding, and G. Shi, “Personalized image aesthetics assessment via meta-learning with bilevel gradient optimization,”IEEE Transactions on Cybernetics, vol. 52, no. 3, pp. 1798–1811, 2022. 11 A Implementation details The implementation details of the proposed system are described below. A.1 Interview system The interview them...
2022
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.