Image Score: Learning and Evaluating Human Preferences for Mercari Search
Pith reviewed 2026-05-23 21:41 UTC · model grok-4.3
The pith
LLM chain-of-thought prompts produce image aesthetics labels that align with Mercari user behavior and increase sales.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
An LLM equipped with chain-of-thought prompting can generate image aesthetics labels for e-commerce items that correlate with implicit user feedback such as clicks and purchases, and incorporating these labels into the ranking algorithm leads to higher sales in an online experiment.
What carries the argument
Chain-of-thought prompting on a large language model to generate explainable aesthetics scores for product images.
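As a rough illustration of what such a prompt might look like, the sketch below builds a chain-of-thought instruction and parses a score from the model's reply. The criteria, the 1-5 scale, the `SCORE:` line convention, and both function names are assumptions made for illustration; the paper's actual prompt is not reproduced in this summary, and no LLM is called here.

```python
import re

# Hypothetical criteria; the paper's real rubric is not given in this summary.
CRITERIA = ["lighting", "background clutter", "framing of the item"]

def build_cot_prompt(criteria=CRITERIA):
    """Return a chain-of-thought instruction for scoring one product image."""
    steps = "\n".join(f"{i}. Assess the {c}." for i, c in enumerate(criteria, 1))
    return (
        "You are rating a product photo for an online marketplace.\n"
        "Think step by step before giving a score:\n"
        f"{steps}\n"
        "Finish with one line of the form 'SCORE: <integer 1-5>'."
    )

def parse_score(reply):
    """Extract the score from the first 'SCORE:' marker, or None if absent."""
    match = re.search(r"SCORE:\s*([1-5])\b", reply)
    return int(match.group(1)) if match else None
```

Keeping the free-text reasoning alongside the parsed score is what would make the label explainable: the score feeds ranking, while the reasoning can be inspected when a label looks wrong.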
If this is right
- Image quality assessment becomes feasible at scale without direct human annotation.
- Search systems can optimize for visual appeal in addition to relevance.
- Explanations from the model support debugging and customer experience improvements.
- The approach serves as a low-cost way to test new labeling strategies before full deployment.
Where Pith is reading between the lines
- Similar prompting techniques might work for other subjective judgments like item condition or trendiness.
- If the LLM has biases from its training corpus, the labels could systematically favor certain styles of photography.
- Further experiments could test whether the sales lift persists over longer periods or across different product categories.
Load-bearing premise
That the aesthetics labels generated by the LLM accurately reflect genuine human visual preferences rather than being shaped by the model's pretraining or the specific prompt wording.
What would settle it
Running a head-to-head comparison where the same set of images is rated both by the LLM and by human judges, then checking which set of labels better predicts actual user engagement metrics on the platform.
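One way such a comparison could be scored is with a rank correlation between each label set and an engagement metric. The sketch below computes Spearman's rho using only the standard library; the function names and the toy numbers are invented for illustration and are not data from the paper.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5

def spearman(x, y):
    """Spearman's rho: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# Made-up toy data: one row per image.
llm_scores = [1, 2, 3, 4, 5]
human_scores = [2, 1, 3, 5, 4]
click_rate = [0.01, 0.02, 0.03, 0.04, 0.05]

print("LLM vs clicks:  ", spearman(llm_scores, click_rate))
print("Human vs clicks:", spearman(human_scores, click_rate))
```

On real data the two correlations would need confidence intervals (for example via bootstrap over images) before one label source could be called the better predictor.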
Original abstract
Mercari is the largest C2C e-commerce marketplace in Japan, with more than 20 million monthly active users. Since search is the fundamental way to discover desired items, we have always had a substantial amount of data with implicit feedback. Although we actively take advantage of that to provide the best service for our users, the correlation of implicit feedback for such tasks as image quality assessment is not trivial. Many traditional lines of research in Machine Learning (ML) are similarly motivated by the insatiable appetite of Deep Learning (DL) models for well-labelled training data. Weak supervision is about leveraging higher-level and/or noisier supervision over unlabeled data. Large Language Models (LLMs) are being actively studied and used for data labelling tasks. We present how we leverage Chain-of-Thought (CoT) prompting to enable an LLM to produce image aesthetics labels that correlate well with human behavior in e-commerce settings. Leveraging LLMs is more cost-effective than explicit human judgment, while significantly improving the explainability of deep image quality evaluation, which is highly important for customer journey optimization at Mercari. We propose a cost-efficient LLM-driven approach for assessing and predicting image quality in e-commerce settings, which is very convenient for proof-of-concept testing. We show that our LLM-produced labels correlate with user behavior on Mercari. Finally, we show our results from an online experiment, where we achieved a significant growth in sales on the web platform.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents an LLM-based approach using Chain-of-Thought prompting to generate image aesthetics labels for items on the Mercari C2C marketplace. It claims these labels correlate with user behavior on the platform and that deploying them in an online experiment produced significant sales growth.
Significance. If the empirical results are properly validated, the work demonstrates a scalable and cost-effective alternative to human labeling for image quality assessment in e-commerce search, with potential to improve ranking and user experience.
major comments (1)
- [Abstract and Online Experimentation] The claim of 'significant growth in sales' from deploying the image scores comes with no description of the A/B test design, primary metric (GMV, conversion, or CTR), statistical tests, sample sizes, power analysis, controls for concurrent ranking or UI changes, holdout period, or exclusion criteria. Without these details, readers cannot verify that the lift is causally attributable to the image scores rather than to confounding factors.
Simulated Author's Rebuttal
We thank the referee for highlighting the need for greater transparency in our description of the online experiment. We address this point below and will revise the manuscript accordingly.
Referee: [Abstract and Online Experimentation] The claim of 'significant growth in sales' from deploying the image scores comes with no description of the A/B test design, primary metric (GMV, conversion, or CTR), statistical tests, sample sizes, power analysis, controls for concurrent ranking or UI changes, holdout period, or exclusion criteria. Without these details, readers cannot verify that the lift is causally attributable to the image scores rather than to confounding factors.
Authors: We agree that the current manuscript provides insufficient detail on the online experiment to allow independent verification of causality. In the revised version we will add a dedicated subsection (and expand the abstract) that specifies: the A/B test design and randomization procedure, the primary metric (GMV), sample sizes and power analysis, the statistical tests employed, controls for concurrent ranking or UI changes, holdout periods, and exclusion criteria. These additions will make explicit that the reported sales lift is attributable to the image-score intervention.
Revision: yes
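To make the requested statistical detail concrete, the sketch below shows one test such a subsection could report: a Welch-style comparison of mean GMV per user between control and treatment. The per-user randomization unit, the function name, and the normal approximation for the p-value are all assumptions; neither the paper nor the rebuttal states which test was used.

```python
import math
from statistics import mean, variance

def welch_z_test(control, treatment):
    """Welch statistic for a difference in means, with a two-sided p-value
    from the normal approximation (reasonable only for large samples)."""
    # Standard error with unequal variances (sample variances, n - 1).
    se = math.sqrt(variance(control) / len(control)
                   + variance(treatment) / len(treatment))
    stat = (mean(treatment) - mean(control)) / se
    # Two-sided p-value: 2 * (1 - Phi(|stat|)) == erfc(|stat| / sqrt(2)).
    p_value = math.erfc(abs(stat) / math.sqrt(2))
    return stat, p_value
```

GMV per user is typically heavy-tailed, so in practice a report would likely also state how outliers were handled (for example winsorization or a bootstrap), which is one of the exclusion-criteria details the referee asks for.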
Circularity Check
No circularity; empirical claims rest on observed correlations and A/B results
The paper presents an empirical workflow: LLM-generated image aesthetics labels (via CoT prompting) are shown to correlate with Mercari user behavior, followed by an online experiment reporting sales growth. No equations, fitted parameters, or first-principles derivations appear. No step reduces a claimed prediction to a self-defined input, a fitted subset, or a self-citation chain. The central results are external benchmarks (user logs, A/B lift) rather than quantities defined in terms of themselves. This is the normal case of a non-circular empirical ML paper.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Chain-of-thought prompting enables LLMs to generate image aesthetics labels that align with human preferences in e-commerce