Image Score: Learning and Evaluating Human Preferences for Mercari Search

Chingis Oinar; Miao Cao; Shanshan Fu

arxiv: 2408.11349 · v2 · submitted 2024-08-21 · 💻 cs.CV

Pith reviewed 2026-05-23 21:41 UTC · model grok-4.3

keywords image aesthetics · LLM labeling · e-commerce · weak supervision · search ranking · online experimentation · user behavior

The pith

LLM chain-of-thought prompts produce image aesthetics labels that align with Mercari user behavior and increase sales.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents a method to use large language models for labeling the visual quality of product images in an online marketplace. By applying chain-of-thought reasoning, the model generates scores and explanations that match patterns in how users interact with search results. This labeling is less expensive than hiring people to rate images and provides transparency into why an image receives a particular score. When these scores were integrated into the live search ranking on Mercari, the platform recorded a significant increase in sales volume.

Core claim

An LLM equipped with chain-of-thought prompting can generate image aesthetics labels for e-commerce items that correlate with implicit user feedback such as clicks and purchases, and incorporating these labels into the ranking algorithm leads to higher sales in an online experiment.
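
One way to quantify "correlate with implicit user feedback" is a point-biserial correlation between a continuous aesthetics score and a binary click flag. The sketch below is illustrative only: the text available here does not say which correlation measure the authors used, and the names `point_biserial`, `scores`, and `clicked` are hypothetical.

```python
import math


def point_biserial(scores, clicked):
    """Point-biserial correlation between a numeric score and a 0/1 outcome.

    Equivalent to the Pearson correlation when one variable is binary.
    `scores` and `clicked` are hypothetical inputs: one aesthetics score and
    one click flag per impression.
    """
    n = len(scores)
    if n != len(clicked) or n < 2:
        raise ValueError("need two equally sized sequences with at least 2 items")

    pos = [s for s, c in zip(scores, clicked) if c]
    neg = [s for s, c in zip(scores, clicked) if not c]
    if not pos or not neg:
        raise ValueError("need at least one clicked and one non-clicked item")

    mean_all = sum(scores) / n
    # Population standard deviation of the scores.
    std_all = math.sqrt(sum((s - mean_all) ** 2 for s in scores) / n)
    if std_all == 0:
        return 0.0  # constant scores carry no signal

    mean_pos = sum(pos) / len(pos)
    mean_neg = sum(neg) / len(neg)
    p = len(pos) / n  # share of clicked items
    return (mean_pos - mean_neg) / std_all * math.sqrt(p * (1 - p))
```

A value near zero would suggest the labels carry little click signal. A clearly positive value is consistent with the claim but does not by itself rule out confounders such as position or price.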

What carries the argument

Chain-of-thought prompting on a large language model to generate explainable aesthetics scores for product images.

If this is right

Image quality assessment becomes feasible at scale without direct human annotation.
Search systems can optimize for visual appeal in addition to relevance.
Explanations from the model support debugging and customer experience improvements.
The approach serves as a low-cost way to test new labeling strategies before full deployment.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar prompting techniques might work for other subjective judgments like item condition or trendiness.
If the LLM has biases from its training corpus, the labels could systematically favor certain styles of photography.
Further experiments could test whether the sales lift persists over longer periods or across different product categories.

Load-bearing premise

That the aesthetics labels generated by the LLM accurately reflect genuine human visual preferences rather than being shaped by the model's pretraining or the specific prompt wording.

What would settle it

Running a head-to-head comparison where the same set of images is rated both by the LLM and by human judges, then checking which set of labels better predicts actual user engagement metrics on the platform.
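
The head-to-head comparison described above can be sketched as follows, assuming each image has an LLM score, a human score, and a binary engagement flag. AUC is used here as one possible measure of how well each label set predicts engagement. It is an assumption of this sketch, not a metric taken from the paper, and `auc` and `compare_label_sets` are hypothetical helper names.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison (Mann-Whitney form).

    Returns the probability that a randomly chosen engaged item is scored
    higher than a randomly chosen non-engaged item, counting ties as 0.5.
    Quadratic in the number of items, which is fine for a small audit sample.
    """
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    if not pos or not neg:
        raise ValueError("need at least one engaged and one non-engaged item")

    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def compare_label_sets(llm_scores, human_scores, engaged):
    """Score both label sets against the same engagement outcomes."""
    return {
        "llm_auc": auc(llm_scores, engaged),
        "human_auc": auc(human_scores, engaged),
    }
```

For example, `compare_label_sets([0.9, 0.8, 0.3, 0.1], [0.9, 0.2, 0.8, 0.1], [1, 1, 0, 0])` gives an LLM AUC of 1.0 and a human AUC of 0.75 on this toy data. A real comparison would also need confidence intervals and controls for position bias.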

Figures

Figures reproduced from arXiv: 2408.11349 by Chingis Oinar, Miao Cao, Shanshan Fu.

**Figure 2.** The data collection and processing pipeline. Price filtering and position windowing are applied to SERPs to ...

**Figure 3.** The prompt for image aesthetic evaluation and image batch examples. The images with green borders are ...

**Figure 4.** The score distributions for clicked items and not clicked items. In the chart on the right-hand side, the colors ...

**Figure 5.** Most common adjectives from LLM analysis that are unique to clicked and not clicked items on the left and ...

**Figure 6.** We adopt the findings of visual perception proposed by CLIP-IQA. We find that CLIP embeddings are also ...

**Figure 7.** We adopt the pre-trained image encoder of CLIP as the backbone of our architecture, hence we only train an ...

**Figure 8.** The Elasticsearch online indexing pipeline with the Image Score model.

**Figure 9.** We show the effect of the proposed Image Score model by comparing the original ordering in the Search ...

**Figure 10.** We show the most common adjectives from LLM analysis for both clicked and not clicked items. As ...

Original abstract

Mercari is the largest C2C e-commerce marketplace in Japan, having more than 20 million active monthly users. Search being the fundamental way to discover desired items, we have always had a substantial amount of data with implicit feedback. Although we actively take advantage of that to provide the best service for our users, the correlation of implicit feedback for such tasks as image quality assessment is not trivial. Many traditional lines of research in Machine Learning (ML) are similarly motivated by the insatiable appetite of Deep Learning (DL) models for well-labelled training data. Weak supervision is about leveraging higher-level and/or noisier supervision over unlabeled data. Large Language Models (LLMs) are being actively studied and used for data labelling tasks. We present how we leverage a Chain-of-Thought (CoT) to enable LLM to produce image aesthetics labels that correlate well with human behavior in e-commerce settings. Leveraging LLMs is more cost-effective compared to explicit human judgment, while significantly improving the explainability of deep image quality evaluation which is highly important for customer journey optimization at Mercari. We propose a cost-efficient LLM-driven approach for assessing and predicting image quality in e-commerce settings, which is very convenient for proof-of-concept testing. We show that our LLM-produced labels correlate with user behavior on Mercari. Finally, we show our results from an online experimentation, where we achieved a significant growth in sales on the web platform.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

This applies established LLM chain-of-thought labeling to Mercari image aesthetics with a sales-growth claim, but the abstract supplies no metrics, test design, or controls to support it.

The core of this paper is a straightforward application of LLM chain-of-thought prompting to generate image aesthetics labels from Mercari's implicit feedback data. It positions the method as cheaper than human labeling and more explainable for search ranking tweaks. That framing matches real constraints at a large C2C marketplace and shows how weak supervision can be slotted into an existing pipeline without new model training. The authors correctly note that traditional image quality signals often fail to align with actual user clicks and purchases in this setting. Those points are useful for anyone running similar search systems.

The online experiment is presented as the payoff, with a reported sales lift after deploying the scores. However, the abstract gives no numbers on correlation strength, no A/B test parameters, no primary metric, no holdout details, and no mention of whether other ranking changes were frozen. That absence makes the causality claim impossible to evaluate from the given text. The weakest assumption, that LLM labels track genuine human preferences without prompt or training-data bias, is stated but not tested or bounded.

This work is aimed at industry ML teams that need quick labeling recipes for e-commerce image tasks. Academic readers or those wanting new methods or reproducible results will not find much. I would not send it for peer review in its current state; the experiment section needs concrete design and outcome details before any referee could assess whether the sales result holds.

Referee Report

1 major / 0 minor

Summary. The paper presents an LLM-based approach using Chain-of-Thought prompting to generate image aesthetics labels for items on the Mercari C2C marketplace. It claims these labels correlate with user behavior on the platform and that deploying them in an online experiment produced significant sales growth.

Significance. If the empirical results are properly validated, the work demonstrates a scalable and cost-effective alternative to human labeling for image quality assessment in e-commerce search, with potential to improve ranking and user experience.

major comments (1)

[Abstract and Online Experimentation] Abstract and results on online experimentation: the claim of 'significant growth in sales' from deploying the image scores provides no description of the A/B test design, primary metric (GMV, conversion, or CTR), statistical tests, sample sizes, power analysis, controls for concurrent ranking or UI changes, holdout period, or exclusion criteria. This prevents verification that the lift is causally attributable to the image scores rather than confounding factors.
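
To make concrete what a reported statistical test might look like, here is a minimal sketch of a two-sided two-proportion z-test on conversion counts from a control and a treatment arm. It is an illustration under stated assumptions, not the authors' analysis: the paper's test design is not described, conversion rate is only one of the candidate metrics listed above, and `two_proportion_z_test` and its parameters are hypothetical names.

```python
import math


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference in conversion rate between two arms.

    conv_a / n_a: conversions and users in the control arm (hypothetical).
    conv_b / n_b: conversions and users in the treatment arm (hypothetical).
    Returns (z, p_value), using the pooled-variance normal approximation.
    """
    if n_a <= 0 or n_b <= 0:
        raise ValueError("both arms need at least one user")

    p_a = conv_a / n_a
    p_b = conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        raise ValueError("pooled rate is 0 or 1; the test is undefined")

    z = (p_b - p_a) / se
    # Two-sided p-value: 2 * (1 - Phi(|z|)) == erfc(|z| / sqrt(2)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value
```

A revenue metric such as GMV is typically heavy-tailed, so a test on per-user means with bootstrap or delta-method intervals would likely fit it better than this proportion test.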

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for highlighting the need for greater transparency in our description of the online experiment. We address this point below and will revise the manuscript accordingly.

Referee: [Abstract and Online Experimentation] Abstract and results on online experimentation: the claim of 'significant growth in sales' from deploying the image scores provides no description of the A/B test design, primary metric (GMV, conversion, or CTR), statistical tests, sample sizes, power analysis, controls for concurrent ranking or UI changes, holdout period, or exclusion criteria. This prevents verification that the lift is causally attributable to the image scores rather than confounding factors.

Authors: We agree that the current manuscript provides insufficient detail on the online experiment to allow independent verification of causality. In the revised version we will add a dedicated subsection (and expand the abstract) that specifies: the A/B test design and randomization procedure, the primary metric (GMV), sample sizes and power analysis, the statistical tests employed, controls for concurrent ranking or UI changes, holdout periods, and exclusion criteria. These additions will make explicit that the reported sales lift is attributable to the image-score intervention.

Revision: yes

Circularity Check

0 steps flagged

No circularity; empirical claims rest on observed correlations and A/B results

The paper presents an empirical workflow: LLM-generated image aesthetics labels (via CoT prompting) are shown to correlate with Mercari user behavior, followed by an online experiment reporting sales growth. No equations, fitted parameters, or first-principles derivations appear. No step reduces a claimed prediction to a self-defined input, a fitted subset, or a self-citation chain. The central results are external benchmarks (user logs, A/B lift) rather than quantities defined in terms of themselves. This is the normal case of a non-circular empirical ML paper.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Only the abstract is available; no explicit free parameters, axioms, or invented entities are stated. The approach implicitly assumes LLMs can produce reliable aesthetics judgments via CoT without domain-specific fine-tuning.

axioms (1)

Domain assumption: Chain-of-thought prompting enables LLMs to generate image aesthetics labels that align with human preferences in e-commerce.
Invoked in the description of the labeling process and correlation claim.

pith-pipeline@v0.9.0 · 5785 in / 1271 out tokens · 21955 ms · 2026-05-23T21:41:01.022219+00:00 · methodology

Reference graph

Works this paper leans on

14 extracted references · 14 canonical work pages · 5 internal anchors

[1]

GPT-4 Technical Report

Gpt-4 technical report. arXiv preprint arXiv:2303.08774 (2023). Javad Azimi, Ruofei Zhang, Yang Zhou, Vidhya Navalpakkam, Jianchang Mao, and Xiaoli Fern

work page internal anchor Pith review Pith/arXiv arXiv 2023
[2]

The Impact of Visual Appearance on User Response in Online Display Advertising

The Impact of Visual Appearance on User Response in Online Display Advertising. arXiv:1202.2158 [cs.HC] Fabiano Muniz Belém, Alexandre Maros, Sérgio D. Canuto, Rodrigo M. Silva, Jussara M. Almeida, and Marcos André Gonçalves

work page internal anchor Pith review Pith/arXiv arXiv
[3]

arXiv preprint arXiv:2305.10843 (2023)

X-iqe: explainable image quality evaluation for text-to-image generation with visual large language models. arXiv preprint arXiv:2305.10843 (2023). Manri Cheon, Sung-Jun Yoon, Byungyeon Kang, and Junwoo Lee

work page arXiv 2023
[4]

In Computer Vision–ECCV 2006: 9th European Conference on Computer Vision, Graz, Austria, May 7-13, 2006, Proceedings, Part III

Studying aesthetics in photographic images using a computational approach. In Computer Vision–ECCV 2006: 9th European Conference on Computer Vision, Graz, Austria, May 7-13, 2006, Proceedings, Part III

work page 2006
[5]

Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár

Handling position bias for unbiased learning to rank in hotels search. arXiv preprint arXiv:2002.12528 (2020). Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár

work page arXiv 2020
[6]

SGDR: Stochastic Gradient Descent with Warm Restarts

Sgdr: Stochastic gradient descent with warm restarts. arXiv preprint arXiv:1608.03983 (2016). Ilya Loshchilov and Frank Hutter

work page internal anchor Pith review Pith/arXiv arXiv 2016
[7]

Decoupled Weight Decay Regularization

Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017). Xiao Ma, Lina Mezghani, Kimberly Wilber, Hui Hong, Robinson Piramuthu, Mor Naaman, and Serge Belongie

work page internal anchor Pith review Pith/arXiv arXiv 2017
[8]

In 2019 IEEE Winter Conference on Applications of Computer Vision (WACV)

Understanding Image Quality and Trust in Peer-to-Peer Marketplaces. In 2019 IEEE Winter Conference on Applications of Computer Vision (WACV). 511–520. https://doi.org/10.1109/WACV.2019.00060 Zhaoqing Pan, Feng Yuan, Jianjun Lei, Yuming Fang, Xiao Shao, and Sam Kwong

work page doi:10.1109/wacv.2019.00060 2019
[9]

IEEE Transactions on Image Processing 31 (2022), 1613–1627

VCRNet: Visual compensation restoration network for no-reference image quality assessment. IEEE Transactions on Image Processing 31 (2022), 1613–1627. Richard Yuanzhe Pang, Stephen Roller, Kyunghyun Cho, He He, and Jason Weston

work page 2022
[10]

arXiv preprint arXiv:2307.14117 (2023)

Leveraging implicit feedback from deployment data in dialogue. arXiv preprint arXiv:2307.14117 (2023). Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al

work page arXiv 2023
[11]

In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Release of Pre-Trained Models for the Japanese Language. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024). https://arxiv.org/abs/2404.01657 Makoto Shing, Tianyu Zhao, and Kei Sawada. [n. d.]. rinna/japanese-clip-vit-b-16. https://huggingface.co/rinna/japanese-clip...

work page arXiv 2024
[12]

Advances in neural information processing systems 35 (2022), 24824–24837

Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems 35 (2022), 24824–24837. Jiazheng Xu, Xiao Liu, Yuchen Wu, Yuxuan Tong, Qinkai Li, Ming Ding, Jie Tang, and Yuxiao Dong

work page 2022
[13]

Advances in Neural Information Processing Systems 36 (2024)

ImageReward: Learning and evaluating human preferences for text-to-image generation. Advances in Neural Information Processing Systems 36 (2024). Stephen Zakrewsky, Kamelia Aryafar, and Ali Shokoufandeh

work page 2024
[14]

Item Popularity Prediction in E-commerce Using Image Quality Feature Vectors

Item Popularity Prediction in E-commerce Using Image Quality Feature Vectors. CoRR abs/1605.03663 (2016). arXiv:1605.03663 http://arxiv.org/abs/1605.03663 Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, and Lucas Beyer

work page internal anchor Pith review Pith/arXiv arXiv 2016
