
arxiv: 2408.11349 · v2 · submitted 2024-08-21 · 💻 cs.CV

Image Score: Learning and Evaluating Human Preferences for Mercari Search

Pith reviewed 2026-05-23 21:41 UTC · model grok-4.3

classification 💻 cs.CV
keywords image aesthetics · LLM labeling · e-commerce · weak supervision · search ranking · online experimentation · user behavior

The pith

LLM chain-of-thought prompts produce image aesthetics labels that align with Mercari user behavior and increase sales.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents a method to use large language models for labeling the visual quality of product images in an online marketplace. By applying chain-of-thought reasoning, the model generates scores and explanations that match patterns in how users interact with search results. This labeling is less expensive than hiring people to rate images and provides transparency into why an image receives a particular score. When these scores were integrated into the live search ranking on Mercari, the platform recorded a significant increase in sales volume.

Core claim

An LLM equipped with chain-of-thought prompting can generate image aesthetics labels for e-commerce items that correlate with implicit user feedback such as clicks and purchases, and incorporating these labels into the ranking algorithm leads to higher sales in an online experiment.
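
One way to make the correlation half of this claim concrete is a point-biserial correlation (Pearson's r with a binary variable) between a per-item score and a click indicator. The sketch below is illustrative only: the function name, the 1 to 5 score scale, and the toy data are assumptions, and the paper's own correlation analysis may be set up differently.

```python
from math import sqrt
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation; with a 0/1 variable this is the point-biserial r."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


# Toy data (invented for illustration, not from the paper):
# hypothetical LLM aesthetics scores and whether each item was clicked.
toy_scores = [1, 2, 2, 3, 4, 4, 5, 5]
toy_clicked = [0, 0, 1, 0, 1, 0, 1, 1]
toy_r = pearson(toy_scores, toy_clicked)
```

A positive `toy_r` means higher-scored items were clicked more often in the toy sample. On real search logs, position and price effects would have to be controlled before reading anything into such a number.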

What carries the argument

Chain-of-thought prompting on a large language model to generate explainable aesthetics scores for product images.
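
A minimal sketch of what such a labeling step could look like: build a chain-of-thought prompt, then split the model's reply into an explanation and a numeric score. The template wording, the 1 to 5 scale, and the helper names are hypothetical; the paper's actual prompt (shown in its Figure 3) is not reproduced here, and no model API is called.

```python
import re
from typing import Optional, Tuple

# Hypothetical template: asks for step-by-step reasoning before a final score.
COT_TEMPLATE = (
    "You are rating product photos for an online marketplace.\n"
    "For image {image_id}, reason step by step about lighting, framing, "
    "background clutter, and how clearly the item is shown.\n"
    "Then finish with a line of the form 'Score: <integer from 1 to 5>'."
)

# Matches a final score line such as "Score: 4" (assumed 1-5 scale).
SCORE_RE = re.compile(r"Score:\s*([1-5])\b")


def build_prompt(image_id: str) -> str:
    """Fill the template for one image identifier."""
    return COT_TEMPLATE.format(image_id=image_id)


def parse_response(text: str) -> Tuple[str, Optional[int]]:
    """Return (explanation, score); score is None when no score line is found."""
    matches = list(SCORE_RE.finditer(text))
    if not matches:
        return text.strip(), None
    last = matches[-1]  # use the final score line if the model restates it
    return text[: last.start()].strip(), int(last.group(1))
```

Keeping the explanation next to the score is what makes the label inspectable: a reviewer can read why an image was rated low instead of seeing only a number.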

If this is right

  • Image quality assessment becomes feasible at scale without direct human annotation.
  • Search systems can optimize for visual appeal in addition to relevance.
  • Explanations from the model support debugging and customer experience improvements.
  • The approach serves as a low-cost way to test new labeling strategies before full deployment.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar prompting techniques might work for other subjective judgments like item condition or trendiness.
  • If the LLM has biases from its training corpus, the labels could systematically favor certain styles of photography.
  • Further experiments could test whether the sales lift persists over longer periods or across different product categories.

Load-bearing premise

That the aesthetics labels generated by the LLM accurately reflect genuine human visual preferences rather than being shaped by the model's pretraining or the specific prompt wording.

What would settle it

Running a head-to-head comparison where the same set of images is rated both by the LLM and by human judges, then checking which set of labels better predicts actual user engagement metrics on the platform.
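
The comparison described above could be scored with ROC AUC: for each label source, estimate the probability that a clicked item receives a higher label than a non-clicked one. The sketch below uses a plain pairwise AUC with ties counted as half; the function name and the toy numbers are assumptions made for illustration and are not results from the paper.

```python
from typing import Sequence


def auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Pairwise ROC AUC: P(score of a positive > score of a negative), ties count 0.5."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Toy data (invented): the same six images rated by two hypothetical sources.
toy_clicked = [1, 0, 1, 0, 1, 0]
toy_llm_scores = [5, 2, 4, 3, 3, 1]
toy_human_scores = [4, 4, 3, 2, 5, 1]

toy_llm_auc = auc(toy_clicked, toy_llm_scores)
toy_human_auc = auc(toy_clicked, toy_human_scores)
```

Whichever source has the higher AUC on held-out engagement data is the better predictor of user behavior under this metric. The pairwise loop is quadratic, so a rank-based implementation would be preferable for large logs.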

Figures

Figures reproduced from arXiv: 2408.11349 by Chingis Oinar, Miao Cao, Shanshan Fu.

Figure 1: The search result page user interface in the Mercari app. The figure shows how we display items to users, in a [...]
Figure 2: The data collection and processing pipeline. Price filtering and position windowing are applied to SERPs to [...]
Figure 3: The prompt for image aesthetic evaluation and image batch examples. The images with green borders are [...]
Figure 4: The score distributions for clicked items and not clicked items. In the chart on the right-hand side, the colors [...]
Figure 5: Most common adjectives from LLM analysis that are unique to clicked and not clicked items on the left and [...]
Figure 6: We adopt the findings of visual perception proposed by CLIP-IQA. We find that CLIP embeddings are also [...]
Figure 7: We adopt the pre-trained image encoder of CLIP as the backbone of our architecture, hence we only train an [...]
Figure 8: The Elasticsearch online indexing pipeline with the Image Score model.
Figure 9: We show the effect of the proposed Image Score model by comparing the original ordering in the Search [...]
Figure 10: We show the most common adjectives from LLM analysis for both clicked and not clicked items. As [...]
Original abstract

Mercari is the largest C2C e-commerce marketplace in Japan, having more than 20 million active monthly users. Search being the fundamental way to discover desired items, we have always had a substantial amount of data with implicit feedback. Although we actively take advantage of that to provide the best service for our users, the correlation of implicit feedback for such tasks as image quality assessment is not trivial. Many traditional lines of research in Machine Learning (ML) are similarly motivated by the insatiable appetite of Deep Learning (DL) models for well-labelled training data. Weak supervision is about leveraging higher-level and/or noisier supervision over unlabeled data. Large Language Models (LLMs) are being actively studied and used for data labelling tasks. We present how we leverage a Chain-of-Thought (CoT) to enable LLM to produce image aesthetics labels that correlate well with human behavior in e-commerce settings. Leveraging LLMs is more cost-effective compared to explicit human judgment, while significantly improving the explainability of deep image quality evaluation which is highly important for customer journey optimization at Mercari. We propose a cost-efficient LLM-driven approach for assessing and predicting image quality in e-commerce settings, which is very convenient for proof-of-concept testing. We show that our LLM-produced labels correlate with user behavior on Mercari. Finally, we show our results from an online experimentation, where we achieved a significant growth in sales on the web platform.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The paper presents an LLM-based approach using Chain-of-Thought prompting to generate image aesthetics labels for items on the Mercari C2C marketplace. It claims these labels correlate with user behavior on the platform and that deploying them in an online experiment produced significant sales growth.

Significance. If the empirical results are properly validated, the work demonstrates a scalable and cost-effective alternative to human labeling for image quality assessment in e-commerce search, with potential to improve ranking and user experience.

major comments (1)
  1. [Abstract and Online Experimentation] Abstract and results on online experimentation: the claim of 'significant growth in sales' from deploying the image scores provides no description of the A/B test design, primary metric (GMV, conversion, or CTR), statistical tests, sample sizes, power analysis, controls for concurrent ranking or UI changes, holdout period, or exclusion criteria. This prevents verification that the lift is causally attributable to the image scores rather than confounding factors.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for highlighting the need for greater transparency in our description of the online experiment. We address this point below and will revise the manuscript accordingly.

Point-by-point responses
  1. Referee: [Abstract and Online Experimentation] Abstract and results on online experimentation: the claim of 'significant growth in sales' from deploying the image scores provides no description of the A/B test design, primary metric (GMV, conversion, or CTR), statistical tests, sample sizes, power analysis, controls for concurrent ranking or UI changes, holdout period, or exclusion criteria. This prevents verification that the lift is causally attributable to the image scores rather than confounding factors.

    Authors: We agree that the current manuscript provides insufficient detail on the online experiment to allow independent verification of causality. In the revised version we will add a dedicated subsection (and expand the abstract) that specifies: the A/B test design and randomization procedure, the primary metric (GMV), sample sizes and power analysis, the statistical tests employed, controls for concurrent ranking or UI changes, holdout periods, and exclusion criteria. These additions will make explicit that the reported sales lift is attributable to the image-score intervention.

    revision: yes
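
To illustrate the kind of statistical test the referee asks the authors to report, here is a minimal sketch of a two-sided two-proportion z-test on conversion counts from a control and a treatment arm. This is an assumption chosen for simplicity, not the paper's analysis: the authors name GMV as the primary metric, which is a continuous quantity and would call for a different test (for example a t-test or a bootstrap). The function name and the counts in the usage note are hypothetical.

```python
from math import sqrt
from statistics import NormalDist
from typing import Tuple


def two_proportion_z_test(
    conversions_a: int, users_a: int, conversions_b: int, users_b: int
) -> Tuple[float, float]:
    """Two-sided z-test for a difference in conversion rate (arm B minus arm A).

    Returns (z statistic, p-value) using the pooled-variance normal approximation.
    """
    if users_a <= 0 or users_b <= 0:
        raise ValueError("each arm needs at least one user")
    rate_a = conversions_a / users_a
    rate_b = conversions_b / users_b
    pooled = (conversions_a + conversions_b) / (users_a + users_b)
    std_err = sqrt(pooled * (1 - pooled) * (1 / users_a + 1 / users_b))
    if std_err == 0:
        raise ValueError("test is undefined when the pooled rate is 0 or 1")
    z = (rate_b - rate_a) / std_err
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value
```

For example, `two_proportion_z_test(1000, 20000, 1100, 20000)` compares a 5.0% control rate with a 5.5% treatment rate on made-up counts. A small p-value alone does not establish causality; randomization, concurrent-change controls, and exclusion criteria still have to be documented as the referee notes.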

Circularity Check

0 steps flagged

No circularity; empirical claims rest on observed correlations and A/B results

full rationale

The paper presents an empirical workflow: LLM-generated image aesthetics labels (via CoT prompting) are shown to correlate with Mercari user behavior, followed by an online experiment reporting sales growth. No equations, fitted parameters, or first-principles derivations appear. No step reduces a claimed prediction to a self-defined input, a fitted subset, or a self-citation chain. The central results are external benchmarks (user logs, A/B lift) rather than quantities defined in terms of themselves. This is the normal case of a non-circular empirical ML paper.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Only the abstract is available; no explicit free parameters, axioms, or invented entities are stated. The approach implicitly assumes LLMs can produce reliable aesthetics judgments via CoT without domain-specific fine-tuning.

axioms (1)
  • Domain assumption: chain-of-thought prompting enables LLMs to generate image aesthetics labels that align with human preferences in e-commerce.
    Invoked in the description of the labeling process and the correlation claim.

pith-pipeline@v0.9.0 · 5785 in / 1271 out tokens · 21955 ms · 2026-05-23T21:41:01.022219+00:00 · methodology

