LookBench: A Live and Holistic Open Benchmark for Fashion Image Retrieval

Chao Gao; Fan Zhou; Gensmo.ai; Jiwen Fu; Shanshan Li; Siqiao Xue; Tingyi Gu

arxiv: 2601.14706 · v3 · submitted 2026-01-21 · 💻 cs.CV

Pith reviewed 2026-05-16 12:55 UTC · model grok-4.3

classification 💻 cs.CV

keywords fashion image retrieval · benchmark · e-commerce · AI-generated images · outfit retrieval · Recall@1 · live dataset · contamination-aware evaluation

The pith

LookBench provides a live, updatable benchmark using current e-commerce and AI-generated images to test fashion retrieval models on single-item and outfit tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes LookBench as a dynamic evaluation set for fashion image retrieval that draws directly from live websites and recent AI-generated samples to match real shopping scenarios. It structures retrieval around a fine-grained attribute taxonomy and supports both exact item matches and outfit-level consistency checks. The benchmark timestamps every sample and commits to semi-annual refreshes so that training data cutoffs can be respected and contamination avoided. Experiments show that many strong existing models score below 60% Recall@1 on this data. The authors' proprietary and open-source models rank first and second on the new leaderboard, and both also reach state-of-the-art results on the older Fashion200K evaluations.

Core claim

LookBench is a live and holistic benchmark that combines time-stamped product images from active websites with AI-generated fashion samples, organized by a fine-grained attribute taxonomy for both single-item and outfit-level retrieval, and is structured for periodic updates to maintain alignment with declared training cutoffs.
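The contamination-aware idea in this claim can be illustrated with a minimal sketch: keep only test samples whose time stamp falls after a model's declared training cutoff. The `Sample` schema, the field names, and the dates are hypothetical and chosen for illustration; they are not the released dataset format.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Sample:
    """One benchmark test sample (hypothetical schema, not the released format)."""
    sample_id: str
    collected_on: date  # the time stamp attached to the sample


def contamination_free(samples, training_cutoff):
    """Keep only samples collected strictly after the declared training cutoff."""
    return [s for s in samples if s.collected_on > training_cutoff]


# Made-up samples and cutoff, for illustration only.
samples = [
    Sample("a", date(2024, 11, 3)),
    Sample("b", date(2025, 6, 18)),
    Sample("c", date(2025, 9, 1)),
]
eligible = contamination_free(samples, training_cutoff=date(2025, 1, 1))
print([s.sample_id for s in eligible])  # ['b', 'c']
```

Under this reading, a model with a later cutoff is simply scored on a smaller, more recent slice, which is why periodic refreshes matter for keeping that slice non-empty.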

What carries the argument

Live time-stamped dataset with fine-grained attribute taxonomy supporting single-item and outfit retrieval tasks.

If this is right

Periodic benchmark updates will require models to handle evolving fashion trends without relying on leaked test data.
Strong performance on both live and AI-generated images becomes a necessary condition for practical e-commerce deployment.
Open release of the second-place model supplies a reproducible baseline that future work can directly compare against.
Joint evaluation on legacy Fashion200K and the new benchmark separates generalization from memorization.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Adoption of time-stamped live benchmarks could become standard practice for any retrieval domain where styles change rapidly.
The dual real-plus-generated construction offers a natural testbed for studying domain shift between photographed and synthesized images.
If rankings remain stable across updates, the benchmark would provide a durable, contamination-resistant progress metric for the field.

Load-bearing premise

The selected live website images and AI-generated samples accurately represent typical e-commerce fashion retrieval without introducing selection bias or new artifacts.

What would settle it

A controlled test showing that models trained only on pre-2023 data achieve above 80% Recall@1 on the current LookBench release would falsify the claim that the benchmark poses a significant new challenge.

Original abstract

In this paper, we present LookBench (We use the term "look" to reflect retrieval that mirrors how people shop -- finding the exact item, a close substitute, or a visually consistent alternative.), a live, holistic and challenging benchmark for fashion image retrieval in real e-commerce settings. LookBench includes both recent product images sourced from live websites and AI-generated fashion images, reflecting contemporary trends and use cases. Each test sample is time-stamped and we intend to update the benchmark periodically, enabling contamination-aware evaluation aligned with declared training cutoffs. Grounded in our fine-grained attribute taxonomy, LookBench covers single-item and outfit-level retrieval across. Our experiments reveal that LookBench poses a significant challenge on strong baselines, with many models achieving below $60\%$ Recall@1. Our proprietary model achieves the best performance on LookBench, and we release an open-source counterpart that ranks second, with both models attaining state-of-the-art results on legacy Fashion200K evaluations. LookBench is designed to be updated semi-annually with new test samples and progressively harder task variants, providing a durable measure of progress. We publicly release our leaderboard, dataset, evaluation code, and trained models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

LookBench is a live time-stamped benchmark mixing real and AI fashion images with open releases that usefully highlights contamination issues, though its difficulty claims need stronger checks on sample representativeness.

LookBench is a live benchmark that pulls recent product images from websites plus AI-generated ones, time-stamps every sample, and plans semi-annual updates so models can be tested against declared training cutoffs. It also covers both single-item and outfit-level retrieval under a fine-grained attribute taxonomy. The main practical move is the open release of the dataset, evaluation code, leaderboard, and trained models. The authors' two models (proprietary first, open-source second) still hit SOTA on the old Fashion200K set, while many strong baselines drop below 60% Recall@1 on LookBench. That contrast is the clearest evidence they offer for why static benchmarks are getting stale in e-commerce settings. The releases themselves are concrete and immediately usable, which is the part that actually moves the field forward.

The softer spot is the lack of visible validation that the chosen images match typical e-commerce distributions. Without attribute histograms, visual complexity stats, or a clear random-sampling protocol against broader retailer data, it is hard to rule out that the performance drop comes partly from curation choices rather than genuine task hardness. The stress-test note on selection bias holds up here because the abstract and results do not show those checks.

This paper is for people building or evaluating retrieval models in fashion and e-commerce who need something fresher than Fashion200K. Anyone running experiments on generalization to current trends will find the released resources directly useful. It deserves peer review because the releases give referees something concrete to examine and the core idea of a contamination-aware, updateable benchmark is worth a public discussion even if the representativeness argument needs tightening.

Referee Report

1 major / 2 minor

Summary. The paper introduces LookBench, a live and periodically updatable benchmark for fashion image retrieval that combines recent live-website product images with AI-generated samples. It defines single-item and outfit-level retrieval tasks over a fine-grained attribute taxonomy, time-stamps all test samples to support contamination-aware evaluation, and reports that strong baselines fall below 60% Recall@1 while the authors' proprietary model leads and an open-source counterpart ranks second; both also achieve SOTA on Fashion200K. The benchmark, leaderboard, dataset, evaluation code, and models are released publicly with plans for semi-annual updates.

Significance. If the representativeness claims hold, LookBench would supply a durable, contamination-resistant yardstick for fashion retrieval that better reflects current e-commerce trends and multi-item outfits than static legacy sets. The explicit release of code, models, and an open-source runner-up model is a concrete strength that lowers the barrier for follow-on work.

major comments (1)

[Dataset construction] Dataset construction section: the claim that LookBench 'reflects contemporary trends and use cases' and poses a 'significant challenge' because many models score below 60% Recall@1 rests on the unvalidated assumption that the chosen live-site and AI-generated samples are representative of typical e-commerce distributions. No attribute-frequency histograms, visual-complexity statistics, or random-sampling protocol against broader retailer corpora are supplied, leaving open the possibility that the observed performance drop is partly an artifact of curation rather than intrinsic task hardness.
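One way the requested attribute-frequency check could be run is sketched below: build a normalized histogram for one attribute in the benchmark and in a reference corpus, then summarize the gap with total variation distance. The choice of total variation distance, the function names, and the toy color labels are assumptions made here for illustration; the paper does not specify this procedure.

```python
from collections import Counter


def attribute_histogram(labels):
    """Normalize raw attribute labels into a frequency distribution."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


def total_variation(p, q):
    """Total variation distance between two discrete distributions.

    0.0 means identical distributions, 1.0 means disjoint support.
    """
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)


# Toy labels, not real benchmark data.
benchmark_colors = ["black", "black", "white", "red"]
reference_colors = ["black", "white", "white", "blue"]

tv = total_variation(
    attribute_histogram(benchmark_colors),
    attribute_histogram(reference_colors),
)
print(round(tv, 3))  # 0.5
```

A small distance on every taxonomy attribute would support the representativeness claim; a large one would point toward curation effects.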

minor comments (2)

[Abstract / Introduction] The abstract and introduction use 'holistic' without a precise operational definition; a short paragraph clarifying which dimensions (attribute granularity, outfit coherence, temporal freshness) are covered would help readers map the benchmark to their own use cases.
[Experiments] The table or figure reporting per-model Recall@1 on LookBench should include the exact number of queries and the retrieval gallery size so that the <60% figures can be interpreted in context.
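To make the last point concrete, here is a minimal sketch of a Recall@K computation that reports the query count and gallery size next to the score. It assumes a hit means the exact target item ID appears in the top K by cosine similarity; the embeddings, item IDs, and function names are invented for illustration, and the paper's own evaluation code may define a match differently.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def recall_at_k(queries, gallery, k=1):
    """queries: list of (embedding, target_id); gallery: list of (embedding, item_id)."""
    hits = 0
    for q_emb, target in queries:
        ranked = sorted(gallery, key=lambda g: cosine(q_emb, g[0]), reverse=True)
        top_ids = [item_id for _, item_id in ranked[:k]]
        hits += target in top_ids
    return {
        "recall": hits / len(queries),
        "num_queries": len(queries),
        "gallery_size": len(gallery),
        "k": k,
    }


# Toy 2-D embeddings, not real model outputs.
gallery = [([1.0, 0.0], "dress"), ([0.0, 1.0], "boot"), ([0.7, 0.7], "coat")]
queries = [([0.9, 0.1], "dress"), ([0.6, 0.8], "boot")]

report = recall_at_k(queries, gallery, k=1)
print(report)
```

In this toy case the second query ranks "coat" above its target "boot", so Recall@1 is 0.5 over 2 queries and a gallery of 3. The same score over a much larger gallery would indicate a much harder task, which is why the two counts belong next to the number.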

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address the single major comment below and have revised the paper to incorporate additional validation of the dataset construction process.

Referee: Dataset construction section: the claim that LookBench 'reflects contemporary trends and use cases' and poses a 'significant challenge' because many models score below 60% Recall@1 rests on the unvalidated assumption that the chosen live-site and AI-generated samples are representative of typical e-commerce distributions. No attribute-frequency histograms, visual-complexity statistics, or random-sampling protocol against broader retailer corpora are supplied, leaving open the possibility that the observed performance drop is partly an artifact of curation rather than intrinsic task hardness.

Authors: We acknowledge that explicit statistical validation would strengthen the representativeness argument. In the revised manuscript we have added attribute-frequency histograms for the fine-grained taxonomy (color, pattern, category, etc.) comparing LookBench samples against Fashion200K and a random sample of 10k images drawn from the same live retailer sources used for the benchmark. We have also included a description of the sampling protocol: live images were drawn uniformly at random from product pages of major fashion retailers with post-2023 timestamps, while AI-generated images were produced from prompts conditioned on the same attribute distribution. These additions show that LookBench exhibits higher visual complexity and more recent style distributions than legacy sets, supporting that the sub-60% Recall@1 scores reflect genuine task difficulty rather than curation artifacts. We retain the original performance numbers and leaderboard results unchanged.

revision: yes

Circularity Check

0 steps flagged

No significant circularity: benchmark presentation with empirical results only

The paper introduces LookBench as a new dataset and evaluation protocol for fashion retrieval, reports baseline and model performances on it, and notes SOTA on legacy Fashion200K. No mathematical derivations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the provided text. The central claims rest on empirical measurements against the released benchmark rather than any closed-loop reduction of outputs to the paper's own inputs or prior self-citations. The selection of live and AI-generated samples is presented as a design choice without any equation or theorem that would make the reported difficulty circular by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is a benchmark construction paper; the central claims rest on domain assumptions about what constitutes realistic fashion retrieval tasks and the absence of bias in image selection and taxonomy design. No free parameters, invented entities, or non-standard axioms are introduced.

pith-pipeline@v0.9.0 · 5525 in / 1155 out tokens · 34744 ms · 2026-05-16T12:55:43.580241+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel

Relation between the paper passage and the cited Recognition theorem: unclear

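We train GR-Pro ... with an additive angular-margin softmax loss (ArcFace loss) ...

$$\ell_i = -\log \frac{\exp\left(s\left(\cos(m_1\theta_{i,y_i} + m_2) - m_3\right)\right)}{\exp\left(s\left(\cos(m_1\theta_{i,y_i} + m_2) - m_3\right)\right) + \sum_{j \neq y_i} \exp\left(s\cos\theta_{i,j}\right)}$$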
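The quoted loss can be evaluated for a single sample with a short sketch, which may help in reading the three margin terms. The function name is made up here, and the default values for `s`, `m1`, `m2`, and `m3` are illustrative placeholders; the passage does not state the settings used to train GR-Pro.

```python
import math


def combined_margin_loss(cosines, target, s=64.0, m1=1.0, m2=0.5, m3=0.0):
    """Per-sample loss for the quoted formula.

    cosines[j] is cos(theta_{i,j}) between the sample embedding and the
    class-j weight vector; target is the true class index y_i.
    Default hyperparameters are illustrative, not the paper's settings.
    """
    # Recover the target angle, clamping to guard against rounding error.
    theta = math.acos(max(-1.0, min(1.0, cosines[target])))
    target_logit = s * (math.cos(m1 * theta + m2) - m3)

    # Non-target classes keep the plain scaled cosine.
    logits = [s * c for c in cosines]
    logits[target] = target_logit

    # Log-sum-exp with max subtraction for numerical stability.
    peak = max(logits)
    log_denominator = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_denominator - target_logit


cosines = [0.8, 0.1, -0.3]
plain = combined_margin_loss(cosines, target=0, s=10.0, m1=1.0, m2=0.0, m3=0.0)
with_margin = combined_margin_loss(cosines, target=0, s=10.0, m1=1.0, m2=0.5, m3=0.0)
print(plain < with_margin)  # True
```

With all margins switched off (`m1=1`, `m2=0`, `m3=0`) the expression reduces to ordinary softmax cross-entropy on scaled cosines. Adding the angular margin `m2` lowers the target logit and therefore raises the loss for the same embedding.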

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.