
arxiv: 2405.10498 · v2 · submitted 2024-05-17 · 💰 econ.GN · q-fin.EC

A Deep Learning Approach to Heterogeneous Consumer Aesthetics in Fast Fashion

Pith reviewed 2026-05-24 01:23 UTC · model grok-4.3

classification 💰 econ.GN q-fin.EC
keywords consumer demand · fashion aesthetics · deep learning embeddings · latent class models · substitution patterns · hedonic pricing · heterogeneity · sustainability counterfactuals

The pith

Fine-tuned Fashion CLIP embeddings in a three-tower architecture feed a latent-class deep demand system that captures heterogeneous consumer aesthetics and substitution patterns from H&M purchases.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper develops a method to incorporate visual aesthetics into economic models of consumer demand using deep learning on large-scale retail data. It fine-tunes Fashion CLIP embeddings through separate channels for product visuals and text, consumer history, and price, then feeds the results into a latent-class deep demand system. The system models price and taste sensitivities, recovers substitution patterns and heterogeneity, and is reported to outperform standard alternatives. It further enables supply-side analysis of markups and sustainability counterfactuals, plus improved hedonic pricing and an event study around the COVID-19 lockdown. A sympathetic reader would care because aesthetics drive differentiation in fashion yet remain difficult to encode in formal choice models.

Core claim

The paper fine-tunes Fashion CLIP embeddings with a three-tower approach that builds separate channels for product visuals and text, consumer history, and price; the resulting embeddings feed a latent-class deep demand system. This system captures price and taste sensitivities through deep nets, recovers rich substitution patterns, reveals meaningful heterogeneity, and performs much better than competing alternatives. Supply-side inversion then recovers sensible markups and costs to support conduct tests and counterfactuals on sustainability practices, while machine learning hedonic models enable quality-adjusted price indices, pricing of new designs, and Oaxaca-Blinder decompositions of price changes.
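
To make the latent-class structure concrete, here is a minimal sketch of how choice shares arise as a mixture of logits. It is an illustration under stated assumptions, not the paper's estimator: the linear utility form, the outside option with utility zero, and every name (`latent_class_shares`, `taste_scores`, `price_coefs`) are hypothetical. In the paper the taste scores and price sensitivities come from deep nets; here they are plain numbers.

```python
import math


def softmax(utilities):
    """Numerically stable softmax over a list of utilities."""
    m = max(utilities)
    exps = [math.exp(u - m) for u in utilities]
    total = sum(exps)
    return [e / total for e in exps]


def latent_class_shares(class_weights, taste_scores, price_coefs, prices):
    """Mixture-of-logits shares.

    Assumed form: class k has utility taste_scores[k][j] - price_coefs[k] * prices[j]
    for product j, and an outside option with utility 0.
    Returns (inside_shares, outside_share).
    """
    inside = [0.0] * len(prices)
    outside = 0.0
    for weight, tastes, alpha in zip(class_weights, taste_scores, price_coefs):
        utilities = [0.0] + [t - alpha * p for t, p in zip(tastes, prices)]
        probs = softmax(utilities)
        outside += weight * probs[0]
        for j, prob in enumerate(probs[1:]):
            inside[j] += weight * prob
    return inside, outside
```

Because each class has its own taste vector and price coefficient, a price increase on one product shifts demand differently within each class, which is what lets the aggregate substitution patterns depart from those of a single logit.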

What carries the argument

The three-tower fine-tuned Fashion CLIP embeddings, built from separate channels for product visuals and text, consumer history, and price, which feed the latent-class deep demand system.
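
The separation into three channels can be sketched as follows. This is a toy illustration only: the page does not show the paper's actual tower definitions, so the linear item projection, the mean-pooled purchase history, the log-price term, and all function names here are assumptions chosen for brevity.

```python
import math


def linear(weights, vector):
    """Apply a weight matrix (list of rows) to a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in weights]


def item_tower(clip_vector, item_weights):
    """Project a precomputed image/text embedding into a smaller item space."""
    return linear(item_weights, clip_vector)


def user_tower(history_item_embeddings):
    """Summarize a shopper as the mean of the item embeddings they bought."""
    n = len(history_item_embeddings)
    dim = len(history_item_embeddings[0])
    return [sum(e[d] for e in history_item_embeddings) / n for d in range(dim)]


def price_tower(price, price_sensitivity):
    """Scalar price channel: higher prices lower the score (assumed log form)."""
    return -price_sensitivity * math.log(price)


def score(user_vec, item_vec, price_term):
    """Match score: taste (dot product) plus the separate price channel."""
    return sum(u * v for u, v in zip(user_vec, item_vec)) + price_term
```

Keeping price in its own channel means the taste match between user and item vectors is not contaminated by price variation, which is the property a downstream demand system needs in order to attach a price coefficient.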

If this is right

  • The supply-side inversion recovers sensible markups and costs that support conduct tests and counterfactuals on sustainability practices.
  • Machine learning hedonic pricing models perform much better than competing alternatives.
  • Quality-adjusted price indices can be constructed and completely new designs can be priced.
  • An Oaxaca-Blinder decomposition reveals the underlying sources of observed price changes.
  • A Poisson event study around the COVID-19 lockdown shows demand response ranges across embedding-based clusters that exceed those from text attributes or demographics alone.
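
As a rough illustration of the event-study bullet, the sketch below computes a lockdown effect for one cluster and the range of effects across clusters. It assumes the simplest possible Poisson specification (an intercept plus a lockdown dummy), for which the maximum-likelihood coefficient equals the log ratio of mean daily counts; the paper's specification may include further controls. The function names are hypothetical.

```python
import math


def lockdown_effect_percent(daily_counts, lockdown_flags):
    """Percent change in expected daily transactions during lockdown.

    With only an intercept and a lockdown dummy, the Poisson MLE is
    kappa_hat = log(mean count during lockdown / mean count before),
    and the reported effect is 100 * (exp(kappa_hat) - 1).
    """
    before = [c for c, f in zip(daily_counts, lockdown_flags) if not f]
    during = [c for c, f in zip(daily_counts, lockdown_flags) if f]
    kappa_hat = math.log((sum(during) / len(during)) / (sum(before) / len(before)))
    return 100.0 * (math.exp(kappa_hat) - 1.0)


def heterogeneity_range(effects_by_cluster):
    """Spread between the most and least affected clusters, in percentage points."""
    values = list(effects_by_cluster.values())
    return max(values) - min(values)
```

Comparing `heterogeneity_range` for embedding-based clusters against the same quantity for text-attribute or demographic groupings is one way to read the claim that embeddings reveal a wider range of demand responses.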

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The methodology could extend to other sensory-differentiated markets such as interior decor or hospitality where visual attributes matter but resist standard encoding.
  • The reported outperformance implies that adapting pre-trained vision-language models can make aesthetic heterogeneity tractable in demand estimation without major representational loss.
  • Sustainability counterfactuals enabled by the model could inform policy evaluation in fast fashion by isolating effects of design changes from observed consumer clusters.

Load-bearing premise

Fine-tuning Fashion CLIP via the three-tower architecture on product visuals, text, consumer history, and price produces embeddings that faithfully represent the aesthetic dimensions driving consumer choice without substantial information loss or bias from the pre-trained model.
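
One common way to fine-tune user and item embeddings jointly is a contrastive InfoNCE objective, in which each shopper's purchased item is the positive and the other items in the batch act as negatives. Whether the paper uses exactly this loss is an assumption here; the sketch, its `temperature` default, and the function name are illustrative only.

```python
import math


def info_nce(user_vecs, item_vecs, temperature=0.1):
    """Mean InfoNCE loss over a batch.

    Assumes item_vecs[i] is the item purchased by user_vecs[i]; every other
    item in the batch serves as an in-batch negative for that user.
    """
    losses = []
    for i, user in enumerate(user_vecs):
        logits = [
            sum(a * b for a, b in zip(user, item)) / temperature
            for item in item_vecs
        ]
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)
```

A loss of this kind rewards embeddings that predict purchases, which is precisely why the premise needs checking: an embedding can predict choices well while discarding aesthetic detail that did not matter in the training sample.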

What would settle it

If the latent-class deep demand system using these embeddings fails to outperform standard discrete choice models in out-of-sample prediction of purchases, cross-price elasticities, or substitution patterns on held-out H&M data, the central performance claim would be falsified.
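
A held-out comparison of this kind typically reduces to a few scalar metrics per model. The sketch below computes two of them, mean log-likelihood and top-1 hit rate, from predicted choice probabilities. The metric choice and the function name are assumptions for illustration; the same function would be applied to each competing model on identical held-out purchases.

```python
import math


def out_of_sample_metrics(predicted_probs, chosen):
    """Return (mean log-likelihood, top-1 hit rate) on held-out choices.

    predicted_probs[n] is a list of choice probabilities for occasion n;
    chosen[n] is the index of the product actually purchased.
    """
    n = len(chosen)
    mean_log_lik = sum(math.log(p[c]) for p, c in zip(predicted_probs, chosen)) / n
    hits = sum(
        1
        for p, c in zip(predicted_probs, chosen)
        if max(range(len(p)), key=p.__getitem__) == c
    )
    return mean_log_lik, hits / n
```

The central claim would be in trouble if a standard discrete choice model matched or beat the embedding-based system on these numbers for the same held-out data.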

Figures

Figures reproduced from arXiv: 2405.10498 by Pranjal Rawat.

Figure 1: Two synthetic dresses produced by a generative image model. Each dot is the [PITH_FULL_IMAGE:figures/full_fig_p006_1.png]
Figure 2: Product assortment examples spanning price points and categories. [PITH_FULL_IMAGE:figures/full_fig_p009_2.png]
Figure 3: The modular estimation pipeline. Embeddings are fine-tuned once and then frozen [PITH_FULL_IMAGE:figures/full_fig_p012_3.png]
Figure 4: Training timeline. Embeddings are learned strictly on the [PITH_FULL_IMAGE:figures/full_fig_p015_4.png]
Figure 5: Markup inversion pipeline. Observed prices, estimated demand (the Jacobian [PITH_FULL_IMAGE:figures/full_fig_p019_5.png]
Figure 6: Individual-level price sensitivity (left) and the dispersion of taste scores across [PITH_FULL_IMAGE:figures/full_fig_p021_6.png]
Figure 7: Own-price elasticity heterogeneity. Left, the distribution of [PITH_FULL_IMAGE:figures/full_fig_p021_7.png]
Figure 8: Younger and higher-spending shoppers prefer visibly different dresses. Columns [PITH_FULL_IMAGE:figures/full_fig_p022_8.png]
Figure 9: Younger and higher-spending shoppers prefer visibly different shirts and shorts, and [PITH_FULL_IMAGE:figures/full_fig_p022_9.png]
Figure 10: Within-tier aesthetic ranking. Each row shows products that share the same [PITH_FULL_IMAGE:figures/full_fig_p024_10.png]
Figure 11: Substitution structure at the cluster level. Left, the column-normalized four-by-four [PITH_FULL_IMAGE:figures/full_fig_p025_11.png]
Figure 12: Cross-elasticity matrices for two consumer clusters obtained by [PITH_FULL_IMAGE:figures/full_fig_p026_12.png]
Figure 13: Monthly aesthetic cluster rotation driven by the low-rank taste shift [PITH_FULL_IMAGE:figures/full_fig_p027_13.png]
Figure 14: Implied costs and the revenue-versus-profit ranking. Left, mean implied marginal [PITH_FULL_IMAGE:figures/full_fig_p028_14.png]
Figure 15: Sustainability prune for dresses, ordered by baseline inside share (smallest first). [PITH_FULL_IMAGE:figures/full_fig_p029_15.png]
Figure 16: The six synthetic dresses on which the two latent consumer classes disagree most [PITH_FULL_IMAGE:figures/full_fig_p032_16.png]
Figure 17: Predicted fair-value dress prices by color and pattern. Cells with fewer than five [PITH_FULL_IMAGE:figures/full_fig_p034_17.png]
Figure 18: Image-edit counterfactuals on two real H&M dresses. Each column applies one [PITH_FULL_IMAGE:figures/full_fig_p035_18.png]
Figure 19: Chained price indices for dresses over the two-year window, with the [PITH_FULL_IMAGE:figures/full_fig_p037_19.png]
Figure 20: Fair-value tracking for five new dresses introduced only in the [PITH_FULL_IMAGE:figures/full_fig_p038_20.png]
Figure 21: Oaxaca-Blinder waterfall decomposition of the dress log-price change between [PITH_FULL_IMAGE:figures/full_fig_p040_21.png]
Figure 22: Dresses ranked by the per-product valuation effect from the Oaxaca-Blinder [PITH_FULL_IMAGE:figures/full_fig_p040_22.png]
Figure 23: Item-cluster view. Representative products from the two most extreme dress item [PITH_FULL_IMAGE:figures/full_fig_p045_23.png]
Figure 24: User-cluster view. Lockdown effect per user embedding cluster (blue) and per [PITH_FULL_IMAGE:figures/full_fig_p046_24.png]
Figure 25: User-cluster view. Recovery trajectories by user embedding cluster across the four [PITH_FULL_IMAGE:figures/full_fig_p047_25.png]
Figure 26: Item × user-cell view. Lockdown effect on daily dress transactions at the item-cluster by user-cluster cell level, from per-cell M1 fits (Equation (24) fit once per (i, j) cell on that cell's daily count). Each point is $100 \cdot [\exp(\hat{\kappa}_{\text{Lockdown}}) - 1]$ for one of the 132 cells; horizontal bars are 95% confidence intervals. Green markers flag cells with positive point estimates. The dashed vertical line and lig…
Figure 27: Item × user-cell view. Tail cross-cells from the per-cell M1 distribution in [PITH_FULL_IMAGE:figures/full_fig_p049_27.png]
Figure 28: Distribution of the M1 heterogeneity range across [PITH_FULL_IMAGE:figures/full_fig_p059_28.png]
Figure 29: What the 64-dimensional item embedding looks like compared with the hand-coded [PITH_FULL_IMAGE:figures/full_fig_p060_29.png]
Figure 30: A path through the 64-dimensional dress embedding space. Top: A PCA projection of all 5870 dress embeddings (gray dots). The nine interpolation dresses are highlighted as a colored path from blue (Dress A) to red (Dress B), with thumbnail images at each stop. Bottom: Starting from Dress A (a yellow printed dress, cosine = 1.00), each column shows the actual dress nearest to an evenly-spaced interpolatio…
Figure 32: Representative product images from the four clusters. The user tower encodes purchase behaviour orthogonal to observed demographics. The Adjusted Rand Index between embedding-based user clusters and age-based demographic segments is near zero (0.047). Standard demand models attribute whatever variation demographics fail to capture to unobserved taste shocks. The user embeddings capture this behaviour sy…
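
The Adjusted Rand Index quoted in the Figure 32 caption measures agreement between two partitions of the same shoppers, corrected for chance: 1 means identical partitions and values near 0 mean no more agreement than random labelling. A minimal standard-library sketch of the usual pair-counting formula follows; the function name is illustrative, and library implementations such as scikit-learn's `adjusted_rand_score` compute the same quantity.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand Index between two labelings of the same items."""
    n = len(labels_a)
    # Pairs of items placed together under both labelings.
    together_both = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    # Pairs placed together under each labeling separately.
    together_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    together_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = together_a * together_b / comb(n, 2)
    maximum = 0.5 * (together_a + together_b)
    if maximum == expected:
        # Degenerate case, e.g. both labelings put everything in one cluster.
        return 1.0
    return (together_both - expected) / (maximum - expected)
```

Applied to embedding-based user clusters and age segments, a value such as 0.047 says the two groupings are close to unrelated, which supports reading the user tower as capturing behaviour that demographics miss.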
Original abstract

Aesthetics drives product differentiation in industries such as fashion, interior decor, luxury goods, real estate and hospitality. However, visual differentiation is hard to encode in formal economic analysis. This paper analyses millions of purchase records from H&M in the Netherlands, including product images, text descriptions, prices, and consumer demographics. I fine-tune Fashion CLIP embeddings with a three-tower approach that builds separate channels for product visuals and text, consumer history, and price, which makes downstream analysis tractable and scalable. The embeddings feed a latent-class deep demand system that captures price and taste sensitivities through deep nets, recovers rich substitution patterns, reveals meaningful heterogeneity, and performs much better than competing alternatives. Then, a supply-side inversion recovers sensible markups and costs and supports conduct tests and counterfactuals on sustainability practices. I also estimate machine learning hedonic pricing models that perform much better than competing alternatives. These models allow us to construct quality-adjusted price indices, price completely new designs, and, with an Oaxaca-Blinder decomposition, reveal the underlying sources of price changes. Finally, a Poisson event study around the COVID-19 lockdown shows that the range of demand responses across embedding-based product and user clusters exceeds anything recoverable from simple text-based attributes or demographic labels alone. The methodology is portable to any market where products are differentiated along sensory dimensions that are hard to encode but meaningfully important for consumer choices.
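
For readers unfamiliar with how an Oaxaca-Blinder decomposition carries over to a machine learning hedonic model, one standard two-part form is sketched below. The notation is an assumption introduced here, not taken from the paper: $h_t$ is the hedonic log-price function fitted in period $t$, $J_t$ is the set of products sold in period $t$, and $x_j$ is the embedding of product $j$.

```latex
\begin{aligned}
\Delta
&= \frac{1}{|J_1|}\sum_{j \in J_1} h_1(x_j) \;-\; \frac{1}{|J_0|}\sum_{j \in J_0} h_0(x_j) \\[4pt]
&= \underbrace{\frac{1}{|J_1|}\sum_{j \in J_1} h_0(x_j) \;-\; \frac{1}{|J_0|}\sum_{j \in J_0} h_0(x_j)}_{\text{composition effect: the assortment changed}}
\;+\;
\underbrace{\frac{1}{|J_1|}\sum_{j \in J_1} \bigl[h_1(x_j) - h_0(x_j)\bigr]}_{\text{valuation effect: the same designs are priced differently}}
\end{aligned}
```

The two terms sum to $\Delta$ exactly because the counterfactual average $\frac{1}{|J_1|}\sum_{j \in J_1} h_0(x_j)$ is added and subtracted. The summand $h_1(x_j) - h_0(x_j)$ is a per-product quantity, so products can be ranked by it.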

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript develops a deep learning framework for modeling heterogeneous consumer aesthetics in fast fashion. It fine-tunes Fashion CLIP embeddings using a three-tower architecture on product visuals/text, consumer history, and price from millions of H&M Netherlands transactions. These embeddings are fed into a latent-class deep demand system that captures price and taste sensitivities via deep nets, recovers rich substitution patterns and heterogeneity, and is claimed to outperform alternatives. The paper further applies supply-side inversion to recover markups and costs, estimates ML hedonic pricing models for quality-adjusted indices and new-design pricing, performs an Oaxaca-Blinder decomposition, and conducts a Poisson event study around the COVID-19 lockdown showing larger demand response variation across embedding-based clusters than from text or demographics alone.

Significance. If the embeddings prove faithful and the performance claims hold under validation, the approach would meaningfully advance the incorporation of hard-to-encode visual and sensory differentiation into structural demand models, enabling better counterfactuals on pricing, sustainability, and indices in differentiated-goods markets such as fashion and luxury goods.

major comments (3)
  1. [Abstract] Abstract: the central claim that the latent-class deep demand system 'recovers rich substitution patterns, reveals meaningful heterogeneity, and performs much better than competing alternatives' is asserted without any reported validation details, baseline comparisons, error metrics, or robustness checks, rendering the performance advantage impossible to assess.
  2. [Abstract] Abstract/Methods description: the three-tower fine-tuning is asserted to produce embeddings that 'faithfully represent the aesthetic dimensions driving consumer choice' and make downstream analysis tractable, yet no tests, diagnostics, or comparisons are supplied to show that pre-training biases in Fashion CLIP are mitigated or that critical visual/textual signals survive integration with consumer history and price channels; this premise is load-bearing for all subsequent claims on substitution, heterogeneity, markups, and event-study results.
  3. [Abstract] Abstract: the supply inversion is said to 'recover sensible markups and costs and support conduct tests,' but no details on identification, instruments, or comparison to standard BLP-style approaches are provided, leaving the conduct-test validity unverified.
minor comments (2)
  1. [Abstract] Abstract: the sample is described only as 'millions of purchase records' without exact N, time span, or product-category coverage.
  2. [Abstract] Abstract: the claim of scalability would benefit from explicit discussion of the number of latent classes and any tuning parameters in the deep nets.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major comment below and indicate planned revisions to strengthen the presentation of validation and identification details.

Point-by-point responses
  1. Referee: [Abstract] Abstract: the central claim that the latent-class deep demand system 'recovers rich substitution patterns, reveals meaningful heterogeneity, and performs much better than competing alternatives' is asserted without any reported validation details, baseline comparisons, error metrics, or robustness checks, rendering the performance advantage impossible to assess.

    Authors: We agree that the abstract would benefit from explicit references to the validation results. The full manuscript reports these in Sections 4.3 (out-of-sample fit comparisons to BLP, nested logit, and neural network baselines) and 5.1 (substitution matrix recovery and heterogeneity diagnostics), including RMSE, hit rates, and robustness to alternative embeddings. In the revision we will condense key metrics into the abstract to make the performance claims directly assessable from the abstract alone. revision: yes

  2. Referee: [Abstract] Abstract/Methods description: the three-tower fine-tuning is asserted to produce embeddings that 'faithfully represent the aesthetic dimensions driving consumer choice' and make downstream analysis tractable, yet no tests, diagnostics, or comparisons are supplied to show that pre-training biases in Fashion CLIP are mitigated or that critical visual/textual signals survive integration with consumer history and price channels; this premise is load-bearing for all subsequent claims on substitution, heterogeneity, markups, and event-study results.

    Authors: The three-tower architecture and training objective are described in Section 3.2. We acknowledge that additional diagnostics would strengthen the claim. In the revision we will add (i) cosine-similarity and retrieval-precision comparisons between original Fashion CLIP and fine-tuned embeddings on held-out aesthetic attributes, (ii) ablation results showing the incremental contribution of the consumer-history and price towers, and (iii) a short discussion of how the contrastive loss mitigates known Fashion CLIP biases. These will be placed in a new subsection of Section 3. revision: yes

  3. Referee: [Abstract] Abstract: the supply inversion is said to 'recover sensible markups and costs and support conduct tests,' but no details on identification, instruments, or comparison to standard BLP-style approaches are provided, leaving the conduct-test validity unverified.

    Authors: Section 6.1 presents the inversion and reports markup distributions, but we agree that a more explicit identification argument and instrument list are needed. In the revision we will expand this section to (i) state the identifying assumptions (cost shifters and rival characteristics as in BLP 1995), (ii) list the exact instruments employed, and (iii) add a side-by-side comparison of recovered markups and conduct-test statistics against a standard random-coefficients BLP specification estimated on the same data. revision: yes
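
The inversion at issue in this exchange is, in its textbook form, the multi-product Bertrand-Nash first-order condition solved for marginal cost. The statement below is the standard version and may differ from the paper's exact setup; the symbols are assumptions introduced here: $s$ is the share vector, $p$ the price vector, $c$ the marginal-cost vector, $J$ the demand Jacobian with $J_{kj} = \partial s_k / \partial p_j$, and $\Omega$ the conduct (ownership) matrix with $\Omega_{jk} = 1$ when products $j$ and $k$ are priced jointly.

```latex
\begin{aligned}
&\text{First-order condition for product } j:\quad
s_j + \sum_{k} \Omega_{jk}\,(p_k - c_k)\,\frac{\partial s_k}{\partial p_j} = 0 \\[4pt]
&\text{Stacked:}\quad
s + \bigl(\Omega \circ J^{\top}\bigr)(p - c) = 0 \\[4pt]
&\text{Implied markups and costs:}\quad
p - c = -\bigl(\Omega \circ J^{\top}\bigr)^{-1} s,
\qquad
c = p + \bigl(\Omega \circ J^{\top}\bigr)^{-1} s
\end{aligned}
```

Here $\circ$ is the element-wise product. A conduct test in this framework amounts to comparing the costs implied by alternative choices of $\Omega$, for example full joint pricing of the assortment versus product-by-product pricing, which is why the referee's request for explicit identifying assumptions matters: the recovered costs are only as credible as the estimated Jacobian.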

Circularity Check

0 steps flagged

No circularity detected in derivation chain

Full rationale

The paper describes an empirical pipeline that begins with external H&M transaction records (images, text, prices, demographics) and a pre-trained Fashion CLIP model, applies a three-tower fine-tuning step, and then feeds the resulting embeddings into a latent-class deep demand system for estimation of substitution patterns and heterogeneity. No equations, self-citations, or fitted-parameter renamings are shown that would make any claimed prediction (rich substitution, markups, hedonic indices, or event-study responses) equivalent to its inputs by construction. All steps rely on independent data sources and external performance benchmarks, rendering the chain self-contained.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

Abstract-only review prevents exhaustive identification; the central modeling step rests on the assumption that the fine-tuned embeddings preserve relevant aesthetic information.

free parameters (1)
  • number of latent classes
    Latent-class models require selecting or fitting the number of consumer segments, which is typically data-driven.
axioms (1)
  • domain assumption: Fine-tuned Fashion CLIP embeddings via three-tower architecture capture the aesthetic features that drive consumer choice.
    This premise enables the downstream demand system and heterogeneity claims.

pith-pipeline@v0.9.0 · 5775 in / 1183 out tokens · 43760 ms · 2026-05-24T01:23:53.490991+00:00 · methodology

