arxiv: 2605.09827 · v1 · submitted 2026-05-11 · 💻 cs.CV · cs.AI

Fashion Florence: Fine-Tuning Florence-2 for Structured Fashion Attribute Extraction

Pith reviewed 2026-05-12 05:06 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords fashion attribute extraction · LoRA fine-tuning · vision-language model · structured JSON output · Florence-2 · iMaterialist Fashion dataset · clothing image analysis

The pith

Fine-tuning Florence-2 on collapsed iMaterialist labels produces structured JSON fashion attributes with 94.6% category accuracy.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper demonstrates how to adapt Florence-2, a vision-language model, to clothing images by first collapsing the iMaterialist Fashion dataset's 228 fine-grained labels into a compact schema of 6 categories, 16 colors, and 19 styles through rule-based engineering. LoRA is then applied to the decoder layers and the model is trained for three epochs on 3,688 examples so that a single photograph yields a ready-to-use JSON object listing category, color, material, style, and occasion. This approach delivers higher accuracy than zero-shot calls to GPT-4o-mini or Gemini 2.5 Flash while emitting syntactically valid JSON in 99.8 percent of cases. The resulting 0.77-billion-parameter model runs on a single GPU at negligible extra cost and has already been deployed inside an open-source outfit recommender. The work shows that targeted, data-efficient adaptation of an open model can outperform much larger general-purpose systems on a narrow extraction task whose outputs are directly consumable by downstream software.

Core claim

Fine-tuning Florence-2 with LoRA (r=16, alpha=32) on 3,688 images whose labels have been rule-collapsed from the iMaterialist Fashion dataset into a 6-category, 16-color, 19-style schema yields a model that, given one clothing photograph, emits a JSON object containing category, color, material, style, and occasion tags. On a held-out test set of 461 images the model records 94.6% category accuracy, 63.0% material accuracy, and 0.753 style-tag F1, while producing valid JSON in 99.8% of outputs. The abstract reports lower figures for both zero-shot baselines wherever it gives them: 89.3% category accuracy, 43.3% material accuracy, and 0.398 style-tag F1 for GPT-4o-mini, and 87.4% (a single accuracy figure, presumably category) and 0.612 style-tag F1 for Gemini 2.5 Flash.
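
The two kinds of headline metric in this claim can be sketched as follows. This is a minimal illustration, not the paper's evaluation code: the source does not say whether style-tag F1 is micro- or macro-averaged, so micro-averaging is an assumption here, and the function names and example tags are hypothetical.

```python
from typing import Sequence, Set


def category_accuracy(gold: Sequence[str], pred: Sequence[str]) -> float:
    """Fraction of images whose predicted category equals the reference label."""
    if len(gold) != len(pred) or not gold:
        raise ValueError("gold and pred must be non-empty and equally long")
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def micro_f1(gold: Sequence[Set[str]], pred: Sequence[Set[str]]) -> float:
    """Micro-averaged F1 over per-image tag sets (assumed averaging scheme)."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must be equally long")
    tp = sum(len(g & p) for g, p in zip(gold, pred))  # tags predicted and present
    fp = sum(len(p - g) for g, p in zip(gold, pred))  # tags predicted but absent
    fn = sum(len(g - p) for g, p in zip(gold, pred))  # tags present but missed
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical labels, for illustration only.
gold_tags = [{"casual", "streetwear"}, {"formal"}]
pred_tags = [{"casual"}, {"formal", "minimalist"}]
print(category_accuracy(["top", "dress"], ["top", "bag"]))
print(micro_f1(gold_tags, pred_tags))
```

A macro-averaged variant would compute F1 per style tag and then average, which weights rare tags more heavily; the two can differ noticeably on a 19-style schema with skewed tag frequencies.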

What carries the argument

LoRA adaptation of all decoder linear layers in Florence-2, trained to map images to a fixed compact JSON schema obtained by rule-based collapse of the original 228 iMaterialist labels.
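
To make the cost of this adaptation concrete, the sketch below counts the parameters LoRA adds to one linear layer and the scaling applied to its update, using the standard LoRA formulation (update scaled by alpha / r). The values r=16 and alpha=32 come from the paper; the 1024 x 1024 layer size is a hypothetical stand-in, not a Florence-2 dimension taken from the source.

```python
def lora_added_params(d_in: int, d_out: int, r: int) -> int:
    """Parameters in the two low-rank factors: A is r x d_in, B is d_out x r."""
    return r * (d_in + d_out)


def lora_scale(alpha: float, r: int) -> float:
    """Multiplier applied to the low-rank update B @ A in standard LoRA."""
    return alpha / r


R, ALPHA = 16, 32        # reported in the paper
D_IN = D_OUT = 1024      # hypothetical layer width, for illustration only

added = lora_added_params(D_IN, D_OUT, R)
frozen = D_IN * D_OUT    # weights of the frozen base layer
print(f"scale applied to the update: {lora_scale(ALPHA, R)}")
print(f"trainable: {added:,} vs frozen: {frozen:,} ({added / frozen:.2%})")
```

Under these assumptions each adapted layer trains a few percent of the weights it wraps, which is why the adaptation stays small relative to the 0.77B-parameter base model.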

If this is right

  • The 0.77B model can be run on a single GPU with zero marginal inference cost beyond loading the base weights.
  • Valid JSON output in 99.8% of cases allows near-direct piping into recommendation, retrieval, or inventory systems, with downstream code needing only a fallback for the rare malformed output.
  • Style-tag F1 and material accuracy exceed the figures reported for much larger proprietary models, showing that domain-specific adaptation improves narrow attribute accuracy.
  • The same fine-tuned weights have already been integrated into an open-source outfit recommender called Loom.
  • Because only decoder linear layers receive LoRA updates, the adaptation remains lightweight and reproducible on modest hardware.
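
A 99.8% valid-JSON rate still leaves the occasional malformed output, so a consumer would plausibly guard the parse step. The sketch below shows one way to do that with the standard library. The five field names follow the attributes listed in the paper, but their exact key spelling, the value types, and the function name are assumptions.

```python
import json
from typing import Optional

# Assumed key names; the paper lists these attributes but not the exact keys.
REQUIRED_FIELDS = ("category", "color", "material", "style", "occasion")


def parse_attributes(raw: str) -> Optional[dict]:
    """Return the parsed attribute object, or None if the output is unusable."""
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError:
        return None  # syntactically invalid JSON
    if not isinstance(obj, dict):
        return None  # valid JSON, but not an object
    if any(field not in obj for field in REQUIRED_FIELDS):
        return None  # object is missing an expected attribute
    return obj


# Hypothetical model outputs, for illustration only.
good = ('{"category": "top", "color": "blue", "material": "denim", '
        '"style": ["casual"], "occasion": ["everyday"]}')
truncated = '{"category": "top", "color": "blue"'
print(parse_attributes(good))
print(parse_attributes(truncated))
```

Returning None rather than raising lets the caller decide whether to retry generation, skip the image, or fall back to a default record.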

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same label-collapsing plus LoRA recipe could be reused for other visual domains that require structured attribute extraction, such as furniture or product photography.
  • For repetitive, narrow-scope extraction tasks, a small domain-adapted model may deliver a better accuracy-to-cost ratio than repeated calls to frontier general models.
  • If the schema collapse proves unbiased on new data, the method supplies a practical template for turning any large attribute-rich dataset into a deployable structured extractor.

Load-bearing premise

The rule-based collapse of 228 fine-grained labels into the compact 6-category, 16-color, 19-style schema preserves semantic distinctions without introducing systematic bias or information loss.
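
The paper does not publish its mapping rules, so the following is only a sketch of what a rule-based collapse and a basic audit of it might look like. Every keyword, label, and target below is hypothetical; the point is that merges (here, suede folded into leather) and unmapped labels can be listed mechanically and then reviewed.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Sequence, Tuple

# Hypothetical keyword -> target-material rules, not the paper's actual mapping.
MATERIAL_RULES: Dict[str, str] = {
    "denim": "denim",
    "jean": "denim",
    "leather": "leather",
    "suede": "leather",
}


def collapse_material(fine_label: str) -> Optional[str]:
    """Map a fine-grained label to a coarse material, or None if no rule fires."""
    text = fine_label.lower()
    for keyword, target in MATERIAL_RULES.items():
        if keyword in text:
            return target
    return None


def audit_collapse(fine_labels: Sequence[str]) -> Tuple[Dict[str, List[str]], List[str]]:
    """Group source labels by target and collect the labels no rule covers."""
    merged: Dict[str, List[str]] = defaultdict(list)
    unmapped: List[str] = []
    for label in fine_labels:
        target = collapse_material(label)
        if target is None:
            unmapped.append(label)
        else:
            merged[target].append(label)
    return dict(merged), unmapped


merged, unmapped = audit_collapse(
    ["Denim", "Distressed Jeans", "Faux Leather", "Suede", "Chiffon"]
)
print(merged)
print(unmapped)
```

An audit table of this shape is what a reader would need in order to judge whether semantically distinct labels were merged in ways that flatter the reported accuracies.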

What would settle it

Human re-annotation of the 461 test images using the collapsed schema, followed by re-scoring the model against those fresh labels, would reveal whether the reported accuracies depend on artifacts of the original label mapping.
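
The re-scoring described here amounts to comparing three label lists per attribute: the model's predictions, the original rule-collapsed labels, and the fresh human labels. A minimal sketch, assuming single-label attributes such as category and using hypothetical names and data:

```python
from typing import Dict, Sequence


def rescore(pred: Sequence[str],
            original: Sequence[str],
            reannotated: Sequence[str]) -> Dict[str, float]:
    """Compare model accuracy under both label sets, plus their agreement."""
    n = len(pred)
    if n == 0 or len(original) != n or len(reannotated) != n:
        raise ValueError("all three label lists must be non-empty and equally long")
    return {
        # Accuracy as reported: predictions vs rule-collapsed labels.
        "acc_vs_original": sum(p == o for p, o in zip(pred, original)) / n,
        # Accuracy against the fresh human annotation.
        "acc_vs_reannotated": sum(p == h for p, h in zip(pred, reannotated)) / n,
        # How often the rule-collapsed label matches the human one.
        "label_agreement": sum(o == h for o, h in zip(original, reannotated)) / n,
    }


# Hypothetical labels, for illustration only.
result = rescore(
    pred=["top", "dress", "top", "shoes"],
    original=["top", "dress", "top", "bag"],
    reannotated=["top", "dress", "outerwear", "bag"],
)
print(result)
```

A large gap between the two accuracies, or low label agreement, would suggest that the reported numbers depend on the mapping rather than on attribute extraction itself.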

Original abstract

We present Fashion Florence, a Florence-2 vision-language model fine-tuned with LoRA to extract structured fashion attributes from clothing images. Given a single photograph, the model generates a JSON object containing category, color, material, style tags, and occasion tags, structured output suitable for direct programmatic consumption by downstream recommendation and retrieval systems. Fine-tuning data is derived from the iMaterialist Fashion dataset (228 labels), where we collapse fine-grained annotations into a compact 6-category, 16-color, 19-style schema via rule-based label engineering. We apply LoRA (r=16, alpha=32) to all decoder linear layers, training for 3 epochs on 3,688 examples. On a held-out test set of 461 images, Fashion Florence achieves 94.6% category accuracy and 63.0% material accuracy, compared to 89.3% / 43.3% for GPT-4o-mini and 87.4% for Gemini 2.5 Flash. Fashion Florence produces valid JSON in 99.8% of outputs while running at 0.77B parameters on a single GPU at zero marginal inference cost. Style tag F1 reaches 0.753 vs. 0.612 (Gemini) and 0.398 (GPT-4o-mini). The model is deployed as a Hugging Face Space and integrated into Loom, an open-source outfit recommendation system.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The manuscript introduces Fashion Florence, a Florence-2 vision-language model fine-tuned with LoRA to extract structured fashion attributes (category, color, material, style, occasion) from single clothing images and output them as JSON. Fine-tuning data is created by collapsing the 228 fine-grained labels from the iMaterialist Fashion dataset into a compact schema (6 categories, 16 colors, 19 styles) via rule-based label engineering. Training uses LoRA (r=16, alpha=32) on decoder layers for 3 epochs on 3,688 examples. On a held-out test set of 461 images, the model reports 94.6% category accuracy, 63.0% material accuracy, 0.753 style F1, and 99.8% valid JSON outputs, outperforming GPT-4o-mini and Gemini 2.5 Flash while running at 0.77B parameters.

Significance. If the performance claims hold after clarifying the label mapping, the work would show that parameter-efficient fine-tuning of a compact VLM can deliver structured, programmatically usable outputs that exceed larger general-purpose models on a domain-specific task. The high valid-JSON rate, single-GPU inference, and open deployment as a Hugging Face Space integrated with the Loom recommendation system add concrete practical value for fashion retrieval and recommendation pipelines.

major comments (1)
  1. [Abstract / data-preparation section] The rule-based collapse of the original 228 iMaterialist labels into the 6-category/16-color/19-style schema is described only as 'rule-based label engineering', with no explicit rules, examples, consistency checks, or human validation provided. All headline metrics (94.6% category accuracy, 63.0% material accuracy, 0.753 style F1 on the 461-image held-out set) are computed exclusively against this collapsed schema; without the mapping details it is impossible to determine whether the reported gains reflect genuine attribute extraction or artifacts of how semantically distinct fine-grained labels were merged or reassigned.
minor comments (2)
  1. [Experiments section] The construction of the 461-image held-out test set is not described (sampling method, stratification, or overlap with training data), nor is any error analysis or qualitative failure cases provided to support the quantitative claims.
  2. [Training details] Training hyperparameters beyond LoRA rank/alpha and epoch count (optimizer, learning rate, loss, batch size, hardware) are omitted, reducing reproducibility.
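
The first minor comment asks how the held-out set was drawn and whether it overlaps the training data. The paper does not say, so the sketch below only illustrates one reproducible option: a seeded, category-stratified split with an explicit ID-overlap check. The function name, the split fraction, and the image IDs are all hypothetical.

```python
import random
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def stratified_split(items: Sequence[Tuple[str, str]],
                     test_fraction: float,
                     seed: int = 0) -> Tuple[List[str], List[str]]:
    """Split (image_id, category) pairs so each category keeps its proportion."""
    by_category: Dict[str, List[str]] = defaultdict(list)
    for image_id, category in items:
        by_category[category].append(image_id)

    rng = random.Random(seed)  # fixed seed makes the split reproducible
    train: List[str] = []
    test: List[str] = []
    for category in sorted(by_category):
        ids = sorted(by_category[category])  # canonical order before shuffling
        rng.shuffle(ids)
        n_test = round(len(ids) * test_fraction)
        test.extend(ids[:n_test])
        train.extend(ids[n_test:])
    return train, test


# Hypothetical data: 20 images, two categories.
items = [(f"img{i:03d}", "top" if i % 2 == 0 else "dress") for i in range(20)]
train_ids, test_ids = stratified_split(items, test_fraction=0.2)
overlap = set(train_ids) & set(test_ids)
print(len(train_ids), len(test_ids), len(overlap))
```

An ID-level check like this catches only exact reuse; near-duplicate photographs of the same garment would need an image-similarity check on top.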

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive and detailed feedback. We agree that greater transparency in the label-mapping process is necessary for readers to properly interpret the reported metrics and will revise the manuscript accordingly.

Point-by-point responses
  1. Referee: [Abstract / data-preparation section] The rule-based collapse of the original 228 iMaterialist labels into the 6-category/16-color/19-style schema is described only as 'rule-based label engineering', with no explicit rules, examples, consistency checks, or human validation provided. All headline metrics (94.6% category accuracy, 63.0% material accuracy, 0.753 style F1 on the 461-image held-out set) are computed exclusively against this collapsed schema; without the mapping details it is impossible to determine whether the reported gains reflect genuine attribute extraction or artifacts of how semantically distinct fine-grained labels were merged or reassigned.

    Authors: We agree that the manuscript's description of the label engineering is too brief and that this omission hinders evaluation of whether the collapsed schema preserves semantic distinctions or introduces artifacts. In the revised version we will add a dedicated subsection (and accompanying table in the appendix) that explicitly documents the full rule-based mappings: for each of the 6 categories, 16 colors, 19 styles, and the material/occasion fields we will list every original iMaterialist label that was reassigned to the target label, together with the decision rule applied (e.g., “all denim-related fine-grained labels map to ‘denim’ under material”). We will also include representative before-and-after examples, describe the automated consistency checks that were performed during engineering, and state that no additional human validation step was conducted beyond the rule-based procedure. These additions will allow readers to verify the integrity of the schema and to judge whether the performance differences versus GPT-4o-mini and Gemini 2.5 Flash reflect genuine attribute-extraction capability on the target structured output. The headline metrics themselves will remain unchanged because they are computed against the intended downstream schema; the revision simply makes the mapping transparent.
    Revision: yes

Circularity Check

0 steps flagged

No significant circularity; standard supervised fine-tuning and held-out evaluation

Full rationale

The paper performs standard LoRA fine-tuning of Florence-2 on preprocessed iMaterialist data (collapsed via rule-based engineering into 6/16/19 schema) and reports accuracy on a held-out test set of 461 images against external baselines. No equations, self-citations, or fitted parameters are invoked in a load-bearing way that reduces any claimed prediction or result to the inputs by construction. The derivation chain consists of ordinary ML training and evaluation steps that remain independent of the reported metrics.

Axiom & Free-Parameter Ledger

3 free parameters · 0 axioms · 0 invented entities

The central claim rests on empirical results after standard LoRA adaptation; the only notable free parameters are the chosen LoRA rank, alpha, and epoch count, which are typical hyperparameters rather than invented quantities.

free parameters (3)
  • LoRA rank r
    Hyperparameter controlling adaptation rank, set to 16.
  • LoRA alpha
    Scaling factor for LoRA updates, set to 32.
  • Training epochs
    Number of training passes, set to 3.


