Fashion Florence: Fine-Tuning Florence-2 for Structured Fashion Attribute Extraction

Anushree Berlia

arxiv: 2605.09827 · v1 · submitted 2026-05-11 · 💻 cs.CV · cs.AI

Pith reviewed 2026-05-12 05:06 UTC · model grok-4.3

keywords: fashion attribute extraction · LoRA fine-tuning · vision-language model · structured JSON output · Florence-2 · iMaterialist Fashion dataset · clothing image analysis

The pith

Fine-tuning Florence-2 on collapsed iMaterialist labels produces structured JSON fashion attributes with 94.6% category accuracy.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper demonstrates how to adapt Florence-2, a vision-language model, to clothing images by first collapsing the iMaterialist Fashion dataset's 228 fine-grained labels into a compact schema of 6 categories, 16 colors, and 19 styles through rule-based engineering. LoRA is then applied to the decoder layers and the model is trained for three epochs on 3,688 examples so that a single photograph yields a ready-to-use JSON object listing category, color, material, style, and occasion. This approach delivers higher accuracy than zero-shot calls to GPT-4o-mini or Gemini 2.5 Flash while emitting syntactically valid JSON in 99.8 percent of cases. The resulting 0.77-billion-parameter model runs on a single GPU at negligible extra cost and has already been deployed inside an open-source outfit recommender. The work shows that targeted, data-efficient adaptation of an open model can outperform much larger general-purpose systems on a narrow extraction task whose outputs are directly consumable by downstream software.

Core claim

Fine-tuning Florence-2 with LoRA (r=16, alpha=32) on 3,688 images whose labels have been rule-collapsed from the iMaterialist Fashion dataset into a 6-category, 16-color, 19-style schema yields a model that, given one clothing photograph, emits a JSON object containing category, color, material, style, and occasion tags. On a held-out test set of 461 images the model records 94.6% category accuracy, 63.0% material accuracy, and 0.753 style-tag F1, while producing valid JSON in 99.8% of outputs. The abstract's comparison figures are 89.3% category and 43.3% material accuracy for GPT-4o-mini, a single accuracy figure of 87.4% for Gemini 2.5 Flash (no material accuracy is quoted for it), and style-tag F1 of 0.612 for Gemini and 0.398 for GPT-4o-mini.
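Because the test set has only 461 images, it may help to put an uncertainty band around the headline accuracy. The sketch below computes a Wilson score interval for a binomial proportion; the choice of a Wilson interval and the 95% level (z = 1.96) are assumptions made here for illustration, not something the paper reports.

```python
import math
from typing import Tuple

def wilson_interval(p_hat: float, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion p_hat observed on n trials."""
    denom = 1.0 + z * z / n
    center = (p_hat + z * z / (2 * n)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half

# Reported category accuracy on the 461-image held-out set.
low, high = wilson_interval(0.946, 461)
print(f"category accuracy 95% interval: {low:.3f} to {high:.3f}")
```

Under these assumptions the interval spans roughly 92% to 96%, so the reported 89.3% for GPT-4o-mini sits below its lower end, although the baseline figure carries sampling noise of its own.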

What carries the argument

LoRA adaptation of all decoder linear layers in Florence-2, trained to map images to a fixed compact JSON schema obtained by rule-based collapse of the original 228 iMaterialist labels.
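For readers unfamiliar with LoRA, the update applied to each adapted linear layer can be written out as follows. This is the standard formulation from the LoRA paper (reference [4]); it assumes the conventional alpha/r scaling, which the abstract does not state explicitly, and the symbols W_0, A, B, d_in, and d_out are notation introduced here.

```latex
h = W_0 x + \frac{\alpha}{r}\, B A x,
\qquad B \in \mathbb{R}^{d_{\mathrm{out}} \times r},
\quad A \in \mathbb{R}^{r \times d_{\mathrm{in}}}

\text{With } r = 16,\ \alpha = 32: \qquad \frac{\alpha}{r} = \frac{32}{16} = 2
```

Only A and B are trained while W_0 stays frozen, which is why the adaptation is lightweight.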

If this is right

The 0.77B model can be run on a single GPU with zero marginal inference cost beyond loading the base weights.
Valid JSON output in 99.8% of cases allows near-direct piping into recommendation, retrieval, or inventory systems, with only a lightweight validity check needed for the rare malformed output.
Style-tag F1 and material accuracy exceed the figures reported for much larger proprietary models, showing that domain-specific adaptation improves narrow attribute accuracy.
The same fine-tuned weights have already been integrated into an open-source outfit recommender called Loom.
Because only decoder linear layers receive LoRA updates, the adaptation remains lightweight and reproducible on modest hardware.
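Even at a 99.8% valid-JSON rate, a downstream consumer would probably still guard against the remaining outputs. The sketch below shows one way such a guard might look. The field names, the types, and the six category values are hypothetical placeholders, since the exact schema vocabulary is not given here.

```python
import json
from typing import Optional

# Hypothetical field names and types; the paper's exact schema is an assumption here.
REQUIRED_FIELDS = {
    "category": str,
    "color": str,
    "material": str,
    "style": list,
    "occasion": list,
}
# Illustrative stand-ins for the 6 collapsed categories (not the paper's actual list).
ALLOWED_CATEGORIES = {"top", "bottom", "dress", "outerwear", "shoes", "accessory"}

def parse_attributes(raw: str) -> Optional[dict]:
    """Return the parsed attribute dict, or None if the output is malformed."""
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(obj, dict):
        return None
    for field, expected_type in REQUIRED_FIELDS.items():
        # A missing field yields None from .get(), which fails the type check.
        if not isinstance(obj.get(field), expected_type):
            return None
    if obj["category"] not in ALLOWED_CATEGORIES:
        return None
    return obj

good = ('{"category": "top", "color": "blue", "material": "denim", '
        '"style": ["casual"], "occasion": ["everyday"]}')
truncated = '{"category": "top", "color":'
```

Calling `parse_attributes(good)` returns the dict, while `parse_attributes(truncated)` returns `None`, so a caller can retry or skip the item instead of crashing.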

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same label-collapsing plus LoRA recipe could be reused for other visual domains that require structured attribute extraction, such as furniture or product photography.
For repetitive, narrow-scope extraction tasks, a small domain-adapted model may deliver a better accuracy-to-cost ratio than repeated calls to frontier general models.
If the schema collapse proves unbiased on new data, the method supplies a practical template for turning any large attribute-rich dataset into a deployable structured extractor.

Load-bearing premise

The rule-based collapse of 228 fine-grained labels into the compact 6-category, 16-color, 19-style schema preserves semantic distinctions without introducing systematic bias or information loss.
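To make the premise concrete, here is a minimal sketch of what a rule-based collapse could look like. Every rule in it is hypothetical: the paper's actual mapping is not published in the text reviewed here, so the label names below are illustrative only.

```python
from collections import Counter
from typing import Dict, List, Optional

# Hypothetical rules: fine-grained source label -> collapsed category.
CATEGORY_RULES: Dict[str, str] = {
    "t-shirt": "top",
    "blouse": "top",
    "jeans": "bottom",
    "skirt": "bottom",
    "trench coat": "outerwear",
}

def collapse_category(fine_labels: List[str]) -> Optional[str]:
    """Map one image's fine-grained labels to a single collapsed category.

    Returns None when no rule fires or when rules disagree, so ambiguous
    images can be dropped or reviewed rather than silently mislabeled.
    """
    targets = {CATEGORY_RULES[label] for label in fine_labels if label in CATEGORY_RULES}
    return targets.pop() if len(targets) == 1 else None

def merge_report(rules: Dict[str, str]) -> Counter:
    """Count how many source labels feed each target, a rough proxy for information loss."""
    return Counter(rules.values())
```

A report like `merge_report(CATEGORY_RULES)` shows which targets absorb many source labels, which is where merged distinctions and systematic bias would be most likely to hide.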

What would settle it

Human re-annotation of the 461 test images using the collapsed schema, followed by re-scoring the model against those fresh labels, would reveal whether the reported accuracies depend on artifacts of the original label mapping.
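Re-scoring against fresh labels would need only two small metric functions. The sketch below assumes exact-match accuracy for single-valued fields and micro-averaged F1 over tag sets for style. The averaging scheme is an assumption, because the abstract does not say how its style-tag F1 is computed.

```python
from typing import List, Set

def accuracy(pred: List[str], gold: List[str]) -> float:
    """Fraction of examples whose predicted label exactly matches the gold label."""
    return sum(p == g for p, g in zip(pred, gold)) / len(gold)

def micro_f1(pred: List[Set[str]], gold: List[Set[str]]) -> float:
    """Micro-averaged F1 over per-image tag sets (assumed averaging scheme)."""
    tp = sum(len(p & g) for p, g in zip(pred, gold))  # tags predicted and present
    fp = sum(len(p - g) for p, g in zip(pred, gold))  # tags predicted but absent
    fn = sum(len(g - p) for p, g in zip(pred, gold))  # tags present but missed
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

Running both functions once against the original collapsed labels and once against the human re-annotations would show how much of each score depends on the mapping.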

Original abstract

We present Fashion Florence, a Florence-2 vision-language model fine-tuned with LoRA to extract structured fashion attributes from clothing images. Given a single photograph, the model generates a JSON object containing category, color, material, style tags, and occasion tags, structured output suitable for direct programmatic consumption by downstream recommendation and retrieval systems. Fine-tuning data is derived from the iMaterialist Fashion dataset (228 labels), where we collapse fine-grained annotations into a compact 6-category, 16-color, 19-style schema via rule-based label engineering. We apply LoRA (r=16, alpha=32) to all decoder linear layers, training for 3 epochs on 3,688 examples. On a held-out test set of 461 images, Fashion Florence achieves 94.6% category accuracy and 63.0% material accuracy, compared to 89.3% / 43.3% for GPT-4o-mini and 87.4% for Gemini 2.5 Flash. Fashion Florence produces valid JSON in 99.8% of outputs while running at 0.77B parameters on a single GPU at zero marginal inference cost. Style tag F1 reaches 0.753 vs. 0.612 (Gemini) and 0.398 (GPT-4o-mini). The model is deployed as a Hugging Face Space and integrated into Loom, an open-source outfit recommendation system.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

This is a practical LoRA fine-tune of Florence-2 on collapsed iMaterialist labels that beats a couple of commercial models on structured fashion output, but the label mapping lacks any validation or transparency.

The core of this paper is a standard application of LoRA to Florence-2 for turning clothing photos into JSON with category, color, material, style, and occasion fields. They start from the iMaterialist dataset, collapse its 228 fine-grained labels into a much smaller schema (6 categories, 16 colors, 19 styles), train for three epochs, and report 94.6% category accuracy and 63% material accuracy on 461 held-out images, plus 99.8% valid JSON. That beats GPT-4o-mini and Gemini 2.5 Flash on the same collapsed labels while running locally at 0.77B parameters. The deployment as a Hugging Face Space and integration into an outfit recommender is a nice touch for anyone who actually needs to run this in production without API costs.

What stands out is the concrete comparison and the emphasis on structured, parseable output rather than free text. That part is useful for e-commerce pipelines. The rest is routine domain adaptation with no new method or theoretical angle.

The soft spot is the label collapse itself. The abstract calls it rule-based engineering but gives no rules, examples, or human checks for whether distinct materials or styles got merged in ways that distort the task. Since the baselines are scored on the same collapsed targets, the relative gains might hold, but the absolute numbers become harder to trust for downstream use. There is also no error analysis or description of how the 461-image test split was made. Those gaps are real but not fatal for an applied note.

This paper is mainly for practitioners who want a small, open model that reliably spits out fashion JSON without calling big APIs. Core ML readers will find little to engage with. It deserves a serious referee in an applied or industry track, provided the authors add the missing mapping details and at least basic error breakdown. I would send it out rather than desk reject.

Referee Report

1 major / 2 minor

Summary. The manuscript introduces Fashion Florence, a Florence-2 vision-language model fine-tuned with LoRA to extract structured fashion attributes (category, color, material, style, occasion) from single clothing images and output them as JSON. Fine-tuning data is created by collapsing the 228 fine-grained labels from the iMaterialist Fashion dataset into a compact schema (6 categories, 16 colors, 19 styles) via rule-based label engineering. Training uses LoRA (r=16, alpha=32) on decoder layers for 3 epochs on 3,688 examples. On a held-out test set of 461 images, the model reports 94.6% category accuracy, 63.0% material accuracy, 0.753 style F1, and 99.8% valid JSON outputs, outperforming GPT-4o-mini and Gemini 2.5 Flash while running at 0.77B parameters.

Significance. If the performance claims hold after clarifying the label mapping, the work would show that parameter-efficient fine-tuning of a compact VLM can deliver structured, programmatically usable outputs that exceed larger general-purpose models on a domain-specific task. The high valid-JSON rate, single-GPU inference, and open deployment as a Hugging Face Space integrated with the Loom recommendation system add concrete practical value for fashion retrieval and recommendation pipelines.

major comments (1)

[Abstract / data-preparation section] The rule-based collapse of the original 228 iMaterialist labels into the 6-category/16-color/19-style schema is described only as 'rule-based label engineering' with no explicit rules, examples, consistency checks, or human validation provided. All headline metrics (94.6% category accuracy, 63.0% material accuracy, 0.753 style F1 on the 461-image held-out set) are computed exclusively against this collapsed schema; without the mapping details it is impossible to determine whether the reported gains reflect genuine attribute extraction or artifacts of how semantically distinct fine-grained labels were merged or reassigned.

minor comments (2)

[Experiments section] The construction of the 461-image held-out test set is not described (sampling method, stratification, or overlap with training data), nor is any error analysis or qualitative failure cases provided to support the quantitative claims.
[Training details] Training hyperparameters beyond LoRA rank/alpha and epoch count (optimizer, learning rate, loss, batch size, hardware) are omitted, reducing reproducibility.
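One way to make split construction reproducible and overlap-free, in the spirit of the first minor comment, is to assign each image to a split by hashing its identifier. The sketch below is an illustration only: the function name, the use of SHA-256, and the 10% test fraction are assumptions, and nothing here describes how the paper's 461-image set was actually built. It also does not stratify by class.

```python
import hashlib

def assign_split(image_id: str, test_fraction: float = 0.1) -> str:
    """Deterministically assign an image to 'train' or 'test' from its ID alone."""
    digest = hashlib.sha256(image_id.encode("utf-8")).hexdigest()
    # The first 32 bits of the hash, scaled to a value in [0, 1).
    bucket = int(digest[:8], 16) / 0x100000000
    return "test" if bucket < test_fraction else "train"

# Hypothetical image IDs for illustration.
ids = [f"img_{i:05d}" for i in range(1000)]
train_ids = {i for i in ids if assign_split(i) == "train"}
test_ids = {i for i in ids if assign_split(i) == "test"}
```

Since the assignment depends only on the ID, the same image can never land in both splits, and anyone with the ID list can reconstruct the partition.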

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive and detailed feedback. We agree that greater transparency in the label-mapping process is necessary for readers to properly interpret the reported metrics and will revise the manuscript accordingly.

Referee: [Abstract / data-preparation section] The rule-based collapse of the original 228 iMaterialist labels into the 6-category/16-color/19-style schema is described only as 'rule-based label engineering' with no explicit rules, examples, consistency checks, or human validation provided. All headline metrics (94.6% category accuracy, 63.0% material accuracy, 0.753 style F1 on the 461-image held-out set) are computed exclusively against this collapsed schema; without the mapping details it is impossible to determine whether the reported gains reflect genuine attribute extraction or artifacts of how semantically distinct fine-grained labels were merged or reassigned.

Authors: We agree that the manuscript's description of the label engineering is too brief and that this omission hinders evaluation of whether the collapsed schema preserves semantic distinctions or introduces artifacts. In the revised version we will add a dedicated subsection (and accompanying table in the appendix) that explicitly documents the full rule-based mappings: for each of the 6 categories, 16 colors, 19 styles, and the material/occasion fields we will list every original iMaterialist label that was reassigned to the target label, together with the decision rule applied (e.g., “all denim-related fine-grained labels map to ‘denim’ under material”). We will also include representative before-and-after examples, describe the automated consistency checks that were performed during engineering, and state that no additional human validation step was conducted beyond the rule-based procedure. These additions will allow readers to verify the integrity of the schema and to judge whether the performance differences versus GPT-4o-mini and Gemini 2.5 Flash reflect genuine attribute-extraction capability on the target structured output. The headline metrics themselves will remain unchanged because they are computed against the intended downstream schema; the revision simply makes the mapping transparent.

revision: yes
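The "automated consistency checks" mentioned in the rebuttal are not specified, so the following is only a guess at what a minimal version might verify: that every rule points at a label inside the target schema, and that every target label is reachable from at least one source label. The function name and the toy rules are hypothetical.

```python
from typing import Dict, List, Set

def check_mapping(rules: Dict[str, str], allowed_targets: Set[str]) -> List[str]:
    """Return a list of human-readable problems found in a source -> target mapping."""
    problems = []
    for source, target in rules.items():
        if target not in allowed_targets:
            problems.append(f"{source!r} maps to unknown target {target!r}")
    # Targets that no source label feeds would never appear in training data.
    for target in sorted(allowed_targets - set(rules.values())):
        problems.append(f"target {target!r} has no source labels")
    return problems

# Hypothetical material rules, including the denim example quoted in the rebuttal.
material_rules = {"washed denim": "denim", "raw denim": "denim", "tweed": "wool"}
issues = check_mapping(material_rules, {"denim", "cotton"})
```

Here `issues` flags both the out-of-schema target `'wool'` and the unreachable target `'cotton'`; an empty list would indicate the mapping is at least internally consistent, though not that it is semantically sound.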

Circularity Check

0 steps flagged

No significant circularity; standard supervised fine-tuning and held-out evaluation

The paper performs standard LoRA fine-tuning of Florence-2 on preprocessed iMaterialist data (collapsed via rule-based engineering into 6/16/19 schema) and reports accuracy on a held-out test set of 461 images against external baselines. No equations, self-citations, or fitted parameters are invoked in a load-bearing way that reduces any claimed prediction or result to the inputs by construction. The derivation chain consists of ordinary ML training and evaluation steps that remain independent of the reported metrics.

Axiom & Free-Parameter Ledger

3 free parameters · 0 axioms · 0 invented entities

The central claim rests on empirical results after standard LoRA adaptation; the only notable free parameters are the chosen LoRA rank, alpha, and epoch count, which are typical hyperparameters rather than invented quantities.

free parameters (3)

LoRA rank r: Hyperparameter controlling adaptation rank, set to 16.
LoRA alpha: Scaling factor for LoRA updates, set to 32.
Training epochs: Number of training passes, set to 3.
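To see why the rank matters, it helps to count the parameters LoRA adds to a single linear layer: an r x d_in matrix A plus a d_out x r matrix B. The sketch below does that arithmetic. The layer width of 1024 is a hypothetical value chosen for illustration and is not taken from the Florence-2 architecture.

```python
def lora_extra_params(d_in: int, d_out: int, r: int) -> int:
    """Trainable parameters LoRA adds to one linear layer: A is r x d_in, B is d_out x r."""
    return r * (d_in + d_out)

# Hypothetical square layer; the real Florence-2 decoder widths are not assumed here.
d = 1024
full = d * d                                # parameters in the frozen weight matrix
extra = lora_extra_params(d, d, r=16)       # parameters trained by LoRA at rank 16
ratio = extra / full
print(extra, full, ratio)
```

For this illustrative layer the adapter trains about 3% as many parameters as the frozen weight matrix holds, and the count grows linearly with r.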

pith-pipeline@v0.9.0 · 5555 in / 1403 out tokens · 49192 ms · 2026-05-12T05:06:56.240352+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · tag: unclear
Paper passage: We collapse fine-grained annotations into a compact 6-category, 16-color, 19-style schema via rule-based label engineering.

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · reality_from_one_distinction · tag: unclear
Paper passage: We apply LoRA (r=16, alpha=32) to all decoder linear layers, training for 3 epochs on 3,688 examples.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

8 extracted references · 8 canonical work pages

[1] Chia, P. J., Attanasio, G., Bianchi, F., Terragni, S., Hung, A., Lucaselli, E., Gashteovski, K., Rossiello, G., Subramanian, N., and Zanzotto, R. FashionCLIP: Connecting language and images for product representations. arXiv preprint arXiv:2204.03972, 2022.

[2] Goenka, S., Zheng, Z., Jain, A., Srikumar, V., Annavaram, M., and Bui, T. FashionVLP: Vision language transformer for fashion retrieval with feedback. In CVPR, 2022.

[3] Guo, S., Huang, W., Zhang, X., Srikhanta, P., Cui, Y., Li, Y., Adam, H., Scott, M. R., and Belongie, S. The iMaterialist fashion attribute dataset. In CVPR Workshops, 2019.

[4] Hu, E. J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., and Chen, W. LoRA: Low-rank adaptation of large language models. In ICLR, 2022.

[5] Kumar, A., et al. Can GPT-4o mini and Gemini 2.0 Flash predict fine-grained fashion product attributes? A zero-shot analysis. arXiv preprint arXiv:2507.09950, 2025.

[6] Liu, Z., Luo, P., Qiu, S., Wang, X., and Tang, X. DeepFashion: Powering robust clothes recognition and retrieval with rich annotations. In CVPR, 2016.

[7] Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., Krueger, G., and Sutskever, I. Learning transferable visual models from natural language supervision. In ICML, 2021.

[8] Xiao, B., Wu, H., Xu, W., Dai, X., Hu, H., Lu, Y., Zeng, M., Liu, C., and Yuan, L. Florence-2: Advancing a unified representation for a variety of vision tasks. In CVPR, 2024.
