What Matters for Grocery Product Retrieval with Open Source Vision Language Models
Pith reviewed 2026-05-20 11:56 UTC · model grok-4.3
The pith
Data quality beats model size for grocery product retrieval accuracy.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
In zero-shot multimodal product retrieval on the GroceryVision Challenge, pre-training data quality matters more than model scale or architecture. Filtered datasets deliver accuracy gains of up to 16.6 percent, more than the gain from doubling parameters, and efficient models such as MobileCLIP-B can outperform larger counterparts trained on noisy data. State-of-the-art models reach 94.5 percent Recall@5 but drop 17.5 percent at Recall@1, because contrastive embeddings separate categories but do not reliably order near-identical SKUs.
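To make the Recall@1 versus Recall@5 distinction concrete, here is a minimal sketch of how Recall@K is commonly computed for embedding retrieval. It assumes each query has one correct gallery SKU and that ranking is by cosine similarity; the function names, labels, and toy vectors are illustrative and not taken from the paper's code.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def recall_at_k(query_embs, query_labels, gallery_embs, gallery_labels, k):
    """Fraction of queries whose correct SKU appears among the top-k gallery items."""
    hits = 0
    for q, label in zip(query_embs, query_labels):
        # Rank gallery indices from most to least similar to the query.
        ranked = sorted(range(len(gallery_embs)),
                        key=lambda i: cosine(q, gallery_embs[i]),
                        reverse=True)
        if label in {gallery_labels[i] for i in ranked[:k]}:
            hits += 1
    return hits / len(query_embs)

# Toy gallery (hypothetical): two near-identical cola SKUs and one unrelated product.
gallery_embs = [[1.0, 0.0], [0.95, 0.05], [0.0, 1.0]]
gallery_labels = ["cola_330ml", "cola_500ml", "oat_milk"]
# Both queries land in the cola cluster; the second is closest to the wrong SKU.
query_embs = [[1.0, 0.01], [1.0, 0.0]]
query_labels = ["cola_330ml", "cola_500ml"]

r1 = recall_at_k(query_embs, query_labels, gallery_embs, gallery_labels, k=1)
r2 = recall_at_k(query_embs, query_labels, gallery_embs, gallery_labels, k=2)
```

In this toy case the second query finds the right category but the wrong size at rank 1, so it counts as a hit only once K is widened. That is the same pattern the core claim describes, shown at a much smaller scale.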
What carries the argument
Semantic power density (φ), an efficiency metric that penalizes models whose accuracy falls below a chosen threshold, used to compare models across data quality, size, and resolution.
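The formula quoted from the paper further down this page, φ = (Recall@1 / (1 - Recall@1 + ε))² / N_params × 100, can be sketched in a few lines. Treating Recall@1 as a fraction in [0, 1], taking N_params in millions of parameters, and the default value of ε are all assumptions here, and the example numbers are invented for illustration rather than results from the paper.

```python
def semantic_power_density(recall_at_1, n_params_millions, eps=1e-6):
    """phi = (R / (1 - R + eps))**2 / N_params * 100, following the quoted formula.

    Assumptions: recall_at_1 is a fraction in [0, 1]; n_params_millions is the
    parameter count in millions; eps only guards against division by zero.
    """
    odds = recall_at_1 / (1.0 - recall_at_1 + eps)  # odds-like ratio of hits to misses
    return odds ** 2 / n_params_millions * 100.0

# Hypothetical models: the parameter counts echo the 150M vs 351M comparison
# in the abstract, but the Recall@1 values below are made up.
phi_small = semantic_power_density(0.75, 150)
phi_large = semantic_power_density(0.72, 351)
```

Under this reading, the ratio inside the square drops below 1 once Recall@1 falls under 0.5, so squaring shrinks the score further. That is one way the metric could penalize low-accuracy models, though the threshold the paper actually uses is not stated on this page.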
If this is right
- Retail deployments can gain more from curating training data than from increasing model size.
- Resource-limited settings can adopt smaller models such as MobileCLIP-B without sacrificing accuracy if clean data is used.
- Current contrastive training leaves a ranking gap that must be addressed for precise SKU-level decisions.
Where Pith is reading between the lines
- Data-curation pipelines may prove more cost-effective than continued parameter scaling for other fine-grained retail or inventory tasks.
- Hybrid systems that combine embedding retrieval with a second verification stage could close the observed Recall@1 gap.
- The same data-quality priority may apply when adapting these models to visually similar domains such as fashion or electronics catalogs.
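The hybrid idea in the second bullet above can be sketched as a retrieve-then-verify pipeline. This is only an illustration of the general pattern: `embed_score` and `verify_score` are hypothetical callables, and the toy verifier (matching a pack-size string) stands in for whatever second-stage model a real system might use.

```python
def retrieve_then_verify(query, gallery, embed_score, verify_score, k=5):
    """Stage 1: shortlist the top-k items by embedding score.
    Stage 2: re-order only that shortlist with a finer-grained verifier."""
    shortlist = sorted(gallery, key=lambda item: embed_score(query, item),
                       reverse=True)[:k]
    # sorted() is stable, so items the verifier scores equally keep their stage-1 order.
    return sorted(shortlist, key=lambda item: verify_score(query, item),
                  reverse=True)

def embed_score(query, item):
    """Toy stand-in for embedding similarity: closer scalar values score higher."""
    return -abs(query["emb"] - item["emb"])

def verify_score(query, item):
    """Toy stand-in for a verifier: 1 if the pack size matches, else 0."""
    return 1 if query["size"] == item["size"] else 0

# Hypothetical gallery with two near-identical SKUs.
gallery = [
    {"sku": "cola_330ml", "emb": 0.90, "size": "330ml"},
    {"sku": "cola_500ml", "emb": 0.88, "size": "500ml"},
    {"sku": "oat_milk", "emb": 0.10, "size": "1l"},
]
query = {"emb": 0.91, "size": "500ml"}

stage1_top = max(gallery, key=lambda item: embed_score(query, item))
reranked = retrieve_then_verify(query, gallery, embed_score, verify_score, k=2)
```

The verifier only sees the shortlist, so its cost scales with k rather than with the size of the catalog. It also cannot recover a correct item that stage 1 failed to shortlist, which is why a high Recall@5 matters for this design.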
Load-bearing premise
Results on the GroceryVision Challenge zero-shot protocol reflect the fine-grained discrimination needs of real checkout-free retail systems.
What would settle it
A controlled test in which a model trained only on raw web-scraped data achieves higher Recall@1 than a model trained on filtered data, using the same architecture and evaluation protocol, would falsify the claim that data quality is the primary driver.
Original abstract
Multimodal product retrieval (MPR) underpins checkout-free retail and automated inventory systems, yet it demands fine-grained SKU discrimination that standard vision-language benchmarks fail to capture. We present the first systematic zero-shot evaluation of 190 open-source VLMs on the MPR task of the GroceryVision Challenge, isolating pre-training data, architecture, and input resolution. Our analysis yields three actionable findings. (1) Data quality trumps scale. Switching from raw web-scrapes to filtered datasets delivers up to 16.6% accuracy gains, exceeding the benefit of doubling model parameters. (2) Efficient models can win. MobileCLIP-B (150M parameters) outperforms 351M counterparts trained on noisy data. We introduce semantic power density (φ), an efficiency metric that penalizes sub-threshold accuracy. (3) A precision gap persists. State-of-the-art models achieve 94.5% Recall@5 but suffer a 17.5% drop at Recall@1, revealing that contrastive embeddings cluster categories effectively but fail to rank visually similar SKUs. Code and evaluation scripts are available at https://github.com/upeee/openmpr.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents a systematic zero-shot evaluation of 190 open-source vision-language models on the multimodal product retrieval (MPR) task from the GroceryVision Challenge. It isolates the effects of pre-training data quality, model architecture, and input resolution, reporting that filtered datasets yield up to 16.6% accuracy improvements over raw web-scrapes, surpassing gains from doubling parameters. Efficient models like MobileCLIP-B outperform larger ones trained on noisy data, and a new metric 'semantic power density' (φ) is introduced. State-of-the-art models reach 94.5% Recall@5 but drop significantly at Recall@1.
Significance. If the empirical findings are robust, this study provides valuable insights into factors affecting VLM performance in fine-grained product retrieval for retail applications. Highlighting data quality over scale and introducing an efficiency metric could guide practitioners in model selection for resource-limited environments. The availability of code enhances reproducibility and allows for extensions.
major comments (2)
- [Results] Results section (reporting the 16.6% gain): The headline attribution of accuracy improvements to filtered vs. raw pre-training data requires explicit evidence that architecture family and input resolution were matched in the contrasted groups. The abstract asserts isolation of these factors, yet without tabulated within-family comparisons, parameter-matched deltas, or a clear breakdown of how the maximum gain was computed, the causal claim that data quality exceeds the benefit of doubling parameters remains unverified and is load-bearing for finding (1).
- [Experiments] Experiments or Evaluation section: The manuscript should report dataset splits, number of SKUs, and any statistical tests or variance estimates supporting the quantitative claims (e.g., the 17.5% Recall@1 drop and 94.5% Recall@5). Absence of error analysis or confidence intervals makes it difficult to assess whether the precision gap is robust or benchmark-specific.
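One common way to attach an interval to a Recall@K figure from a fixed test set, as the second major comment requests, is a percentile bootstrap over queries. The sketch below is an assumption about how that could be done, not something the manuscript reports; the per-query hit list is synthetic and the function name is illustrative.

```python
import random

def bootstrap_ci(hits, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of per-query 0/1 hits."""
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    n = len(hits)
    means = []
    for _ in range(n_resamples):
        # Resample queries with replacement and recompute recall on the resample.
        sample = [hits[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return lo, hi

# Synthetic example: 100 queries, 80 of which hit at rank 1.
hits = [1] * 80 + [0] * 20
point = sum(hits) / len(hits)
lo, hi = bootstrap_ci(hits)
```

This captures only the sampling variability of the test queries. It says nothing about whether the gap would carry over to a different benchmark, which is the other half of the referee's concern.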
minor comments (2)
- [Methodology] Define and motivate the semantic power density φ metric with its exact formula in the main text (rather than only in supplementary material) so readers can reproduce the efficiency ranking.
- [Appendix] Add a table or appendix listing the exact 190 models with their parameter counts, pre-training data sources, and resolutions to support the isolation claims.
Simulated Author's Rebuttal
We thank the referee for their constructive feedback on our manuscript. We address each major comment below and have revised the manuscript to improve clarity and support for our claims where possible.
- Referee: [Results] Results section (reporting the 16.6% gain): The headline attribution of accuracy improvements to filtered vs. raw pre-training data requires explicit evidence that architecture family and input resolution were matched in the contrasted groups. The abstract asserts isolation of these factors, yet without tabulated within-family comparisons, parameter-matched deltas, or a clear breakdown of how the maximum gain was computed, the causal claim that data quality exceeds the benefit of doubling parameters remains unverified and is load-bearing for finding (1).
  Authors: We acknowledge the need for more explicit controls in presenting the data quality results. In the revised manuscript we have added a table of within-family comparisons (e.g., CLIP-family models) at matched input resolutions, together with a clear description of how the 16.6% maximum gain was obtained from those controlled pairs. We also include specific parameter-matched examples showing that data filtering yields larger gains than doubling parameters within the same data regime. These additions directly support the isolation of factors claimed in the abstract.
  Revision: yes
- Referee: [Experiments] Experiments or Evaluation section: The manuscript should report dataset splits, number of SKUs, and any statistical tests or variance estimates supporting the quantitative claims (e.g., the 17.5% Recall@1 drop and 94.5% Recall@5). Absence of error analysis or confidence intervals makes it difficult to assess whether the precision gap is robust or benchmark-specific.
  Authors: We have expanded the Experiments section to report the dataset splits and the exact number of SKUs used from the GroceryVision Challenge. Because the evaluation is a single deterministic zero-shot pass over a fixed test set, we did not compute variance estimates or statistical tests. We have added an explicit note clarifying this limitation and the deterministic nature of the reported Recall@1 and Recall@5 figures. This revision improves transparency while remaining faithful to the experimental design.
  Revision: partial
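The within-family comparison described in the first response can be sketched as a grouping step: pair runs that share architecture and input resolution, then take the filtered-minus-raw difference. The record fields, model names, and scores below are hypothetical and only illustrate the bookkeeping, not the authors' actual table.

```python
from collections import defaultdict

def matched_data_gains(records):
    """Group runs by (architecture, resolution) and report filtered-minus-raw Recall@1.

    Only groups containing both a 'raw' and a 'filtered' run are kept, so every
    reported gain comes from a pair matched on architecture and resolution.
    If a group has several runs with the same data label, the last one wins.
    """
    groups = defaultdict(dict)
    for r in records:
        groups[(r["arch"], r["resolution"])][r["data"]] = r["recall_at_1"]
    return {key: round(vals["filtered"] - vals["raw"], 4)
            for key, vals in groups.items()
            if "filtered" in vals and "raw" in vals}

# Hypothetical runs; the scores are invented for illustration.
records = [
    {"arch": "ViT-B/16", "resolution": 224, "data": "raw", "recall_at_1": 0.60},
    {"arch": "ViT-B/16", "resolution": 224, "data": "filtered", "recall_at_1": 0.70},
    {"arch": "ViT-L/14", "resolution": 224, "data": "raw", "recall_at_1": 0.66},
    # Different resolution, so this run has no matched raw partner and is excluded.
    {"arch": "ViT-L/14", "resolution": 336, "data": "filtered", "recall_at_1": 0.80},
]

gains = matched_data_gains(records)
```

Dropping unmatched runs is the point of the exercise: the ViT-L/14 pair above differs in resolution as well as data, so any gap between them could not be attributed to data quality alone.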
Circularity Check
No significant circularity; empirical benchmarking with external dataset
The paper conducts a zero-shot empirical evaluation of 190 open-source VLMs on the GroceryVision Challenge MPR task. All reported accuracy gains, including the 16.6% figure for filtered vs. raw data and comparisons to parameter scaling, are direct measurements against an external benchmark rather than derivations, fitted parameters renamed as predictions, or self-referential equations. The introduced semantic power density metric is a new definition for efficiency analysis but is not used to generate any load-bearing predictions that reduce to the inputs by construction. No self-citation chains or uniqueness theorems underpin the central claims. The study is self-contained against the external challenge dataset with no circular reductions.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Zero-shot performance on the GroceryVision Challenge MPR task reflects the models' general pre-training quality without task-specific adaptation.
invented entities (1)
- semantic power density (φ): no independent evidence
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel
  Relation between the paper passage and the cited Recognition theorem: unclear
  Paper passage: We introduce semantic power density (φ), an efficiency metric that penalizes sub-threshold accuracy. φ = (Recall@1 / (1 - Recall@1 + ε))² / N_params × 100
- IndisputableMonolith/Foundation/RealityFromDistinction.lean, reality_from_one_distinction
  Relation between the paper passage and the cited Recognition theorem: unclear
  Paper passage: Switching from raw web-scrapes to filtered datasets delivers up to 16.6% accuracy gains, exceeding the benefit of doubling model parameters.
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.