What Matters for Grocery Product Retrieval with Open Source Vision Language Models

Emmanuel G. Maminta; Rowel O. Atienza

arxiv: 2605.18029 · v1 · pith:EQAFWJ5Znew · submitted 2026-05-18 · 💻 cs.CV

Pith reviewed 2026-05-20 11:56 UTC · model grok-4.3

classification 💻 cs.CV

keywords: multimodal product retrieval · vision language models · zero-shot evaluation · grocery retrieval · data quality · fine-grained discrimination · efficiency metric

The pith

Data quality beats model size for grocery product retrieval accuracy.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests 190 open-source vision-language models in zero-shot mode on a grocery product matching task to determine what actually drives performance. It shows that training on filtered rather than raw web-scraped data produces accuracy gains as large as 16.6 percent, larger than the benefit obtained by doubling model parameters. Smaller efficient models trained on clean data can surpass much larger models trained on noisy data, and the work introduces an efficiency metric called semantic power density. Even top models reach high recall when only category-level matching is required but drop sharply when forced to rank visually similar individual stock-keeping units.

Core claim

In zero-shot multimodal product retrieval on the GroceryVision Challenge, pre-training data quality is the dominant factor over scale and architecture; filtered datasets deliver up to 16.6 percent accuracy gains that exceed the gains from doubling parameters, efficient models such as MobileCLIP-B can outperform larger noisy-trained counterparts, and state-of-the-art models reach 94.5 percent Recall@5 yet fall 17.5 percent at Recall@1 because contrastive embeddings separate categories but do not reliably order near-identical SKUs.
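
The gap between Recall@5 and Recall@1 in the core claim can be illustrated with a minimal sketch. The function name `recall_at_k` and the toy similarity scores below are assumptions made for illustration; they are not the paper's evaluation code or data.

```python
def recall_at_k(scores, targets, k):
    """Fraction of probes whose true SKU appears among the k highest-scoring SKUs.

    scores:  list of per-probe lists, scores[i][j] = similarity of probe i to SKU j
    targets: list of true SKU indices, one per probe
    """
    hits = 0
    for row, target in zip(scores, targets):
        # Indices of the SKUs sorted from most to least similar.
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if target in ranked[:k]:
            hits += 1
    return hits / len(targets)


# Toy similarity scores (made up for illustration): 4 probes, 5 SKUs.
scores = [
    [0.91, 0.90, 0.20, 0.10, 0.05],  # true SKU 0 ranked first
    [0.88, 0.89, 0.15, 0.12, 0.08],  # true SKU 0 narrowly loses to SKU 1
    [0.10, 0.20, 0.95, 0.30, 0.25],  # true SKU 2 ranked first
    [0.05, 0.10, 0.60, 0.61, 0.20],  # true SKU 2 narrowly loses to SKU 3
]
targets = [0, 0, 2, 2]

r1 = recall_at_k(scores, targets, k=1)
r5 = recall_at_k(scores, targets, k=5)
print(f"Recall@1 = {r1:.2f}, Recall@5 = {r5:.2f}")
```

In this toy case every true SKU sits near the top of its ranking, so Recall@5 is perfect, while two near-ties with a look-alike SKU halve Recall@1. That is the qualitative pattern the core claim describes, shown here on invented numbers.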

What carries the argument

Semantic power density (φ), an efficiency metric that penalizes models whose accuracy falls below a chosen threshold, used to compare models across data quality, size, and resolution.
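
The formula for φ quoted elsewhere on this page, φ = (Recall@1 / (1 − Recall@1 + ε))² / N_params × 100, can be sketched in a few lines. The function name, the default value of ε, and the choice of millions as the parameter unit are assumptions; the page does not state them. The recall values in the example are hypothetical, and only the 150M and 351M parameter counts come from the abstract.

```python
def semantic_power_density(recall_at_1, n_params, eps=1e-6):
    """phi = (R / (1 - R + eps))**2 / n_params * 100, as quoted from the paper.

    recall_at_1: Recall@1 as a fraction in [0, 1]
    n_params:    parameter count; the unit (assumed here: millions) is an assumption
    eps:         small constant guarding against division by zero; value assumed
    """
    if not 0.0 <= recall_at_1 <= 1.0:
        raise ValueError("recall_at_1 must be a fraction between 0 and 1")
    if n_params <= 0:
        raise ValueError("n_params must be positive")
    odds = recall_at_1 / (1.0 - recall_at_1 + eps)
    return odds ** 2 / n_params * 100.0


# Hypothetical recall values, not results from the paper.
small_clean = semantic_power_density(0.75, n_params=150)
large_noisy = semantic_power_density(0.70, n_params=351)
print(small_clean > large_noisy)
```

The quoted formula has no explicit threshold parameter. The squared odds term appears to do the penalizing: when Recall@1 is below 0.5 the odds are below 1 and squaring shrinks them further, so a very small model with poor accuracy earns little credit for its size.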

If this is right

Retail deployments can gain more from curating training data than from increasing model size.
Resource-limited settings can adopt smaller models such as MobileCLIP-B without sacrificing accuracy if clean data is used.
Current contrastive training leaves a ranking gap that must be addressed for precise SKU-level decisions.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Data-curation pipelines may prove more cost-effective than continued parameter scaling for other fine-grained retail or inventory tasks.
Hybrid systems that combine embedding retrieval with a second verification stage could close the observed Recall@1 gap.
The same data-quality priority may apply when adapting these models to visually similar domains such as fashion or electronics catalogs.

Load-bearing premise

Results on the GroceryVision Challenge zero-shot protocol reflect the fine-grained discrimination needs of real checkout-free retail systems.

What would settle it

A controlled test in which a model trained only on raw web-scraped data achieves higher Recall@1 than a model trained on filtered data, using the same architecture and evaluation protocol, would falsify the claim that data quality is the primary driver.

Figures

Figures reproduced from arXiv: 2605.18029 by Emmanuel G. Maminta, Rowel O. Atienza.

**Figure 1.** The MPR inference protocol. MPR frames recognition as a ranking task. For each probe image, cross-modal similarity sorts SKUs by relevance, enabling zero-shot identification. … have fine-tuned VLMs for retail domains [29] or evaluated specific architectures [6], but no systematic zero-shot benchmark isolates the effects of pre-training data, architecture, and input resolution. To bridge this gap, we conduct …

**Figure 2.** Caption refinement. Llama-3.1-8B-Instruct compresses descriptions exceeding 77 tokens into concise captions while retaining key visual attributes. The MPR dataset from the GroceryVision Challenge [9], released under CC BY-NC 4.0, which permits non-commercial use with attribution, contains 74,200 training images across 409 SKUs, subsampled to 12,944 front-facing perspectives. Original catalog metadata freque…

**Figure 3.** Architecture of the MPR pipeline. The VLM encoders (fθ, gϕ) map the probe image v and catalog descriptions {ti} into a shared embedding space. We compute the cosine similarity s(v, t) to rank the 409 SKUs.

**Figure 4.** The edge-centric efficiency landscape. Semantic power density (ϕ) vs. model size. Gray dots show the full population (N = 190). Blue circles mark high-efficiency MobileCLIP models. Red squares indicate poor ROI. Triangles represent massive models with low density due to parameter bloat. MobileCLIP occupies the sweet spot for edge deployment.

**Figure 5.** Manifold collapse illustrated. Embedding projection for a representative query. The model must distinguish Chicken Corn Chowder from Chicken & Dumplings. Shared visual attributes (red) cause vectors to collapse into a narrow cone of confusion (Δθ ≈ 4°). Dot product cannot separate them. 6 Conclusion. Our benchmark of 190 VLMs is a diagnostic study of multimodal product retrieval. Three findings challenge …
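
The ranking step described in Figures 1 and 3, and the narrow angular margin in Figure 5, can be sketched with toy vectors. This is a minimal illustration under stated assumptions: the 2-D embeddings are invented, and the helper names `cosine`, `rank_skus`, and `angle_deg` are hypothetical. Real VLM embeddings are high-dimensional and come from the encoders, which this sketch does not call.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_skus(probe, catalog):
    """Return catalog indices sorted from most to least similar to the probe."""
    sims = [cosine(probe, t) for t in catalog]
    return sorted(range(len(catalog)), key=lambda i: sims[i], reverse=True)


def angle_deg(u, v):
    """Angle between two vectors in degrees (clamped for numerical safety)."""
    c = max(-1.0, min(1.0, cosine(u, v)))
    return math.degrees(math.acos(c))


# Toy 2-D text embeddings (hypothetical): SKUs 0 and 1 are 4 degrees apart,
# SKU 2 belongs to an unrelated category.
theta = math.radians(4.0)
catalog = [
    [1.0, 0.0],
    [math.cos(theta), math.sin(theta)],
    [0.0, 1.0],
]
probe = [math.cos(math.radians(1.5)), math.sin(math.radians(1.5))]  # closer to SKU 0

ranking = rank_skus(probe, catalog)
gap = cosine(probe, catalog[0]) - cosine(probe, catalog[1])
print(ranking, round(angle_deg(catalog[0], catalog[1]), 3), gap)
```

With the two look-alike SKUs only 4 degrees apart, their cosine scores for this probe differ by less than 0.001, while the unrelated SKU is far behind. A small perturbation of the probe could swap the top two positions without pushing either out of the top five, which is consistent with the Recall@1 versus Recall@5 pattern reported in the abstract.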

Abstract

Multimodal product retrieval (MPR) underpins checkout-free retail and automated inventory systems, yet it demands fine-grained SKU discrimination that standard vision-language benchmarks fail to capture. We present the first systematic zero-shot evaluation of 190 open-source VLMs on the MPR task of the GroceryVision Challenge, isolating pre-training data, architecture, and input resolution. Our analysis yields three actionable findings. **(1) Data quality trumps scale.** Switching from raw web-scrapes to filtered datasets delivers up to 16.6% accuracy gains, exceeding the benefit of doubling model parameters. **(2) Efficient models can win.** MobileCLIP-B (150M parameters) outperforms 351M counterparts trained on noisy data. We introduce *semantic power density* (φ), an efficiency metric that penalizes sub-threshold accuracy. **(3) A precision gap persists.** State-of-the-art models achieve 94.5% Recall@5 but suffer a 17.5% drop at Recall@1, revealing that contrastive embeddings cluster categories effectively but fail to rank visually similar SKUs. Code and evaluation scripts are available at https://github.com/upeee/openmpr.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript presents a systematic zero-shot evaluation of 190 open-source vision-language models on the multimodal product retrieval (MPR) task from the GroceryVision Challenge. It isolates the effects of pre-training data quality, model architecture, and input resolution, reporting that filtered datasets yield up to 16.6% accuracy improvements over raw web-scrapes, surpassing gains from doubling parameters. Efficient models like MobileCLIP-B outperform larger ones trained on noisy data, and a new metric, semantic power density (φ), is introduced. State-of-the-art models reach 94.5% Recall@5 but suffer a 17.5% drop at Recall@1.

Significance. If the empirical findings are robust, this study provides valuable insights into factors affecting VLM performance in fine-grained product retrieval for retail applications. Highlighting data quality over scale and introducing an efficiency metric could guide practitioners in model selection for resource-limited environments. The availability of code enhances reproducibility and allows for extensions.

major comments (2)

[Results] Results section (reporting the 16.6% gain): The headline attribution of accuracy improvements to filtered vs. raw pre-training data requires explicit evidence that architecture family and input resolution were matched in the contrasted groups. The abstract asserts isolation of these factors, yet without tabulated within-family comparisons, parameter-matched deltas, or a clear breakdown of how the maximum gain was computed, the causal claim that data quality exceeds the benefit of doubling parameters remains unverified and is load-bearing for finding (1).
[Experiments] Experiments or Evaluation section: The manuscript should report dataset splits, number of SKUs, and any statistical tests or variance estimates supporting the quantitative claims (e.g., the 17.5% Recall@1 drop and 94.5% Recall@5). Absence of error analysis or confidence intervals makes it difficult to assess whether the precision gap is robust or benchmark-specific.

minor comments (2)

[Methodology] Define and motivate the semantic power density φ metric with its exact formula in the main text (rather than only in supplementary material) so readers can reproduce the efficiency ranking.
[Appendix] Add a table or appendix listing the exact 190 models with their parameter counts, pre-training data sources, and resolutions to support the isolation claims.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback on our manuscript. We address each major comment below and have revised the manuscript to improve clarity and support for our claims where possible.

Referee: [Results] Results section (reporting the 16.6% gain): The headline attribution of accuracy improvements to filtered vs. raw pre-training data requires explicit evidence that architecture family and input resolution were matched in the contrasted groups. The abstract asserts isolation of these factors, yet without tabulated within-family comparisons, parameter-matched deltas, or a clear breakdown of how the maximum gain was computed, the causal claim that data quality exceeds the benefit of doubling parameters remains unverified and is load-bearing for finding (1).

Authors: We acknowledge the need for more explicit controls in presenting the data quality results. In the revised manuscript we have added a table of within-family comparisons (e.g., CLIP-family models) at matched input resolutions, together with a clear description of how the 16.6% maximum gain was obtained from those controlled pairs. We also include specific parameter-matched examples showing that data filtering yields larger gains than doubling parameters within the same data regime. These additions directly support the isolation of factors claimed in the abstract.

revision: yes

Referee: [Experiments] Experiments or Evaluation section: The manuscript should report dataset splits, number of SKUs, and any statistical tests or variance estimates supporting the quantitative claims (e.g., the 17.5% Recall@1 drop and 94.5% Recall@5). Absence of error analysis or confidence intervals makes it difficult to assess whether the precision gap is robust or benchmark-specific.

Authors: We have expanded the Experiments section to report the dataset splits and the exact number of SKUs used from the GroceryVision Challenge. Because the evaluation is a single deterministic zero-shot pass over a fixed test set, we did not compute variance estimates or statistical tests. We have added an explicit note clarifying this limitation and the deterministic nature of the reported Recall@1 and Recall@5 figures. This revision improves transparency while remaining faithful to the experimental design.

revision: partial

Circularity Check

0 steps flagged

No significant circularity; empirical benchmarking with external dataset

The paper conducts a zero-shot empirical evaluation of 190 open-source VLMs on the GroceryVision Challenge MPR task. All reported accuracy gains, including the 16.6% figure for filtered vs. raw data and comparisons to parameter scaling, are direct measurements against an external benchmark rather than derivations, fitted parameters renamed as predictions, or self-referential equations. The introduced semantic power density metric is a new definition for efficiency analysis but is not used to generate any load-bearing predictions that reduce to the inputs by construction. No self-citation chains or uniqueness theorems underpin the central claims. The study is self-contained against the external challenge dataset with no circular reductions.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claims rest on the representativeness of the GroceryVision Challenge as a proxy for retail MPR and on standard zero-shot evaluation assumptions in the VLM literature.

axioms (1)

Domain assumption: Zero-shot performance on the GroceryVision Challenge MPR task reflects the models' general pre-training quality without task-specific adaptation.
The evaluation isolates pre-training data, architecture, and resolution under a zero-shot protocol.

invented entities (1)

semantic power density (φ): no independent evidence
purpose: Efficiency metric that penalizes models falling below a useful accuracy threshold when comparing models of different sizes.
Newly introduced in the paper to quantify the trade-off between accuracy and parameter count.

pith-pipeline@v0.9.0 · 5749 in / 1373 out tokens · 70730 ms · 2026-05-20T11:56:02.600387+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Paper passage: We introduce semantic power density (φ), an efficiency metric that penalizes sub-threshold accuracy. φ = (Recall@1 / (1 − Recall@1 + ε))² / N_params × 100

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear

Paper passage: Switching from raw web-scrapes to filtered datasets delivers up to 16.6% accuracy gains, exceeding the benefit of doubling model parameters.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

39 extracted references · 39 canonical work pages · 7 internal anchors

[1] Bai, Y., Chen, Y., Yu, W., Wang, L., Zhang, W.: Products-10k: A large-scale product recognition dataset. arXiv preprint arXiv:2008.10545 (2020)
[2] Bolya, D., Huang, P.Y., Sun, P., Cho, J.H., Madotto, A., Wei, C., Ma, T., Zhi, J., Rajasegaran, J., Bangalath, H., Wang, J., Monteiro, M., Xu, H., Dong, S., Ravi, N., Li, S.W., Dollar, P., Feichtenhofer, C.: Perception encoder: The best visual embeddings are not at the output of the network. In: Belgrave, D., Zhang, C., Lin, H., Pascanu, R., Koniusz, P., ...
[3] Chen, J., Yu, Q., Shen, X., Yuille, A., Chen, L.C.: Vitamin: Designing scalable vision models in the vision-language era. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 12954–12966 (2024)
[4] Chen, X., Wang, X., Changpinyo, S., Piergiovanni, A., Padlewski, P., Salz, D., Goodman, S., Grycner, A., Mustafa, B., Beyer, L., et al.: Pali: A jointly-scaled multilingual language-image model. In: International Conference on Learning Representations (ICLR) (2023)
[5] Chuang, Y.S., Li, Y., Wang, D., Yeh, C.F., Lyu, K., Raghavendra, R., Glass, J., Huang, L., Weston, J., Zettlemoyer, L., et al.: Meta clip 2: A worldwide scaling recipe. Advances in Neural Information Processing Systems 38, 48009–48036 (2026)
[6] Czerwinska, U., Bircanoglu, C., Chamoux, J.: Benchmarking image embeddings for e-commerce: Evaluating off-the shelf foundation models, fine-tuning strategies and practical trade-offs. arXiv preprint arXiv:2504.07567 (2025)
[7] Dehghani, M., Tay, Y., Arnab, A., Beyer, L., Vaswani, A.: The efficiency misnomer. In: International Conference on Learning Representations (2022)
[8] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An image is worth 16x16 words: Transformers for image recognition at scale. In: International Conference on Learning Representations (2021)
[9] Fan, Q., Li, W., Miao, S., Ma, S.: The 4th groceryvision challenge: Iccv25 retailvision workshop. https://grocery-vision.github.io/past_challenge/iccv2025.html (2025), accessed: 2025-11-27
[10] Gadre, S.Y., Ilharco, G., Fang, A., Hayase, J., Smyrnis, G., Nguyen, T., Marten, R., Wortsman, M., Ghosh, D., Zhang, J., et al.: Datacomp: In search of the next generation of multimodal datasets. Advances in Neural Information Processing Systems 36, 27092–27112 (2023)
[11] Grattafiori, A., et al.: The llama 3 herd of models (2024), https://arxiv.org/abs/2407.21783
[12] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR). pp. 770–778 (2016)
[13] Hoffmann, J., Borgeaud, S., Mensch, A., Buchatskaya, E., Cai, T., Rutherford, E., de Las Casas, D., Hendricks, L.A., Welbl, J., Clark, A., et al.: Training compute-optimal large language models. In: Proceedings of the 36th International Conference on Neural Information Processing Systems. pp. 30016–30030 (2022)
[14] Ilharco, G., Wortsman, M., Wightman, R., Gordon, C., Carlini, N., Taori, R., Dave, A., Shankar, V., Namkoong, H., Miller, J., Hajishirzi, H., Farhadi, A., Schmidt, L.: Openclip (Jul 2021). https://doi.org/10.5281/zenodo.5143773
[15] Khattab, O., Zaharia, M.: ColBERT: Efficient and effective passage search via contextualized late interaction over BERT. In: Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval. pp. 39–48 (2020)
[16] Leutenegger, S., Chli, M., Siegwart, R.Y.: Brisk: Binary robust invariant scalable keypoints. In: 2011 International conference on computer vision. pp. 2548–2555. IEEE (2011)
[17] Li, C., Liu, H., Li, L., Zhang, P., Aneja, J., Yang, J., Jin, P., Hu, H., Liu, Z., Lee, Y.J., et al.: Elevater: A benchmark and toolkit for evaluating language-augmented visual models. Advances in Neural Information Processing Systems 35, 9287–9301 (2022)
[18] Li, X., Wang, Z., Xie, C.: An inverse scaling law for clip training. Advances in Neural Information Processing Systems 36, 49068–49087 (2023)
[19] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: Roberta: A robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692 (2019)
[20] Liu, Z., Mao, H., Wu, C.Y., Feichtenhofer, C., Darrell, T., Xie, S.: A convnet for the 2020s. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 11976–11986 (2022)
[21] Lowe, D.G.: Distinctive image features from scale-invariant keypoints. International journal of computer vision 60(2), 91–110 (2004)
[22] Marafioti, A., Zohar, O., Farré, M., Noyan, M., Bakouch, E., Cuenca, P., Zakka, C., Allal, L.B., Lozhkov, A., Tazi, N., et al.: Smolvlm: Redefining small and efficient multimodal models. arXiv preprint arXiv:2504.05299 (2025)
[23] Nogueira, R., Cho, K.: Passage re-ranking with BERT. arXiv preprint arXiv:1901.04085 (2019)
[24] Oord, A.v.d., Li, Y., Vinyals, O.: Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748 (2018)
[25] Peng, J., Xiao, C., Li, Y.: Rp2k: A large-scale retail product dataset for fine-grained image classification. arXiv preprint arXiv:2006.12634 (2020)
[26] Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International conference on machine learning. pp. 8748–8763. PMLR (2021)
[27] Schuhmann, C., Beaumont, R., Vencu, R., Gordon, C., Wightman, R., Cherti, M., Coombes, T., Katta, A., Mullis, C., Wortsman, M., et al.: Laion-5b: An open large-scale dataset for training next generation image-text models. Advances in neural information processing systems 35, 25278–25294 (2022)
[28] Srivastava, M.M.: Bag of tricks for retail product image classification. In: International Conference on Image Analysis and Recognition. pp. 71–82. Springer (2020)
[29] Srivastava, M.M.: Retailklip: Finetuning openclip backbone using metric learning on a single gpu for zero-shot retail product image classification. arXiv preprint arXiv:2312.10282 (2023)
[30] Srivastava, S., Wu, K.: Sgbd: Sharpness-aware mirror gradient with blip-based denoising for robust multimodal product recommendation. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) Workshops. pp. 2380–2389 (2025)
[31] Sun, Q., Fang, Y., Wu, L., Wang, X., Cao, Y.: Eva-clip: Improved training techniques for clip at scale. arXiv preprint arXiv:2303.15389 (2023)
[32] Thomee, B., Shamma, D.A., Friedland, G., Elizalde, B., Ni, K., Poland, D., Borth, D., Li, L.J.: Yfcc100m: The new data in multimedia research. Communications of the ACM 59(2), 64–73 (2016)
[33] Tonioni, A., Di Stefano, L.: Domain invariant hierarchical embedding for grocery products recognition. Computer Vision and Image Understanding 182, 81–92 (2019)
[34] Tschannen, M., Gritsenko, A., Wang, X., Naeem, M.F., Alabdulmohsin, I., Parthasarathy, N., Evans, T., Beyer, L., Xia, Y., Mustafa, B., Hénaff, O., Harmsen, J., Steiner, A., Zhai, X.: Siglip 2: Multilingual vision-language encoders with improved semantic understanding, localization, and dense features. arXiv preprint arXiv:2502.14786 (2025)
[35] Vasu, P.K.A., Pouransari, H., Faghri, F., Vemulapalli, R., Tuzel, O.: Mobileclip: Fast image-text models through multi-modal reinforced training. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 15963–15974 (2024)
[36] Visheratin, A.: Nllb-clip–train performant multilingual image retrieval model on a budget. arXiv preprint arXiv:2309.01859 (2023)
[37] Xu, H., Xie, S., Tan, X., Huang, P.Y., Howes, R., Sharma, V., Li, S.W., Ghosh, G., Zettlemoyer, L., Feichtenhofer, C.: Demystifying clip data. In: International Conference on Learning Representations (2024)
[38] Yu, J., Wang, Z., Vasudevan, V., Yeung, L., Seyedhosseini, M., Wu, Y.: Coca: Contrastive captioners are image-text foundation models. Transactions on Machine Learning Research (2022)
[39] Zhai, X., Mustafa, B., Kolesnikov, A., Beyer, L.: Sigmoid loss for language image pre-training. In: Proceedings of the IEEE/CVF international conference on computer vision. pp. 11975–11986 (2023)

[1] [1]

Products-10K: A large-scale product recognition dataset

Bai, Y., Chen, Y., Yu, W., Wang, L., Zhang, W.: Products-10k: A large-scale product recognition dataset. arXiv preprint arXiv:2008.10545 (2020)

work page arXiv 2008

[2] [2]

In: Belgrave, D., Zhang, C., Lin, H., Pascanu, R., Koniusz, P., Ghassemi, M., Chen, N

Bolya, D., Huang, P.Y., Sun, P., Cho, J.H., Madotto, A., Wei, C., Ma, T., Zhi, J., Rajasegaran, J., Bangalath, H., Wang, J., Monteiro, M., Xu, H., Dong, S., Ravi, N., Li, S.W., Dollar, P., Feichtenhofer, C.: Perception encoder: The best visual embeddings are not at the output of the network. In: Belgrave, D., Zhang, C., Lin, H., Pascanu, R., Koniusz, P., ...

work page 2025

[3] [3]

In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

Chen, J., Yu, Q., Shen, X., Yuille, A., Chen, L.C.: Vitamin: Designing scalable vision models in the vision-language era. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 12954–12966 (2024)

work page 2024

[4] [4]

In: International Conference on Learning Rep- resentations (ICLR) (2023)

Chen, X., Wang, X., Changpinyo, S., Piergiovanni, A., Padlewski, P., Salz, D., Goodman, S., Grycner, A., Mustafa, B., Beyer, L., et al.: Pali: A jointly-scaled multilingual language-image model. In: International Conference on Learning Rep- resentations (ICLR) (2023)

work page 2023

[5] [5]

Advances in Neural Information Processing Systems38, 48009–48036 (2026)

Chuang, Y.S., Li, Y., Wang, D., Yeh, C.F., Lyu, K., Raghavendra, R., Glass, J., Huang, L., Weston, J., Zettlemoyer, L., et al.: Meta clip 2: A worldwide scaling recipe. Advances in Neural Information Processing Systems38, 48009–48036 (2026)

work page 2026

[6] [6]

arXiv preprint arXiv:2504.07567 (2025) 14 E

Czerwinska, U., Bircanoglu, C., Chamoux, J.: Benchmarking image embeddings for e-commerce: Evaluating off-the shelf foundation models, fine-tuning strategies and practical trade-offs. arXiv preprint arXiv:2504.07567 (2025) 14 E. G. Maminta, R. O. Atienza

work page arXiv 2025

[7] [7]

In: International Conference on Learning Representations (2022)

Dehghani, M., Tay, Y., Arnab, A., Beyer, L., Vaswani, A.: The efficiency misnomer. In: International Conference on Learning Representations (2022)

work page 2022

[8] [8]

In: International Conference on Learning Representations (2021)

Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An image is worth 16x16 words: Transformers for image recognition at scale. In: International Conference on Learning Representations (2021)

work page 2021

[9] [9]

https://grocery-vision.github.io/past_challenge/iccv2025.html (2025), accessed: 2025-11-27

Fan, Q., Li, W., Miao, S., Ma, S.: The 4th groceryvision challenge: Iccv25 retailvision workshop. https://grocery-vision.github.io/past_challenge/iccv2025.html (2025), accessed: 2025-11-27

work page 2025

[10] [10]

Advances in Neural Information Processing Systems36, 27092–27112 (2023)

Gadre, S.Y., Ilharco, G., Fang, A., Hayase, J., Smyrnis, G., Nguyen, T., Marten, R., Wortsman, M., Ghosh, D., Zhang, J., et al.: Datacomp: In search of the next generation of multimodal datasets. Advances in Neural Information Processing Systems36, 27092–27112 (2023)

work page 2023

[11] [11]

Grattafiori, A., et al.: The llama 3 herd of models (2024), https://arxiv.org/abs/2407.21783

work page internal anchor Pith review Pith/arXiv arXiv 2024

[12] [12]

In: Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR)

He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR). pp. 770–778 (2016)

work page 2016

[13] [13]

In: Proceedings of the 36th International Conference on Neural Information Processing Systems

Hoffmann, J., Borgeaud, S., Mensch, A., Buchatskaya, E., Cai, T., Rutherford, E., de Las Casas, D., Hendricks, L.A., Welbl, J., Clark, A., et al.: Training compute- optimal large language models. In: Proceedings of the 36th International Conference on Neural Information Processing Systems. pp. 30016–30030 (2022)

work page 2022

[14] [14]

Ilharco, M

Ilharco, G., Wortsman, M., Wightman, R., Gordon, C., Carlini, N., Taori, R., Dave, A., Shankar, V., Namkoong, H., Miller, J., Hajishirzi, H., Farhadi, A., Schmidt, L.: Openclip (Jul 2021). https://doi.org/10.5281/zenodo.5143773, https://doi.org/10.5281/zenodo.5143773

[15]

Khattab, O., Zaharia, M.: ColBERT: Efficient and effective passage search via contextualized late interaction over BERT. In: Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval. pp. 39–48 (2020)

[16]

Leutenegger, S., Chli, M., Siegwart, R.Y.: Brisk: Binary robust invariant scalable keypoints. In: 2011 International conference on computer vision. pp. 2548–2555. IEEE (2011)

[17]

Li, C., Liu, H., Li, L., Zhang, P., Aneja, J., Yang, J., Jin, P., Hu, H., Liu, Z., Lee, Y.J., et al.: Elevater: A benchmark and toolkit for evaluating language-augmented visual models. Advances in Neural Information Processing Systems 35, 9287–9301 (2022)

[18]

Li, X., Wang, Z., Xie, C.: An inverse scaling law for clip training. Advances in Neural Information Processing Systems 36, 49068–49087 (2023)

[19]

Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: Roberta: A robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692 (2019)

[20]

Liu, Z., Mao, H., Wu, C.Y., Feichtenhofer, C., Darrell, T., Xie, S.: A convnet for the 2020s. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 11976–11986 (2022)

[21]

Lowe, D.G.: Distinctive image features from scale-invariant keypoints. International journal of computer vision 60(2), 91–110 (2004)

[22]

Marafioti, A., Zohar, O., Farré, M., Noyan, M., Bakouch, E., Cuenca, P., Zakka, C., Allal, L.B., Lozhkov, A., Tazi, N., et al.: Smolvlm: Redefining small and efficient multimodal models. arXiv preprint arXiv:2504.05299 (2025)

[23]

Nogueira, R., Cho, K.: Passage re-ranking with BERT. arXiv preprint arXiv:1901.04085 (2019)

[24]

Oord, A.v.d., Li, Y., Vinyals, O.: Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748 (2018)

[25]

Peng, J., Xiao, C., Li, Y.: Rp2k: A large-scale retail product dataset for fine-grained image classification. arXiv preprint arXiv:2006.12634 (2020)

[26]

Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International conference on machine learning. pp. 8748–8763. PMLR (2021)

[27]

Schuhmann, C., Beaumont, R., Vencu, R., Gordon, C., Wightman, R., Cherti, M., Coombes, T., Katta, A., Mullis, C., Wortsman, M., et al.: Laion-5b: An open large-scale dataset for training next generation image-text models. Advances in neural information processing systems 35, 25278–25294 (2022)

[28]

Srivastava, M.M.: Bag of tricks for retail product image classification. In: International Conference on Image Analysis and Recognition. pp. 71–82. Springer (2020)

[29]

Srivastava, M.M.: Retailklip: Finetuning openclip backbone using metric learning on a single gpu for zero-shot retail product image classification. arXiv preprint arXiv:2312.10282 (2023)

[30]

Srivastava, S., Wu, K.: Sgbd: Sharpness-aware mirror gradient with blip-based denoising for robust multimodal product recommendation. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) Workshops. pp. 2380–2389 (2025)

[31]

Sun, Q., Fang, Y., Wu, L., Wang, X., Cao, Y.: Eva-clip: Improved training techniques for clip at scale. arXiv preprint arXiv:2303.15389 (2023)

[32]

Thomee, B., Shamma, D.A., Friedland, G., Elizalde, B., Ni, K., Poland, D., Borth, D., Li, L.J.: Yfcc100m: The new data in multimedia research. Communications of the ACM 59(2), 64–73 (2016)

[33]

Tonioni, A., Di Stefano, L.: Domain invariant hierarchical embedding for grocery products recognition. Computer Vision and Image Understanding 182, 81–92 (2019)

[34]

Tschannen, M., Gritsenko, A., Wang, X., Naeem, M.F., Alabdulmohsin, I., Parthasarathy, N., Evans, T., Beyer, L., Xia, Y., Mustafa, B., Hénaff, O., Harmsen, J., Steiner, A., Zhai, X.: Siglip 2: Multilingual vision-language encoders with improved semantic understanding, localization, and dense features. arXiv preprint arXiv:2502.14786 (2025)

[35]

Vasu, P.K.A., Pouransari, H., Faghri, F., Vemulapalli, R., Tuzel, O.: Mobileclip: Fast image-text models through multi-modal reinforced training. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 15963–15974 (2024)

[36]

Visheratin, A.: Nllb-clip–train performant multilingual image retrieval model on a budget. arXiv preprint arXiv:2309.01859 (2023)

[37]

Xu, H., Xie, S., Tan, X., Huang, P.Y., Howes, R., Sharma, V., Li, S.W., Ghosh, G., Zettlemoyer, L., Feichtenhofer, C.: Demystifying clip data. In: International Conference on Learning Representations (2024)

[38]

Yu, J., Wang, Z., Vasudevan, V., Yeung, L., Seyedhosseini, M., Wu, Y.: Coca: Contrastive captioners are image-text foundation models. Transactions on Machine Learning Research (2022)

[39]

Zhai, X., Mustafa, B., Kolesnikov, A., Beyer, L.: Sigmoid loss for language image pre-training. In: Proceedings of the IEEE/CVF international conference on computer vision. pp. 11975–11986 (2023)
