FoodX-251: A Dataset for Fine-grained Food Classification
Pith reviewed 2026-05-24 22:09 UTC · model grok-4.3
The pith
FoodX-251 supplies 251 fine-grained food categories and 158k web images to train and test deep models.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We introduce FoodX-251, a dataset of 251 fine-grained food categories with 158k images collected from the web. We use 118k images as a training set and provide human verified labels for 40k images that can be used for validation and testing. The procedure of creating this dataset is outlined and relevant baselines with deep learning models are provided.
What carries the argument
The FoodX-251 dataset, which organizes 251 categories into a web-sourced training split plus a human-verified validation and test split to serve as a benchmark resource.
If this is right
- Models trained on the 118k images can be evaluated reliably on the 40k verified set for fair comparisons across methods.
- The dataset enables challenges and shared benchmarks that standardize progress measurement in fine-grained food tasks.
- The outlined collection and verification procedure can be repeated to expand the number of categories or images over time.
Where Pith is reading between the lines
- The same web-scraping plus verification pipeline could be applied to create comparable resources for other domains with high visual similarity, such as plant species or vehicle models.
- Combining the image set with recipe text or ingredient lists might produce multimodal models that outperform image-only baselines on the same categories.
- If models reach high accuracy on FoodX-251, they become candidates for deployment in mobile apps that log meals from photos.
Load-bearing premise
Web-sourced images, once filtered and human-verified according to the described steps, form a representative and correctly labeled collection that improves training and evaluation of food classification models.
What would settle it
Re-inspect a random sample of the 40k verified images for label errors or train several standard deep models and measure whether accuracy improves substantially over prior smaller food datasets; high error rates or flat performance gains would undermine the claim.
Figures
read the original abstract
Food classification is a challenging problem due to the large number of categories, high visual similarity between different foods, as well as the lack of datasets for training state-of-the-art deep models. Solving this problem will require advances in both computer vision models as well as datasets for evaluating these models. In this paper we focus on the second aspect and introduce FoodX-251, a dataset of 251 fine-grained food categories with 158k images collected from the web. We use 118k images as a training set and provide human verified labels for 40k images that can be used for validation and testing. In this work, we outline the procedure of creating this dataset and provide relevant baselines with deep learning models. The FoodX-251 dataset has been used for organizing iFood-2019 challenge in the Fine-Grained Visual Categorization workshop (FGVC6 at CVPR 2019) and is available for download.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces FoodX-251, a dataset of 251 fine-grained food categories containing 158k web-collected images. It designates 118k images as a training set and supplies human-verified labels on 40k images for validation and testing. The manuscript outlines the dataset creation procedure, reports baseline results obtained with deep learning models, and notes that the dataset was used to organize the iFood-2019 challenge at FGVC6 (CVPR 2019).
Significance. If the collection and verification procedures produce a representative sample with reliably accurate labels, FoodX-251 would constitute a useful addition to the set of resources available for fine-grained visual categorization, particularly for food images where existing datasets are limited. The fact that the dataset has already supported an organized challenge supplies independent evidence of its practical utility for benchmarking.
major comments (2)
- [Abstract] Abstract: the statement that the 40k images carry 'human verified labels' is presented without any accompanying description of the verification protocol, number of annotators per image, inter-annotator agreement statistics, or quantitative label-accuracy measurements. These details are load-bearing for the claim that the split can be used for reliable validation and testing of state-of-the-art models.
- [Abstract] Abstract: baselines with deep learning models are asserted to be provided, yet the supplied text contains neither numerical performance figures, error analysis, nor comparisons against prior food datasets. Without these results it is impossible to gauge whether FoodX-251 poses a meaningfully harder or more representative benchmark.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on the abstract. We address each comment below and will revise the manuscript to improve the abstract's informativeness while preserving its length constraints.
read point-by-point responses
-
Referee: [Abstract] Abstract: the statement that the 40k images carry 'human verified labels' is presented without any accompanying description of the verification protocol, number of annotators per image, inter-annotator agreement statistics, or quantitative label-accuracy measurements. These details are load-bearing for the claim that the split can be used for reliable validation and testing of state-of-the-art models.
Authors: We agree that the abstract would be strengthened by briefly summarizing the verification protocol. The full manuscript (Section 3) outlines the human verification process for the 40k images. We will revise the abstract to include a concise reference to this protocol. Inter-annotator agreement statistics and quantitative label-accuracy measurements were not collected during dataset creation; the verification followed the multi-annotator protocol described in the main text. We can add a note on this if the referee considers it necessary. revision: yes
-
Referee: [Abstract] Abstract: baselines with deep learning models are asserted to be provided, yet the supplied text contains neither numerical performance figures, error analysis, nor comparisons against prior food datasets. Without these results it is impossible to gauge whether FoodX-251 poses a meaningfully harder or more representative benchmark.
Authors: The full manuscript reports baseline results with deep learning models, including numerical performance figures, in the experiments section, along with error analysis and comparisons to prior food datasets. We will revise the abstract to incorporate key numerical results and a brief statement on benchmark characteristics to better convey its utility. revision: yes
Circularity Check
No significant circularity; dataset release paper with no derivation chain
full rationale
The paper introduces FoodX-251 as a new dataset collected from the web with human verification. No mathematical derivations, equations, fitted parameters, or predictions are present. The central contribution is the data collection procedure and release itself, which does not reduce to any self-referential construction or self-citation load-bearing step. External use in the iFood-2019 challenge provides independent support. This is a standard non-circular dataset paper.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
Fitbit app. https://www.fitbit.com/app. Ac- cessed: 2017-11-14. 1
work page 2017
-
[2]
https://play.google.com/store/ apps/details?id=com.dietcoacher.sos
My diet coach. https://play.google.com/store/ apps/details?id=com.dietcoacher.sos. Ac- cessed: 2017-11-14. 1
work page 2017
-
[3]
Myfitnesspal. https://www.myfitnesspal.com. Accessed: 2017-11-14. 1
work page 2017
-
[4]
Segmentation and recognition of multi-food meal images for carbohydrate counting
Marios Anthimopoulos, Joachim Dehais, Peter Diem, and Stavroula Mougiakakou. Segmentation and recognition of multi-food meal images for carbohydrate counting. In BIBE, pages 1–4. IEEE, 2013. 2
work page 2013
-
[5]
Leveraging context to sup- port automated food recognition in restaurants
Vinay Bettadapura, Edison Thomaz, Aman Parnami, Gre- gory D Abowd, and Irfan Essa. Leveraging context to sup- port automated food recognition in restaurants. In WACV, pages 580–587. IEEE, 2015. 2
work page 2015
-
[6]
Natural lan- guage processing with Python: analyzing text with the natu- ral language toolkit
Steven Bird, Ewan Klein, and Edward Loper. Natural lan- guage processing with Python: analyzing text with the natu- ral language toolkit. ” O’Reilly Media, Inc.”, 2009. 3
work page 2009
-
[7]
Food-101–mining discriminative components with random forests
Lukas Bossard, Matthieu Guillaumin, and Luc Van Gool. Food-101–mining discriminative components with random forests. In ECCV, pages 446–461. Springer, 2014. 2, 3 6https://github.com/karansikka1/iFood 2019 7https://github.com/karansikka1/Foodx
work page 2014
-
[8]
Automatic chinese food identification and quantity estimation
Mei-Yun Chen, Yung-Hsiang Yang, Chia-Ju Ho, Shih-Han Wang, Shane-Ming Liu, Eugene Chang, Che-Hua Yeh, and Ming Ouhyoung. Automatic chinese food identification and quantity estimation. In SIGGRAPH Asia 2012 Technical Briefs, page 29. ACM, 2012. 2
work page 2012
-
[9]
Webly supervised learn- ing of convolutional networks
Xinlei Chen and Abhinav Gupta. Webly supervised learn- ing of convolutional networks. In ICCV, pages 1431–1439,
-
[10]
ChineseFoodNet: A large-scale Image Dataset for Chinese Food Recognition
Xin Chen, Yu Zhu, Hua Zhou, Liang Diao, and Dongyan Wang. Chinesefoodnet: A large-scale image dataset for chi- nese food recognition. arXiv preprint arXiv:1705.02743 ,
work page internal anchor Pith review Pith/arXiv arXiv
-
[11]
Rethinking the mobile food journal: Exploring op- portunities for lightweight photo-based capture
Felicia Cordeiro, Elizabeth Bales, Erin Cherry, and James Fogarty. Rethinking the mobile food journal: Exploring op- portunities for lightweight photo-based capture. In HFCS, pages 3207–3216. ACM, 2015. 1
work page 2015
-
[12]
J Deng, A Berg, S Satheesh, H Su, A Khosla, and L Fei-Fei. Ilsvrc-2012, 2012. URL http://www. image-net. org/challenges/LSVRC, 2012. 3
work page 2012
-
[13]
Retrieval and classi- fication of food images.Computers in biology and medicine, 77:23–39, 2016
Giovanni Maria Farinella, Dario Allegra, Marco Moltisanti, Filippo Stanco, and Sebastiano Battiato. Retrieval and classi- fication of food images.Computers in biology and medicine, 77:23–39, 2016. 2
work page 2016
-
[14]
Deep residual learning for image recognition
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, pages 770–778, 2016. 3
work page 2016
-
[15]
Image recog- nition of 85 food categories by feature fusion
Hajime Hoashi, Taichi Joutou, and Keiji Yanai. Image recog- nition of 85 food categories by feature fusion. In ISM, pages 296–301. IEEE, 2010. 2
work page 2010
-
[16]
A food image recognition system with multiple kernel learning
Taichi Joutou and Keiji Yanai. A food image recognition system with multiple kernel learning. In ICIP, pages 285–
-
[17]
Combining Weakly and Webly Supervised Learning for Classifying Food Images
Parneet Kaur, Karan Sikka, and Ajay Divakaran. Combining weakly and webly supervised learning for classifying food images. arXiv preprint arXiv:1712.08730, 2017. 2
work page internal anchor Pith review Pith/arXiv arXiv 2017
-
[18]
Automatic expansion of a food image dataset leveraging existing categories with domain adaptation
Yoshiyuki Kawano and Keiji Yanai. Automatic expansion of a food image dataset leveraging existing categories with domain adaptation. In ECCV, pages 3–17, 2014. 2
work page 2014
-
[19]
Adam: A Method for Stochastic Optimization
Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 ,
work page internal anchor Pith review Pith/arXiv arXiv
-
[20]
Microsoft coco: Common objects in context
Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Doll´ar, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In ECCV, pages 740–755. Springer, 2014. 1
work page 2014
-
[21]
Im2calories: towards an automated mobile vision food diary
Austin Meyers, Nick Johnston, Vivek Rathod, Anoop Korat- tikara, Alex Gorban, Nathan Silberman, Sergio Guadarrama, George Papandreou, Jonathan Huang, and Kevin P Murphy. Im2calories: towards an automated mobile vision food diary. In ICCV, pages 1233–1241, 2015. 1, 2
work page 2015
-
[22]
Nutrinet: a deep learning food and drink image recognition system for dietary assessment
Simon Mezgec and Barbara Korou ˇsi´c Seljak. Nutrinet: a deep learning food and drink image recognition system for dietary assessment. Nutrients, 9(7):657, 2017. 2
work page 2017
-
[23]
Wordnet: a lexical database for english
George A Miller. Wordnet: a lexical database for english. Communications of the ACM, 38(11):39–41, 1995. 3
work page 1995
-
[24]
Recognition and volume estimation of food intake using a mobile device
Manika Puri, Zhiwei Zhu, Qian Yu, Ajay Divakaran, and Harpreet Sawhney. Recognition and volume estimation of food intake using a mobile device. In WACV, pages 1–8. IEEE, 2009. 1
work page 2009
-
[25]
Training Convolutional Networks with Noisy Labels
Sainbayar Sukhbaatar and Rob Fergus. Learning from noisy labels with deep neural networks. arXiv preprint arXiv:1406.2080, 2(3):4, 2014. 2
work page internal anchor Pith review Pith/arXiv arXiv 2080
-
[26]
Recipe recognition with large multi- modal food dataset
Xin Wang, Devinder Kumar, Nicolas Thome, Matthieu Cord, and Frederic Precioso. Recipe recognition with large multi- modal food dataset. In ICMEW, pages 1–6. IEEE, 2015. 2
work page 2015
-
[27]
Annotating images by mining image search results
Xin-Jing Wang, Lei Zhang, Xirong Li, and Wei-Ying Ma. Annotating images by mining image search results. TPAMI, 30(11):1919–1932, 2008. 2
work page 1919
-
[28]
snap-n-eat food recognition and nu- trition estimation on a smartphone
Weiyu Zhang, Qian Yu, Behjat Siddiquie, Ajay Divakaran, and Harpreet Sawhney. snap-n-eat food recognition and nu- trition estimation on a smartphone. JDST, 9(3):525–533,
-
[29]
Places: A 10 million image database for scene recognition
Bolei Zhou, Agata Lapedriza, Aditya Khosla, Aude Oliva, and Antonio Torralba. Places: A 10 million image database for scene recognition. TPAMI, 2017. 1
work page 2017
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.