
arxiv: 2504.06925 · v1 · pith:7IGMHAM7new · submitted 2025-04-09 · 💻 cs.CV · cs.AI

Are Vision-Language Models Ready for Dietary Assessment? Exploring the Next Frontier in AI-Powered Food Image Recognition

Pith reviewed 2026-05-22 19:57 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords vision-language models · food image recognition · dietary assessment · Expert-Weighted Recall · FoodNExTDB · closed-source models · fine-grained classification

The pith

Closed-source vision-language models reach over 90 percent expert-weighted recall on single-product food images and outperform open-source alternatives.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests six vision-language models on food recognition tasks that would matter for automatic dietary assessment from photos. It introduces the FoodNExTDB database of 9,263 expert-labeled images spanning categories, subcategories, and cooking styles, plus fifty thousand nutritional annotations from seven experts. The authors also define an Expert-Weighted Recall metric that adjusts scores for differences among those annotators. Results indicate closed-source models handle simple single-item cases well while all models still falter on fine details such as cooking methods or visually similar foods.

Core claim

Closed-source models such as ChatGPT, Gemini, and Claude achieve over 90 percent EWR when identifying food products in single-item images, whereas open-source models lag. The evaluation rests on the new FoodNExTDB collection and on the EWR metric, which incorporates inter-annotator variability. Current VLMs still struggle with fine-grained recognition of cooking styles and similar-looking items, which rules out their immediate use for reliable automatic dietary assessment.

What carries the argument

The FoodNExTDB database of expert-annotated images together with the Expert-Weighted Recall metric that accounts for annotator differences when scoring model outputs at multiple levels of food detail.
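The exact EWR formula is not reproduced on this page, so the following is only a minimal sketch of one plausible way to weight recall by expert agreement. It assumes each label's weight is the fraction of experts who assigned it; the function names and the toy labels are hypothetical and are not taken from the paper.

```python
from collections import Counter


def expert_weights(expert_labels):
    """Weight each label by the fraction of experts who assigned it (assumed scheme)."""
    n_experts = len(expert_labels)
    counts = Counter(label for labels in expert_labels for label in set(labels))
    return {label: count / n_experts for label, count in counts.items()}


def weighted_recall(predicted, expert_labels):
    """Share of the total expert weight that the predicted labels recover."""
    weights = expert_weights(expert_labels)
    total = sum(weights.values())
    if total == 0:
        return 0.0
    hit = sum(w for label, w in weights.items() if label in predicted)
    return hit / total


def plain_recall(predicted, expert_labels):
    """Unweighted recall against the union of all expert labels."""
    union = set().union(*expert_labels)
    if not union:
        return 0.0
    return len(union & set(predicted)) / len(union)


# Toy example (made-up labels): three experts, one of whom also marks "sauce".
experts = [{"rice", "chicken"}, {"rice", "chicken"}, {"rice", "chicken", "sauce"}]
prediction = {"rice", "chicken"}

print(round(weighted_recall(prediction, experts), 3))  # 0.857
print(round(plain_recall(prediction, experts), 3))     # 0.667
```

Under this assumed scheme, missing a label that only one of three experts gave costs less than missing a label all experts agree on, which is the kind of behavior a metric that accounts for annotator differences would be expected to show.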

If this is right

  • VLMs could already support basic dietary logging tools when images contain only one clear food item.
  • Further work on distinguishing cooking styles and similar foods would be required before broader reliability.
  • The public FoodNExTDB collection gives other teams a shared benchmark for testing new models.
  • The performance difference between closed and open models suggests practical choices in building nutrition applications.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Apps could pair closed-source VLMs with simple user confirmation for plates containing multiple foods.
  • The closed-source advantage may affect how widely accessible AI dietary tools become in the near term.
  • Extending the same evaluation to images from varied lighting or cultural cuisines would test whether the observed limits persist.

Load-bearing premise

The FoodNExTDB database with its expert annotations and the Expert-Weighted Recall metric form a valid and representative benchmark for how vision-language models would perform in real dietary assessment.

What would settle it

A follow-up test on the same models using non-expert labels or everyday multi-item photos that produces substantially lower EWR scores would show the benchmark overstates practical performance.
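A minimal sketch of how such a follow-up comparison could be quantified, assuming per-image EWR scores are available for both conditions. The bootstrap procedure, the function name, and every number below are illustrative assumptions, not results or methods from the paper.

```python
import random
from statistics import mean


def bootstrap_mean_drop(baseline, followup, n_resamples=2000, seed=0):
    """Mean score drop (baseline minus follow-up) with a 95% percentile bootstrap interval."""
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    drops = []
    for _ in range(n_resamples):
        b = rng.choices(baseline, k=len(baseline))
        f = rng.choices(followup, k=len(followup))
        drops.append(mean(b) - mean(f))
    drops.sort()
    lower = drops[int(0.025 * n_resamples)]
    upper = drops[int(0.975 * n_resamples) - 1]
    return mean(baseline) - mean(followup), (lower, upper)


# Made-up per-image scores, purely for illustration.
single_item_scores = [0.90, 0.95, 0.92, 0.88, 0.93]
multi_item_scores = [0.60, 0.70, 0.65, 0.55, 0.62]

drop, (lower, upper) = bootstrap_mean_drop(single_item_scores, multi_item_scores)
print(f"mean drop = {drop:.3f}, 95% interval = ({lower:.3f}, {upper:.3f})")
```

An interval that stays well above zero would support the reading that the benchmark overstates practical performance; an interval that includes zero would not.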

Figures

Figures reproduced from arXiv: 2504.06925 by Aythami Morales, Blanca Lacruz-Pleguezuelos, Enrique Carrillo de Santa Pau, Guadalupe X. Bazán, Isabel Espinosa-Salinas, Javier Ortega-Garcia, Julian Fierrez, Laura Judith Marcos Zambrano, Ruben Tolosana, Sergio Romero-Tapiador.

Figure 1: Overview of the proposed framework. (A) The FoodNExTDB consists of 9,263 food images labeled by nutrition experts across …
Figure 2: Illustration of the proposed Expert-Weighted Recall (EWR) computation for a food image …
Figure 3: Examples of VLMs predictions compared to nutritionist’s annotations. (A) A multi-component dish where some experts identify …
Figure 4: Radar charts illustrating VLM performance in fine-grained food recognition. We include some examples of all available classes …
Original abstract

Automatic dietary assessment based on food images remains a challenge, requiring precise food detection, segmentation, and classification. Vision-Language Models (VLMs) offer new possibilities by integrating visual and textual reasoning. In this study, we evaluate six state-of-the-art VLMs (ChatGPT, Gemini, Claude, Moondream, DeepSeek, and LLaVA), analyzing their capabilities in food recognition at different levels. For the experimental framework, we introduce the FoodNExTDB, a unique food image database that contains 9,263 expert-labeled images across 10 categories (e.g., "protein source"), 62 subcategories (e.g., "poultry"), and 9 cooking styles (e.g., "grilled"). In total, FoodNExTDB includes 50k nutritional labels generated by seven experts who manually annotated all images in the database. Also, we propose a novel evaluation metric, Expert-Weighted Recall (EWR), that accounts for the inter-annotator variability. Results show that closed-source models outperform open-source ones, achieving over 90% EWR in recognizing food products in images containing a single product. Despite their potential, current VLMs face challenges in fine-grained food recognition, particularly in distinguishing subtle differences in cooking styles and visually similar food items, which limits their reliability for automatic dietary assessment. The FoodNExTDB database is publicly available at https://github.com/AI4Food/FoodNExtDB.
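The three-level labeling described in the abstract (category, subcategory, cooking style) can be pictured as a small data structure. This is a hedged sketch only: the class name, the validation logic, and the level-wise matching rule are assumptions, and the taxonomy fragment contains just the three example values quoted in the abstract, not the full 10/62/9 label sets.

```python
from dataclasses import dataclass
from typing import Optional

# Illustrative fragment: only the example values quoted in the abstract.
TAXONOMY = {"protein source": {"poultry"}}
COOKING_STYLES = {"grilled"}
LEVELS = ("category", "subcategory", "cooking_style")


@dataclass(frozen=True)
class FoodLabel:
    category: str
    subcategory: Optional[str] = None
    cooking_style: Optional[str] = None

    def is_valid(self) -> bool:
        """Check the label against the (partial, illustrative) taxonomy."""
        if self.category not in TAXONOMY:
            return False
        if self.subcategory is not None and self.subcategory not in TAXONOMY[self.category]:
            return False
        if self.cooking_style is not None and self.cooking_style not in COOKING_STYLES:
            return False
        return True

    def matches(self, other: "FoodLabel", level: str) -> bool:
        """Assumed rule: a match at a level requires agreement at that level and all coarser ones."""
        depth = LEVELS.index(level) + 1
        return all(getattr(self, name) == getattr(other, name) for name in LEVELS[:depth])


expert = FoodLabel("protein source", "poultry", "grilled")
model = FoodLabel("protein source", "poultry", "fried")

print(expert.matches(model, "subcategory"))    # True
print(expert.matches(model, "cooking_style"))  # False
```

Scoring at each level separately makes it visible when a model gets the food right but the cooking style wrong, which is the failure mode the abstract highlights.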

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces FoodNExTDB, a dataset of 9,263 expert-annotated food images spanning 10 categories, 62 subcategories, and 9 cooking styles, along with 50k nutritional labels from seven experts. It evaluates six VLMs (ChatGPT, Gemini, Claude, Moondream, DeepSeek, LLaVA) on food recognition using a new Expert-Weighted Recall (EWR) metric designed to account for inter-annotator variability, reporting that closed-source models achieve over 90% EWR on single-product images while noting limitations in fine-grained recognition of cooking styles and similar items.

Significance. The public release of FoodNExTDB and the introduction of the EWR metric are concrete contributions to benchmarking VLMs for food image tasks. If the single-product results generalize and EWR correlates with downstream dietary assessment utility, the work could help identify gaps in current models. However, the restriction to single-product images and the absence of external validation against nutrient estimation errors on multi-item meals reduce the immediate applicability to real-world dietary assessment.

major comments (3)
  1. [Abstract] The headline result (>90% EWR for closed-source models) is reported only for single-product images (Abstract), yet the introduction frames the study as addressing automatic dietary assessment, which typically involves multi-item plates; no results or analysis on multi-product images are described to support the readiness claim.
  2. [Abstract] The EWR metric is presented as accounting for inter-annotator variability (Abstract), but no formula, weighting scheme, or comparison to standard recall is provided; without this, it is unclear whether EWR provides a meaningfully different or more robust evaluation than conventional metrics.
  3. [Experimental framework] The dataset contains 9,263 images across 10 categories with expert annotations, but the evaluation is restricted to single-product subsets without reported data splits, error analysis by category, or correlation of EWR scores with actual nutrient intake prediction error on held-out real meals.
minor comments (2)
  1. [Abstract] The abstract lists example categories and subcategories but does not include a summary table of image counts per category or inter-annotator agreement statistics; adding this would improve clarity of the dataset contribution.
  2. [Abstract] The GitHub link for FoodNExTDB is provided, but the manuscript does not specify the exact license or any usage restrictions for the 50k nutritional labels.

Simulated Author's Rebuttal

3 responses · 1 unresolved

We thank the referee for their constructive comments. We address each major comment below and indicate planned revisions.

Point-by-point responses
  1. Referee: [Abstract] The headline result (>90% EWR for closed-source models) is reported only for single-product images (Abstract), yet the introduction frames the study as addressing automatic dietary assessment, which typically involves multi-item plates; no results or analysis on multi-product images are described to support the readiness claim.

    Authors: We agree that real-world dietary assessment typically involves multi-item plates. Our study deliberately focuses on single-product images to provide a controlled benchmark of VLM food recognition capabilities. We will revise the abstract to explicitly qualify the >90% EWR result as applying to single-product images and expand the introduction to discuss the gap for multi-item scenarios without overstating readiness for full dietary assessment. revision: partial

  2. Referee: [Abstract] The EWR metric is presented as accounting for inter-annotator variability (Abstract), but no formula, weighting scheme, or comparison to standard recall is provided; without this, it is unclear whether EWR provides a meaningfully different or more robust evaluation than conventional metrics.

    Authors: The EWR formula and weighting scheme based on inter-annotator agreement are defined in the Methods section. We will add a brief description of the EWR formula and a direct comparison to standard recall in the abstract and results to clarify its advantages. revision: yes

  3. Referee: [Experimental framework] The dataset contains 9,263 images across 10 categories with expert annotations, but the evaluation is restricted to single-product subsets without reported data splits, error analysis by category, or correlation of EWR scores with actual nutrient intake prediction error on held-out real meals.

    Authors: We will report data splits and include error analysis by category in the revised experimental section. Correlation of EWR with nutrient intake prediction error on held-out real meals is not performed in this work, as it would require additional multi-item meal data and downstream nutrient validation experiments outside the current benchmarking scope. revision: partial

standing simulated objections not resolved
  • Correlation of EWR scores with actual nutrient intake prediction error on held-out real meals

Circularity Check

0 steps flagged

No significant circularity: purely empirical evaluation against external expert labels

full rationale

This is an empirical evaluation study that introduces the FoodNExTDB dataset with 50k expert-generated nutritional labels and defines the EWR metric to account for inter-annotator variability. Performance results (e.g., >90% EWR for closed-source VLMs on single-product images) are computed directly by comparing model outputs to the independent expert annotations. There are no mathematical derivations, fitted parameters renamed as predictions, self-citation load-bearing premises, uniqueness theorems, or ansatzes smuggled via citation. The work is self-contained against external benchmarks with no reduction of claims to the paper's own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claims rest on the validity of expert annotations as ground truth and the appropriateness of the new EWR metric for dietary assessment evaluation.

axioms (1)
  • Domain assumption: manual annotations by seven experts provide accurate and consistent labels for food categories, subcategories, cooking styles, and nutritional information.
    The entire evaluation framework and EWR metric depend on these labels serving as reliable ground truth.
invented entities (1)
  • Expert-Weighted Recall (EWR) metric (no independent evidence)
    purpose: To evaluate model performance while accounting for inter-annotator variability among experts.
    This is a novel metric introduced in the paper to handle variability in expert labels.
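Because the ledger's single axiom is that expert labels are consistent, one way to probe it is to measure agreement between annotators directly. The sketch below uses mean pairwise Jaccard overlap between expert label sets; this choice of statistic, the function names, and the toy labels are assumptions and are not drawn from the paper.

```python
from itertools import combinations


def jaccard(a, b):
    """Overlap between two label sets; two empty sets count as full agreement."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def mean_pairwise_agreement(expert_labels):
    """Average Jaccard overlap over every pair of experts for one image."""
    pairs = list(combinations(expert_labels, 2))
    if not pairs:
        return 1.0
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


# Toy example (made-up labels): two experts agree, a third adds "sauce".
image_labels = [{"rice", "chicken"}, {"rice", "chicken"}, {"rice", "chicken", "sauce"}]
print(round(mean_pairwise_agreement(image_labels), 3))  # 0.778
```

Low agreement on a subset of images would indicate where the ground truth itself is uncertain, independent of any model's output.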


