FoodCHA: Multi-Modal LLM Agent for Fine-Grained Food Analysis

Onat Gungor; Pranav Mekkoth; Tajana Rosing; Woojin Lee; Ye Tian

arxiv: 2605.05499 · v1 · submitted 2026-05-06 · 💻 cs.AI

Woojin Lee, Pranav Mekkoth, Ye Tian, Onat Gungor, Tajana Rosing

Pith reviewed 2026-05-08 16:07 UTC · model grok-4.3

classification 💻 cs.AI

keywords: food recognition · multimodal agent · fine-grained classification · hierarchical decision making · vision language model · cooking style · dietary monitoring

The pith

Hierarchical anchoring lets a compact 2B vision model beat larger ones on food subcategory and cooking style tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents FoodCHA as a way to break food recognition into chained decisions rather than a single forward pass. It first places an image into a broad category, then narrows to the subcategory using that category as an anchor, and finally assigns a cooking style conditioned on the subcategory. The goal is to reduce the inconsistent or non-canonical labels that open-ended vision-language models often produce when images show multiple items or visually similar foods. The framework runs this chain on the compact Moondream-2B model and reports precision gains over the 11B-parameter Food-Llama-3.2-11B baseline on the FoodNExTDB dataset, with the largest gain on cooking style, the most fine-grained attribute. If the chaining works as described, fine-grained food analysis becomes feasible without large models or heavy compute.
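To make the non-canonical-label problem concrete, here is a minimal sketch of mapping free-text model output onto a closed label set. It is not taken from the paper: the function name `canonicalize`, the example labels, and the similarity cutoff are all assumptions made for illustration, and the matching uses Python's standard `difflib` rather than anything FoodCHA is known to use.

```python
import difflib
from typing import List, Optional


def canonicalize(raw: str, labels: List[str], cutoff: float = 0.6) -> Optional[str]:
    """Map a free-text answer onto one of the allowed labels, or None if nothing is close.

    `labels` is a hypothetical closed vocabulary; `cutoff` is an assumed threshold.
    """
    cleaned = raw.strip().lower()
    lowered = {label.lower(): label for label in labels}
    # An exact (case-insensitive) hit needs no fuzzy matching.
    if cleaned in lowered:
        return lowered[cleaned]
    # Otherwise take the single closest label above the similarity cutoff.
    match = difflib.get_close_matches(cleaned, list(lowered), n=1, cutoff=cutoff)
    return lowered[match[0]] if match else None
```

Returning `None` instead of guessing keeps unmatched answers visible, so they can be counted as errors or re-asked rather than silently mislabeled.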

Core claim

FoodCHA reformulates food recognition as a hierarchical decision-making process. By progressively anchoring predictions, FoodCHA guides subcategory identification using high-level categories and guides cooking style recognition using subcategories, improving semantic consistency and attribute-level discrimination. It utilizes the compact Moondream-2B vision language model to achieve higher precision than larger models on category, subcategory, and cooking style tasks.

What carries the argument

The progressive anchoring mechanism that chains high-level category predictions to guide and constrain subcategory and cooking-style classifications.
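The chaining idea can be sketched as follows. This is an illustrative toy, not the authors' implementation: the `ONTOLOGY` contents, the prompt wording, the fallback rule in `pick`, and the `fake_model` stand-in are all hypothetical, and a real system would replace `fake_model` with a call to a vision-language model.

```python
from typing import Callable, Dict, List

# Hypothetical ontology; the actual FoodNExTDB label hierarchy is not given here.
ONTOLOGY: Dict[str, Dict[str, List[str]]] = {
    "protein": {
        "poultry": ["grilled", "fried", "roasted"],
        "fish": ["grilled", "fried", "raw"],
    },
    "vegetables": {
        "leafy greens": ["raw", "boiled"],
        "root vegetables": ["boiled", "roasted", "fried"],
    },
}

# A chooser takes (image, question, options) and returns a free-text answer.
Chooser = Callable[[str, str, List[str]], str]


def pick(choose: Chooser, image: str, question: str, options: List[str]) -> str:
    """Ask the model, then force the answer into the allowed option list."""
    answer = choose(image, question, options).strip().lower()
    # Assumed fallback: use the first option when the answer is non-canonical.
    return answer if answer in options else options[0]


def analyze(image: str, choose: Chooser) -> Dict[str, str]:
    """Run category -> subcategory -> cooking style, each step anchored on the last."""
    category = pick(choose, image, "Which food category is shown?", list(ONTOLOGY))
    subcategories = list(ONTOLOGY[category])
    subcategory = pick(
        choose, image, f"The category is {category}. Which subcategory?", subcategories
    )
    styles = ONTOLOGY[category][subcategory]
    style = pick(
        choose, image, f"The food is {subcategory}. Which cooking style?", styles
    )
    return {"category": category, "subcategory": subcategory, "cooking_style": style}


def fake_model(image: str, question: str, options: List[str]) -> str:
    """Stand-in for a vision-language model; always returns the last option, upper-cased."""
    return options[-1].upper()
```

Because each step only offers the options that sit under the previous answer, the output is always a valid path through the ontology. The same property is what makes an early mistake costly: a wrong category removes the correct subcategory from the option list entirely.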

If this is right

Category recognition precision rises 13.8 percent over the Food-Llama-3.2-11B baseline.
Subcategory recognition precision rises 38.2 percent.
Cooking style classification precision rises 153.2 percent.
The approach stays practical on devices because it uses a 2B-parameter model with lower memory and compute needs.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same anchoring pattern could be tested on other fine-grained domains such as plant identification or clothing attributes where broad labels help disambiguate details.
If early-step errors do not compound, the method offers a route to high-precision specialized agents without scaling model size.
Mobile dietary apps could incorporate the chain for real-time multi-item meal logging with consistent style and ingredient tags.

Load-bearing premise

That correct high-level category predictions will reliably improve lower-level accuracy without passing on early mistakes to the rest of the chain.

What would settle it

Measuring whether subcategory and cooking-style accuracy collapses on the subset of test images where the initial high-level category is wrong.
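A rough sketch of that measurement, assuming one ground-truth label per level per image (a simplification, since meal images may contain several items). The record keys and function names are hypothetical, and accuracy is used here in place of the paper's precision metric.

```python
from typing import Dict, List, Optional

# Each record holds labels for one image; the key names are assumed for illustration.
Record = Dict[str, str]


def accuracy(records: List[Record], true_key: str, pred_key: str) -> Optional[float]:
    """Fraction of records where prediction equals ground truth; None if empty."""
    if not records:
        return None
    hits = sum(r[true_key] == r[pred_key] for r in records)
    return hits / len(records)


def split_by_anchor(records: List[Record]) -> Dict[str, object]:
    """Compare subcategory accuracy when the category anchor was right vs wrong."""
    right = [r for r in records if r["true_cat"] == r["pred_cat"]]
    wrong = [r for r in records if r["true_cat"] != r["pred_cat"]]
    return {
        "n_anchor_correct": len(right),
        "n_anchor_wrong": len(wrong),
        "sub_acc_anchor_correct": accuracy(right, "true_sub", "pred_sub"),
        "sub_acc_anchor_wrong": accuracy(wrong, "true_sub", "pred_sub"),
    }
```

A large gap between the two subcategory accuracies would indicate that errors propagate down the chain; similar values would suggest the later stages are not relying heavily on the anchor.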

Figures

Figures reproduced from arXiv: 2605.05499 by Onat Gungor, Pranav Mekkoth, Tajana Rosing, Woojin Lee, Ye Tian.

**Figure 1.** Overview of the FoodCHA framework.

**Figure 2.** System-level pipeline of FoodCHA. An input image is processed by a backbone model and passed to an agent that …

**Figure 4.** Cooking-style analysis. Recall and EWR for each cooking-style label. Panel title: Cooking Style Performance.

Original abstract

The widespread adoption of camera-equipped mobile devices and wearables has enabled convenient capture of meal images, making food recognition a key component for real-time dietary monitoring. However, real-world food images present challenges due to high intra-class similarity and the frequent presence of multiple food items within a single image. While deep learning models achieve strong performance in coarse-grained classification, they often struggle to capture fine-grained attributes such as cooking style. Moreover, open-ended generation in modern vision-language models can produce non-canonical labels, limiting their practical deployment. We propose FoodCHA, a multimodal agentic framework that reformulates food recognition as a hierarchical decision-making process. By progressively anchoring predictions, FoodCHA guides subcategory identification using high-level categories and guides cooking style recognition using subcategories, improving semantic consistency and attribute-level discrimination. To ensure practical deployability, FoodCHA utilizes the compact Moondream-2B vision-language model, which provides strong reasoning capability while maintaining lower computational and memory overhead. Experiments on FoodNExTDB show that FoodCHA outperforms Food-Llama-3.2-11B by 13.8% and 38.2% in category and subcategory recognition precision, respectively, and achieves a striking 153.2% improvement in cooking style classification precision.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

FoodCHA stages food recognition hierarchically with a 2B VLM and claims large gains over an 11B baseline, but the abstract leaves the gains and error handling unverified.

The main point is that this paper turns food image analysis into a three-stage agent process: first category, then subcategory guided by the category, then cooking style guided by the subcategory. The authors run it on the compact Moondream-2B model instead of a bigger one and report precision gains on FoodNExTDB over Food-Llama-3.2-11B, the largest being the 153.2% gain on cooking style. That hierarchical anchoring is the concrete new piece they add to existing multimodal LLM work for this domain.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces FoodCHA, a multi-modal agentic framework that reformulates food recognition as a hierarchical decision-making process. Using the compact Moondream-2B vision-language model, it progressively anchors subcategory identification to high-level category predictions and cooking-style recognition to subcategory predictions. Experiments on the FoodNExTDB dataset claim precision improvements of 13.8% in category recognition, 38.2% in subcategory recognition, and 153.2% in cooking-style classification over the Food-Llama-3.2-11B baseline.

Significance. If the reported gains are substantiated with full experimental details, ablations, and error analysis, the work could meaningfully advance practical, low-overhead systems for real-time dietary monitoring from meal images. The emphasis on hierarchical consistency and deployability with a 2B-scale model addresses key limitations of open-ended VLM generation and intra-class similarity in food imagery.

major comments (3)

[Abstract] The central performance claims (13.8%, 38.2%, 153.2% precision gains) are stated without baseline absolute scores, statistical tests, dataset statistics, or any error analysis, leaving the empirical support for the hierarchical framework unverifiable from the provided text.
[Abstract / Experiments] The hierarchical anchoring mechanism (high-level categories guiding subcategories, which then guide cooking styles) is load-bearing for the claimed semantic consistency gains, yet no per-stage accuracy breakdowns, error-propagation analysis, or ablation isolating the anchoring effect from base model capability are supplied. This directly engages the risk that initial errors from the 2B Moondream model systematically bias downstream stages.
[Experiments] No comparison is provided between FoodCHA and a non-hierarchical version of the same Moondream-2B model, making it impossible to attribute the reported gains specifically to the agentic hierarchical process rather than differences in model scale or prompting.
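One way a same-model comparison of this kind could be analyzed is with a paired bootstrap over images, sketched below. This is an assumption about how such a test might be set up, not something the paper describes: the function name, the per-image 0/1 correctness inputs, and the resampling settings are all hypothetical, and per-image correctness is a simplification of the precision metric the paper reports.

```python
import random
from typing import List, Tuple


def paired_bootstrap_ci(
    hier: List[int],
    flat: List[int],
    n_resamples: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float]:
    """Percentile confidence interval for mean(hier) - mean(flat) on the same images.

    `hier[i]` and `flat[i]` are 1 if the hierarchical / flat system was correct
    on image i, else 0. Both lists must cover the same images in the same order.
    """
    if len(hier) != len(flat) or not hier:
        raise ValueError("inputs must be non-empty and the same length")
    rng = random.Random(seed)
    n = len(hier)
    diffs = []
    for _ in range(n_resamples):
        # Resample image indices with replacement, keeping the pairing intact.
        idx = [rng.randrange(n) for _ in range(n)]
        diffs.append(sum(hier[i] - flat[i] for i in idx) / n)
    diffs.sort()
    lo = diffs[int((alpha / 2) * n_resamples)]
    hi = diffs[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return lo, hi
```

If the resulting interval excludes zero, the gain from hierarchical prompting is unlikely to be explained by sampling noise alone on that test set; it says nothing about differences in prompt wording, which would need to be controlled separately.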

minor comments (2)

[Abstract] The FoodNExTDB dataset is referenced without citation, size, class distribution, or image characteristics, which are needed to contextualize the results.
[Abstract] The phrase 'striking 153.2% improvement' should be clarified as relative versus absolute gain and accompanied by the corresponding baseline precision value.
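The distinction between relative and absolute gain can be illustrated with a small calculation. The helper names and the baseline value used in the usage note are hypothetical and chosen only for the arithmetic; they are not figures from the paper.

```python
def relative_gain_pct(baseline: float, new: float) -> float:
    """Relative improvement of `new` over `baseline`, in percent."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return 100.0 * (new - baseline) / baseline


def implied_new_precision(baseline: float, gain_pct: float) -> float:
    """Precision implied by applying a relative gain (in percent) to `baseline`."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return baseline * (1.0 + gain_pct / 100.0)
```

For example, with a hypothetical baseline precision of 0.10, a 153.2% relative gain corresponds to a new precision of about 0.2532, an absolute gain of roughly 15 percentage points. Since precision cannot exceed 1, a 153.2% gain can only be relative, and it implies a baseline below about 0.395 (1 / 2.532), which is why the baseline value matters for interpreting the claim.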

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment point by point below. Where the comments identify gaps in verifiability or attribution, we have revised the manuscript to incorporate the requested details, ablations, and analyses.

Referee: [Abstract] The central performance claims (13.8%, 38.2%, 153.2% precision gains) are stated without baseline absolute scores, statistical tests, dataset statistics, or any error analysis, leaving the empirical support for the hierarchical framework unverifiable from the provided text.

Authors: We agree that the abstract's brevity limits immediate verifiability. In the revised manuscript we have updated the abstract to report the absolute precision scores for both Food-Llama-3.2-11B and FoodCHA on each task, added the key dataset statistics (number of images, categories, subcategories, and cooking styles in FoodNExTDB), and included a reference to the statistical significance testing performed. A dedicated error analysis subsection has also been added to the Experiments section.
revision: yes

Referee: [Abstract / Experiments] The hierarchical anchoring mechanism (high-level categories guiding subcategories, which then guide cooking styles) is load-bearing for the claimed semantic consistency gains, yet no per-stage accuracy breakdowns, error-propagation analysis, or ablation isolating the anchoring effect from base model capability are supplied. This directly engages the risk that initial errors from the 2B Moondream model systematically bias downstream stages.

Authors: The referee correctly highlights the importance of demonstrating the hierarchical mechanism's contribution. We have added per-stage accuracy breakdowns (category, subcategory, and cooking-style) in a new table, together with an explicit error-propagation analysis that measures how anchoring reduces downstream error rates relative to independent stage predictions. An ablation isolating the anchoring effect (full FoodCHA versus the same Moondream-2B model without hierarchical guidance) is now included in Section 4.3.
revision: yes

Referee: [Experiments] No comparison is provided between FoodCHA and a non-hierarchical version of the same Moondream-2B model, making it impossible to attribute the reported gains specifically to the agentic hierarchical process rather than differences in model scale or prompting.

Authors: We acknowledge that a same-model non-hierarchical baseline is the most direct way to isolate the agentic contribution. While the original submission emphasized comparison against a larger model to underscore deployability, the revised Experiments section now includes a direct ablation of FoodCHA against a flat (non-hierarchical) prompting baseline that uses identical Moondream-2B weights and similar prompting style. This new comparison shows that the hierarchical decision process yields measurable gains beyond base-model capability and prompting differences alone.
revision: yes

Circularity Check

0 steps flagged

No circularity: empirical evaluation on external dataset against baseline

full rationale

The paper proposes FoodCHA as a hierarchical agentic framework that reformulates food recognition via progressive anchoring of subcategory and cooking-style predictions using high-level categories from Moondream-2B. Performance claims consist of direct experimental comparisons on the named FoodNExTDB dataset against an external baseline model (Food-Llama-3.2-11B), reporting specific precision gains. No mathematical derivations, equations, fitted parameters presented as predictions, self-referential definitions, or load-bearing self-citations appear in the described method or results. The central claims rest on external benchmarks and a public dataset rather than reducing to the framework's own inputs by construction, rendering the evaluation self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim depends on the unproven effectiveness of the hierarchical anchoring mechanism and the reasoning capability of the chosen compact model; no explicit free parameters are stated, but implicit assumptions about model behavior and dataset representativeness are required.

axioms (1)

domain assumption: The Moondream-2B vision-language model possesses sufficient reasoning capability to perform accurate hierarchical food attribute recognition when guided by category anchors.
Invoked to justify deployment of the compact model for the full pipeline.

invented entities (1)

FoodCHA hierarchical agentic framework (no independent evidence)
purpose: To reformulate food recognition as progressive subcategory and attribute identification.
The proposed system itself, with no independent evidence of correctness beyond the reported experiments.

Reference graph

Works this paper leans on

48 extracted references · 6 canonical work pages

[1] Mahyar Abbasian, Iman Azimi, Amir M Rahmani, and Ramesh Jain. Conversational health agents: A personalized llm-powered agent framework. arXiv preprint arXiv:2310.02374, 2023.
[2] Mahyar Abbasian, Iman Azimi, Amir M Rahmani, and Ramesh Jain. Conversational health agents: a personalized large language model-powered agent framework. JAMIA open, 8(4):ooaf067, 2025.
[3] Rahib Abiyev and Joseph Adepoju. Automatic food recognition using deep convolutional neural networks with self-attention mechanism. Human-Centric Intelligent Systems, 4(1):171–186, 2024.
[4] AdaptLLM. Adaptllm/food-llama-3.2-11b-vision-instruct. https://huggingface.co/AdaptLLM/food-Llama-3.2-11B-Vision-Instruct, 2025. Hugging Face model card, accessed 2026-02-24.
[5] Dario Allegra, Sebastiano Battiato, Alessandro Ortis, Salvatore Urso, and Riccardo Polosa. A review on food recognition technology for health applications. Health psychology research, 8(3):9297, 2020.
[6] Chioma Virginia Anikwe, Henry Friday Nweke, Anayo Chukwu Ikegwu, Chukwunonso Adolphus Egwuonwu, Fergus Uchenna Onu, Uzoma Rita Alo, and Ying Wah Teh. Mobile and wearable sensors for data-driven health monitoring system: State-of-the-art and future prospect. Expert Systems with Applications, 202:117362, 2022.
[7] Aritra Bhowmik, Mohammad Mahdi Derakhshani, Dennis Koelma, Yuki M Asano, Martin R Oswald, and Cees GM Snoek. Twist & scout: Grounding multimodal llm-experts by forget-free tuning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 1359–1368, 2025.
[8] Chi-Sheng Chen, Guan-Ying Chen, Dong Zhou, Di Jiang, and Dai-Shi Chen. Res-vmamba: Fine-grained food category visual classification using selective state space models with deep residual learning. arXiv preprint arXiv:2402.15761, 2024.
[9] Daixuan Cheng, Shaohan Huang, and Furu Wei. Adapting large language models to domains via reading comprehension. arXiv preprint arXiv:2309.09530, 2023.
[10] Daixuan Cheng, Shaohan Huang, Ziyu Zhu, Xintong Zhang, Wayne Xin Zhao, Zhongzhi Luan, Bo Dai, and Zhenliang Zhang. On domain-adaptive post-training for multimodal large language models, 2024.
[11] Stergios Christodoulidis, Marios Anthimopoulos, and Stavroula Mougiakakou. Food recognition for dietary assessment using deep convolutional neural networks. In International conference on image analysis and processing, pages 458–465. Springer, 2015.
[12] Florin Dumitrescu, Adina Magda Florea, Mihai Trăscău, and Alexandru Sorici. Intelligent agent for food recognition in a smart fridge. In 2022 24th International Symposium on Symbolic and Numeric Algorithms for Scientific Computing (SYNASC), pages 220–225. IEEE, 2022.
[13] Pedro Furtado, Manuel Caldeira, and Pedro Martins. Human visual system vs convolution neural networks in food recognition task: An empirical comparison. Computer Vision and Image Understanding, 191:102878, 2020.
[14] Tonmoy Ghosh and Edward Sazonov. Improving food image recognition with noisy vision transformer. arXiv preprint arXiv:2503.18997, 2025.
[15] Ziyu Guo, Yong Yin, Haolin Gu, Guihua Peng, Xueya Wang, Ju Chen, and Jia Yan. An integrated lightweight neural network design and fpga-accelerated edge computing for chili pepper variety and origin identification via an e-nose. Foods, 14(15):2612, 2025.
[16] Jie Hu, Li Shen, and Gang Sun. Squeeze-and-excitation networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 7132–7141, 2018.
[17] BN Jagadesh, Srihari Varma Mantena, Asha P Sathe, T Prabhakara Rao, Kranthi Kumar Lella, Shyam Sunder Pabboju, and Ramesh Vatambeti. Enhancing food recognition accuracy using hybrid transformer models and image preprocessing techniques. Scientific Reports, 15(1):5591, 2025.
[18] Hokuto Kagaya, Kiyoharu Aizawa, and Makoto Ogawa. Food detection and recognition using convolutional neural network. In Proceedings of the 22nd ACM international conference on Multimedia, pages 1085–1088, 2014.
[19] Razia Sulthana Abdul Kareem, Timothy Tilford, and Stoyan Stoyanov. Fine-grained food image classification and recipe extraction using a customized deep neural network and nlp. Computers in Biology and Medicine, 175:108528, 2024.
[20] Muhammad Talha Khan and Muhammad Hassan Khan. A cloud edge collaboration of food recognition using deep neural networks. Journal of Artificial Intelligence and Computing, 2(1):9–18, 2024.
[21] Chairi Kiourt, George Pavlidis, and Stella Markantonatou. Deep learning approaches in food recognition. In Machine learning paradigms: advances in deep learning-based technological applications, pages 83–
[22] Seongwon Lee, Hongje Seong, Suhyeon Lee, and Euntai Kim. Correlation verification for image retrieval. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 5374–5384, 2022.
[23] Zhiwei Lin and Yongtao Wang. Vl-sam-v2: Open-world object detection with general and specific query fusion. arXiv preprint arXiv:2505.18986, 2025.
[24] Chang Liu, Yu Cao, Yan Luo, Guanling Chen, Vinod Vokkarane, and Yunsheng Ma. Deepfood: Deep learning-based food image recognition for computer-aided dietary assessment. In International Conference on Smart Homes and Health Telematics, pages 37–48. Springer, 2016.
[25] Chang Liu, Yu Cao, Yan Luo, Guanling Chen, Vinod Vokkarane, Ma Yunsheng, Songqing Chen, and Peng Hou. A new deep learning-based food recognition system for dietary assessment on an edge computing service infrastructure. IEEE Transactions on Services Computing, 11(2):249–261, 2017.
[26] Zheng Ma, Mianzhi Pan, Wenhan Wu, Kanzhi Cheng, Jianbing Zhang, Shujian Huang, and Jiajun Chen. Food-500 cap: A fine-grained food caption benchmark for evaluating vision-language models, 2023.
[27] Chinedu Emmanuel Mbonu, Kenechukwu Anigbogu, Doris Asogwa, and Tochukwu Belonwu. An explorative analysis of svm classifier and resnet50 architecture on african food classification, 2025.
[28] Simon Mezgec and Barbara Koroušić Seljak. Nutrinet: a deep learning food and drink image recognition system for dietary assessment. Nutrients, 9(7):657, 2017.
[29] Weiqing Min, Zhiling Wang, Yuxin Liu, Mengjiang Luo, Liping Kang, Xiaoming Wei, Xiaolin Wei, and Shuqiang Jiang. Large scale visual food recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(8):9932–9949, 2023.
[30] Sharada Prasanna Mohanty, Gaurav Singhal, Eric Antoine Scuccimarra, Djilani Kebaili, Harris Héritier, Victor Boulanger, and Marcel Salathé. The food recognition benchmark: Using deep learning to recognize food in images. Frontiers in Nutrition, 9:875143, 2022.
[31] Nitis Monburinon, Salahuddin Muhammad Salim Zabir, Natthasak Vechprasit, Satoshi Utsumi, and Norio Shiratori. A novel hierarchical edge computing solution based on deep learning for distributed image recognition in iot systems. In 2019 4th International Conference on Information Technology (InCIT), pages 294–299. IEEE, 2019.
[32] OpenGVLab. Opengvlab/internvl3-8b. https://huggingface.co/OpenGVLab/InternVL3-8B, 2025. Hugging Face model card, accessed 2026-02-24.
[33] Parisa Pouladzadeh and Shervin Shirmohammadi. Mobile multi-food recognition using deep learning. ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM), 13(3s):1–21, 2017.
[34] Parisa Pouladzadeh, Gregorio Villalobos, Rana Almaghrabi, and Shervin Shirmohammadi. A novel svm based food recognition method for calorie measurement applications. In 2012 IEEE international conference on multimedia and expo workshops, pages 495–498. IEEE, 2012.
[35] Sergio Romero-Tapiador, Ruben Tolosana, Blanca Lacruz-Pleguezuelos, Laura Judith Marcos-Zambrano, Guadalupe X Bazán, Isabel Espinosa-Salinas, Julian Fierrez, Javier Ortega-Garcia, Enrique Carrillo de Santa Pau, and Aythami Morales. Are vision-language models ready for dietary assessment? exploring the next frontier in ai-powered food image recognition.... 2025.
[36] Doyen Sahoo, Wang Hao, Shu Ke, Wu Xiongwei, Hung Le, Palakorn Achananuparp, Ee-Peng Lim, and Steven CH Hoi. Foodai: Food image recognition via deep learning for smart food logging. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 2260–2268, 2019.
[37] Nareen OM Salim, Subhi RM Zeebaree, Mohammed AM Sadeeq, AH Radie, Hanan M Shukur, and Zryan Najat Rashid. Study for food recognition system using deep learning. In Journal of Physics: Conference Series, volume 1963, page 012014. IOP Publishing, 2021.
[38] Andrea Sosa-Holwerda, Oak-Hee Park, Kembra Albracht-Schulte, Surya Niraula, Leslie Thompson, and Wilna Oldewage-Theron. The role of artificial intelligence in nutrition research: a scoping review. Nutrients, 16(13):2066, 2024.
[39] Qwen Team. Qwen2.5-vl technical report, 2025.
[40] Aida Turrini. Perspectives of dietary assessment in human health and disease. Nutrients, 14(4):830, 2022.
[41] Vikhyat Kumar and contributors. vikhyatk/moondream2. https://huggingface.co/vikhyatk/moondream2, 2025. Hugging Face model card, accessed 2026-02-24.
[42] Yishu Wang, Fangyu Zhou, Xiaokang Han, Kecheng Yao, and Zhuying Li. Foodsage: Addressing recognition uncertainty in automated dietary monitoring through human-robot dialogue. In Proceedings of the Extended Abstracts of the CHI Conference on Human Factors in Computing Systems, pages 1–7, 2025.
[43] Muqing Xu. A closed-loop multi-agent system driven by llms for meal-level personalized nutrition management. arXiv preprint arXiv:2601.04491, 2026.
[44] Hui Ye and Qiming Zou. Food recognition and dietary assessment for healthcare system at mobile device end using mask r-cnn. In International Conference on Testbeds and Research Infrastructures, pages 18–35. Springer, 2019.
[45] Yuehao Yin, Huiyan Qi, Bin Zhu, Jingjing Chen, Yu-Gang Jiang, and Chong-Wah Ngo. Foodlmm: A versatile food assistant using large multi-modal model, 2024.
[46] Yudong Zhang, Lijia Deng, Hengde Zhu, Wei Wang, Zeyu Ren, Qinghua Zhou, Siyuan Lu, Shiting Sun, Ziquan Zhu, Juan Manuel Gorriz, et al. Deep learning in food category recognition. Information Fusion, 98:101859, 2023.
[47] Pengfei Zhou, Weiqing Min, Chaoran Fu, Ying Jin, Mingyu Huang, Xiangyang Li, Shuhuan Mei, and Shuqiang Jiang. Foodsky: A food-oriented large language model that can pass the chef and dietetic examinations. Patterns, 6(5):101234, 2025.
[48] Jinguo Zhu et al. Internvl3: Exploring advanced training and test-time recipes for open-source multimodal models, 2025.

[1] [1]

https://www.guardrailsai

Mahyar Abbasian, Iman Azimi, Amir M Rahmani, and Ramesh Jain. Conversational health agents: A personalized llm-powered agent frame- work.arXiv preprint arXiv:2310.02374, 2023

work page arXiv 2023

[2] [2]

Conversational health agents: a personalized large language model- powered agent framework.JAMIA open, 8(4):ooaf067, 2025

Mahyar Abbasian, Iman Azimi, Amir M Rahmani, and Ramesh Jain. Conversational health agents: a personalized large language model- powered agent framework.JAMIA open, 8(4):ooaf067, 2025

2025

[3] [3]

Automatic food recognition us- ing deep convolutional neural networks with self-attention mechanism

Rahib Abiyev and Joseph Adepoju. Automatic food recognition us- ing deep convolutional neural networks with self-attention mechanism. Human-Centric Intelligent Systems, 4(1):171–186, 2024

2024

[4] [4]

Adaptllm/food-llama-3.2-11b-vision-instruct

AdaptLLM. Adaptllm/food-llama-3.2-11b-vision-instruct. https:// huggingface.co/AdaptLLM/food-Llama-3.2-11B-Vision-Instruct, 2025. Hugging Face model card, accessed 2026-02-24

2025

[5] [5]

A review on food recognition technology for health applications.Health psychology research, 8(3):9297, 2020

Dario Allegra, Sebastiano Battiato, Alessandro Ortis, Salvatore Urso, and Riccardo Polosa. A review on food recognition technology for health applications.Health psychology research, 8(3):9297, 2020

2020

[6] [6]

Mobile and wearable sensors for data-driven health monitoring system: State-of-the-art and future prospect.Expert Systems with Applications, 202:117362, 2022

Chioma Virginia Anikwe, Henry Friday Nweke, Anayo Chukwu Ikegwu, Chukwunonso Adolphus Egwuonwu, Fergus Uchenna Onu, Uzoma Rita Alo, and Ying Wah Teh. Mobile and wearable sensors for data-driven health monitoring system: State-of-the-art and future prospect.Expert Systems with Applications, 202:117362, 2022

2022

[7] Aritra Bhowmik, Mohammad Mahdi Derakhshani, Dennis Koelma, Yuki M Asano, Martin R Oswald, and Cees GM Snoek. Twist & scout: Grounding multimodal llm-experts by forget-free tuning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 1359–1368, 2025

[8] Chi-Sheng Chen, Guan-Ying Chen, Dong Zhou, Di Jiang, and Dai-Shi Chen. Res-vmamba: Fine-grained food category visual classification using selective state space models with deep residual learning. arXiv preprint arXiv:2402.15761, 2024

[9] Daixuan Cheng, Shaohan Huang, and Furu Wei. Adapting large language models to domains via reading comprehension. arXiv preprint arXiv:2309.09530, 2023

[10] Daixuan Cheng, Shaohan Huang, Ziyu Zhu, Xintong Zhang, Wayne Xin Zhao, Zhongzhi Luan, Bo Dai, and Zhenliang Zhang. On domain-adaptive post-training for multimodal large language models, 2024

[11] Stergios Christodoulidis, Marios Anthimopoulos, and Stavroula Mougiakakou. Food recognition for dietary assessment using deep convolutional neural networks. In International conference on image analysis and processing, pages 458–465. Springer, 2015

[12] Florin Dumitrescu, Adina Magda Florea, Mihai Trăscău, and Alexandru Sorici. Intelligent agent for food recognition in a smart fridge. In 2022 24th International Symposium on Symbolic and Numeric Algorithms for Scientific Computing (SYNASC), pages 220–225. IEEE, 2022

[13] Pedro Furtado, Manuel Caldeira, and Pedro Martins. Human visual system vs convolution neural networks in food recognition task: An empirical comparison. Computer Vision and Image Understanding, 191:102878, 2020

[14] Tonmoy Ghosh and Edward Sazonov. Improving food image recognition with noisy vision transformer. arXiv preprint arXiv:2503.18997, 2025

[15] Ziyu Guo, Yong Yin, Haolin Gu, Guihua Peng, Xueya Wang, Ju Chen, and Jia Yan. An integrated lightweight neural network design and fpga-accelerated edge computing for chili pepper variety and origin identification via an e-nose. Foods, 14(15):2612, 2025

[16] Jie Hu, Li Shen, and Gang Sun. Squeeze-and-excitation networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 7132–7141, 2018

[17] BN Jagadesh, Srihari Varma Mantena, Asha P Sathe, T Prabhakara Rao, Kranthi Kumar Lella, Shyam Sunder Pabboju, and Ramesh Vatambeti. Enhancing food recognition accuracy using hybrid transformer models and image preprocessing techniques. Scientific Reports, 15(1):5591, 2025

[18] Hokuto Kagaya, Kiyoharu Aizawa, and Makoto Ogawa. Food detection and recognition using convolutional neural network. In Proceedings of the 22nd ACM international conference on Multimedia, pages 1085–1088, 2014

[19] Razia Sulthana Abdul Kareem, Timothy Tilford, and Stoyan Stoyanov. Fine-grained food image classification and recipe extraction using a customized deep neural network and nlp. Computers in Biology and Medicine, 175:108528, 2024

[20] Muhammad Talha Khan and Muhammad Hassan Khan. A cloud edge collaboration of food recognition using deep neural networks. Journal of Artificial Intelligence and Computing, 2(1):9–18, 2024

[21] Chairi Kiourt, George Pavlidis, and Stella Markantonatou. Deep learning approaches in food recognition. In Machine learning paradigms: advances in deep learning-based technological applications, pages 83–

[22] Seongwon Lee, Hongje Seong, Suhyeon Lee, and Euntai Kim. Correlation verification for image retrieval. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 5374–5384, 2022

[23] Zhiwei Lin and Yongtao Wang. VL-SAM-v2: Open-world object detection with general and specific query fusion. arXiv preprint arXiv:2505.18986, 2025

[24] Chang Liu, Yu Cao, Yan Luo, Guanling Chen, Vinod Vokkarane, and Yunsheng Ma. Deepfood: Deep learning-based food image recognition for computer-aided dietary assessment. In International Conference on Smart Homes and Health Telematics, pages 37–48. Springer, 2016

[25] Chang Liu, Yu Cao, Yan Luo, Guanling Chen, Vinod Vokkarane, Ma Yunsheng, Songqing Chen, and Peng Hou. A new deep learning-based food recognition system for dietary assessment on an edge computing service infrastructure. IEEE Transactions on Services Computing, 11(2):249–261, 2017

[26] Zheng Ma, Mianzhi Pan, Wenhan Wu, Kanzhi Cheng, Jianbing Zhang, Shujian Huang, and Jiajun Chen. Food-500 cap: A fine-grained food caption benchmark for evaluating vision-language models, 2023

[27] Chinedu Emmanuel Mbonu, Kenechukwu Anigbogu, Doris Asogwa, and Tochukwu Belonwu. An explorative analysis of svm classifier and resnet50 architecture on african food classification, 2025

[28] Simon Mezgec and Barbara Koroušić Seljak. Nutrinet: a deep learning food and drink image recognition system for dietary assessment. Nutrients, 9(7):657, 2017

[29] Weiqing Min, Zhiling Wang, Yuxin Liu, Mengjiang Luo, Liping Kang, Xiaoming Wei, Xiaolin Wei, and Shuqiang Jiang. Large scale visual food recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(8):9932–9949, 2023

[30] Sharada Prasanna Mohanty, Gaurav Singhal, Eric Antoine Scuccimarra, Djilani Kebaili, Harris Héritier, Victor Boulanger, and Marcel Salathé. The food recognition benchmark: Using deep learning to recognize food in images. Frontiers in Nutrition, 9:875143, 2022

[31] Nitis Monburinon, Salahuddin Muhammad Salim Zabir, Natthasak Vechprasit, Satoshi Utsumi, and Norio Shiratori. A novel hierarchical edge computing solution based on deep learning for distributed image recognition in iot systems. In 2019 4th International Conference on Information Technology (InCIT), pages 294–299. IEEE, 2019

[32] OpenGVLab. Opengvlab/internvl3-8b. https://huggingface.co/OpenGVLab/InternVL3-8B, 2025. Hugging Face model card, accessed 2026-02-24

[33] Parisa Pouladzadeh and Shervin Shirmohammadi. Mobile multi-food recognition using deep learning. ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM), 13(3s):1–21, 2017

[34] Parisa Pouladzadeh, Gregorio Villalobos, Rana Almaghrabi, and Shervin Shirmohammadi. A novel svm based food recognition method for calorie measurement applications. In 2012 IEEE international conference on multimedia and expo workshops, pages 495–498. IEEE, 2012

[35] Sergio Romero-Tapiador, Ruben Tolosana, Blanca Lacruz-Pleguezuelos, Laura Judith Marcos-Zambrano, Guadalupe X Bazán, Isabel Espinosa-Salinas, Julian Fierrez, Javier Ortega-Garcia, Enrique Carrillo de Santa Pau, and Aythami Morales. Are vision-language models ready for dietary assessment? exploring the next frontier in ai-powered food image recognition.... 2025

[36] Doyen Sahoo, Wang Hao, Shu Ke, Wu Xiongwei, Hung Le, Palakorn Achananuparp, Ee-Peng Lim, and Steven CH Hoi. Foodai: Food image recognition via deep learning for smart food logging. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 2260–2268, 2019

[37] Nareen OM Salim, Subhi RM Zeebaree, Mohammed AM Sadeeq, AH Radie, Hanan M Shukur, and Zryan Najat Rashid. Study for food recognition system using deep learning. In Journal of Physics: Conference Series, volume 1963, page 012014. IOP Publishing, 2021

[38] Andrea Sosa-Holwerda, Oak-Hee Park, Kembra Albracht-Schulte, Surya Niraula, Leslie Thompson, and Wilna Oldewage-Theron. The role of artificial intelligence in nutrition research: a scoping review. Nutrients, 16(13):2066, 2024

[39] Qwen Team. Qwen2.5-vl technical report, 2025

[40] Aida Turrini. Perspectives of dietary assessment in human health and disease. Nutrients, 14(4):830, 2022

[41] Vikhyat Kumar and contributors. vikhyatk/moondream2. https://huggingface.co/vikhyatk/moondream2, 2025. Hugging Face model card, accessed 2026-02-24

[42] Yishu Wang, Fangyu Zhou, Xiaokang Han, Kecheng Yao, and Zhuying Li. Foodsage: Addressing recognition uncertainty in automated dietary monitoring through human-robot dialogue. In Proceedings of the Extended Abstracts of the CHI Conference on Human Factors in Computing Systems, pages 1–7, 2025

[43] Muqing Xu. A closed-loop multi-agent system driven by llms for meal-level personalized nutrition management. arXiv preprint arXiv:2601.04491, 2026

[44] Hui Ye and Qiming Zou. Food recognition and dietary assessment for healthcare system at mobile device end using mask r-cnn. In International Conference on Testbeds and Research Infrastructures, pages 18–35. Springer, 2019

[45] Yuehao Yin, Huiyan Qi, Bin Zhu, Jingjing Chen, Yu-Gang Jiang, and Chong-Wah Ngo. Foodlmm: A versatile food assistant using large multi-modal model, 2024

[46] Yudong Zhang, Lijia Deng, Hengde Zhu, Wei Wang, Zeyu Ren, Qinghua Zhou, Siyuan Lu, Shiting Sun, Ziquan Zhu, Juan Manuel Gorriz, et al. Deep learning in food category recognition. Information Fusion, 98:101859, 2023

[47] Pengfei Zhou, Weiqing Min, Chaoran Fu, Ying Jin, Mingyu Huang, Xiangyang Li, Shuhuan Mei, and Shuqiang Jiang. Foodsky: A food-oriented large language model that can pass the chef and dietetic examinations. Patterns, 6(5):101234, 2025

[48] Jinguo Zhu et al. Internvl3: Exploring advanced training and test-time recipes for open-source multimodal models, 2025