
arXiv: 2604.06352 · v1 · submitted 2026-04-07 · 💻 cs.CV · cs.AI · cs.MM · eess.IV

DietDelta: A Vision-Language Approach for Dietary Assessment via Before-and-After Images

Pith reviewed 2026-05-10 18:28 UTC · model grok-4.3

classification: 💻 cs.CV · cs.AI · cs.MM · eess.IV
keywords: dietary assessment · before-and-after images · vision-language models · food consumption · weight estimation · nutritional analysis · image-based assessment

The pith

Vision-language prompts on paired before-and-after food images enable item-level weight and consumption estimates from ordinary RGB photos.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that a vision-language model can localize specific food items and estimate their weights using natural language prompts on single RGB images, then compute consumption as the difference between before-and-after pairs. This approach avoids the need for depth sensors, multi-view captures, or manual segmentation masks that limit prior dietary assessment tools. A sympathetic reader would care because current single-image methods give only coarse meal-level totals and cannot confirm what was actually eaten, while this framework aims for precise, item-by-item nutritional tracking. The authors train the system in two stages to first handle localization and weight prediction, then difference estimation, and report better results than existing methods on three public datasets.

Core claim

The paper claims that a simple vision-language framework can perform food-item-level nutritional analysis by applying natural language prompts to localize items and estimate weights directly from single RGB images, then predict consumption through weight differences between before-and-after image pairs using a two-stage training process. This yields consistent improvements over prior approaches across three public datasets and serves as a baseline for before-and-after dietary image analysis without requiring depth information, multi-view imagery, or explicit segmentation masks.

What carries the argument

A two-stage vision-language model that applies natural language prompts to paired RGB images to localize food items, predict individual weights, and compute consumption as the difference between the before and after estimates.
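
The differencing step described above can be sketched in a few lines of Python. This shows only the arithmetic, not the model: the function name `consumption_by_item`, the gram values, and the rule that an item missing from the after-image counts as fully eaten are all assumptions made for illustration, and the per-item weights stand in for whatever the vision-language model would predict.

```python
def consumption_by_item(before_grams, after_grams):
    """Per-item consumption in grams: before-meal estimate minus after-meal estimate."""
    consumed = {}
    for item, before in before_grams.items():
        # Assumption: an item absent from the after-image estimates was fully eaten.
        after = after_grams.get(item, 0.0)
        consumed[item] = before - after
    return consumed


# Hypothetical per-item weight estimates (grams) for one before/after pair.
before = {"rice": 200.0, "chicken": 150.0}
after = {"rice": 80.0}
print(consumption_by_item(before, after))
# {'rice': 120.0, 'chicken': 150.0}
```

Because each output is a plain subtraction, any error in either weight estimate carries straight into the consumption figure.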

Load-bearing premise

Natural language prompts on ordinary single RGB images are sufficient to accurately localize food items and estimate their weights without depth data, multiple views, or segmentation masks.

What would settle it

The claim of consistent improvements would be refuted if, on a new set of before-and-after food images with ground-truth weight measurements, the model's predicted differences showed no accuracy gain over single-image baselines.

Figures

Figures reproduced from arXiv: 2604.06352 by Bruce Coburn, Fengqing Zhu, Gautham Vinod, Siddeshwar Raghavan.

Figure 1. [image]
Figure 2. Method Overview. The two-stage training strategy uses only the before-eating images in the Absolute Weight Estimation stage to learn the patches related to the input prompt. This knowledge is used to finetune the model in the Weight Difference Estimation stage to predict the weight difference of the food item in the text prompt. The text embeddings and image patch embeddings are fused to learn the most rel…
Figure 3. Predicted and Ground Truth Weight Difference Analysis. (a) Frequency distribution of weight differences showing strong overlap between predictions and ground truth. (b) Bivariate density plot of predictions vs. ground truth, stratified by food structure. The alignment along the y = x diagonal across both Solid and Amorphous types confirms the model’s generalization capability. Weight Difference Estimati…
Figure 4. Qualitative Results. Images from the different datasets are analyzed with corresponding text prompts to show the activation of the image patches via the cross-attention mechanism. The text prompt serves as a semantic anchor, guiding the model’s attention to the specific region of interest within the complex scene of a meal. To validate that our model successfully learns this text-to-visual correspondenc…
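
The captions for Figures 2 and 4 describe a text prompt attending over image patch embeddings. As a rough illustration of that idea, the sketch below computes scaled dot-product attention weights from one text query vector over a handful of patch vectors. It is not the paper's architecture: it uses a single head and no learned projections, and the function name `patch_attention` and the toy vectors are assumptions made up for this example.

```python
import math


def patch_attention(text_query, patch_keys):
    """Softmax over scaled dot products of one text query with each patch key."""
    d = len(text_query)
    scores = [
        sum(q * k for q, k in zip(text_query, key)) / math.sqrt(d)
        for key in patch_keys
    ]
    # Subtract the maximum score before exponentiating for numerical stability.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Toy 2-D embeddings: the first patch aligns with the query, the last opposes it.
weights = patch_attention([1.0, 0.0], [[2.0, 0.0], [0.0, 2.0], [-1.0, 0.0]])
print(weights)
```

The weights sum to one, and the patch most aligned with the query receives the largest share, which is the behaviour the qualitative attention maps are meant to visualise.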
Original abstract

Accurate dietary assessment is critical for precision nutrition, yet most image-based methods rely on a single pre-consumption image and provide only coarse, meal-level estimates. These approaches cannot determine what was actually consumed and often require restrictive inputs such as depth sensing, multi-view imagery, or explicit segmentation. In this paper, we propose a simple vision-language framework for food-item-level nutritional analysis using paired before-and-after eating images. Instead of relying on rigid segmentation masks, our method leverages natural language prompts to localize specific food items and estimate their weight directly from a single RGB image. We further estimate food consumption by predicting weight differences between paired images using a two-stage training strategy. We evaluate our method on three publicly available datasets and demonstrate consistent improvements over existing approaches, establishing a strong baseline for before-and-after dietary image analysis.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 3 minor

Summary. The paper proposes DietDelta, a vision-language framework for food-item-level dietary assessment that takes paired before-and-after eating images as input. It uses natural language prompts on a single RGB image to localize specific food items and directly regress their weights, then applies a two-stage training procedure to predict consumption from the weight differences between the pair. The method avoids depth sensing, multi-view capture, or explicit segmentation masks. Evaluation on three public datasets is reported to yield consistent improvements over prior single-image approaches, positioning the work as a baseline for before-and-after dietary image analysis.

Significance. If the quantitative gains and weight-estimation accuracy hold under scrutiny, the work would be a useful incremental contribution to precision nutrition. It shifts focus from pre-consumption meal-level estimates to actual item-level consumption using only ordinary RGB pairs, which are easier to collect than depth or multi-view data. The two-stage VLM pipeline is conceptually simple and could serve as a reproducible starting point for follow-on research that adds calibration or multi-view constraints.

major comments (3)
  1. [§3] §3 (Method): The central claim that natural-language prompts applied to a monocular RGB image suffice to localize items and regress accurate weights is not accompanied by any analysis of the geometric and photometric ambiguity. Weight is volume times density; the manuscript provides no mechanism, calibration step, or auxiliary loss that recovers 3D shape or food-specific density from 2D appearance alone. Because downstream consumption is computed from these per-item estimates, any systematic bias in the first stage directly undermines the reported gains from the before-and-after formulation.
  2. [§4] §4 (Experiments): No ablation isolates the contribution of the paired-image difference prediction from the single-image weight estimator. Without such controls, it is impossible to attribute the claimed improvements on the three datasets to the before-and-after design rather than to a stronger base VLM or better prompting. In addition, the results tables lack per-item weight error metrics, error propagation analysis, or statistical significance tests against the strongest single-image baselines.
  3. [§4.2] §4.2 (Datasets and metrics): The evaluation uses three public datasets yet reports only aggregate improvements without breaking down performance by food category, occlusion level, or lighting variation. This makes it difficult to assess whether the method generalizes or merely exploits dataset-specific biases in the before-and-after pairs.
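
The concern in major comment 1, that first-stage bias propagates into the consumption estimate, can be made concrete with a short worked derivation. The error model below is an assumption introduced here for illustration and does not come from the paper: the before and after estimates share an additive bias $\beta$ or a relative bias $\gamma$, and carry zero-mean noise terms $\varepsilon_b, \varepsilon_a$ with standard deviations $\sigma_b, \sigma_a$ and correlation $r$.

```latex
% Assumed additive error model (illustrative, not from the paper)
\hat{w}_b = w_b + \beta + \varepsilon_b, \qquad
\hat{w}_a = w_a + \beta + \varepsilon_a

% A shared additive bias cancels in the difference; the noise does not
\hat{\Delta} = \hat{w}_b - \hat{w}_a = (w_b - w_a) + (\varepsilon_b - \varepsilon_a)

\operatorname{Var}(\hat{\Delta}) = \sigma_b^2 + \sigma_a^2 - 2 r \sigma_b \sigma_a

% Assumed relative (multiplicative) bias: it survives the subtraction
\hat{w}_b = (1 + \gamma)\, w_b, \quad \hat{w}_a = (1 + \gamma)\, w_a
\;\Longrightarrow\;
\hat{\Delta} = (1 + \gamma)(w_b - w_a)
```

Under these assumptions, differencing removes only a bias that is identical in both images. A density or scale error that is proportional to the true weight passes through unchanged, and uncorrelated noise makes the variance of the difference larger than that of either single estimate.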
minor comments (3)
  1. [Abstract] The abstract states 'consistent improvements' without naming the datasets or quoting any numeric deltas; adding one or two key numbers would improve readability.
  2. [§3.3] Notation for the two-stage loss (Eq. 3 and Eq. 5) uses the same symbol for the weight estimator in both stages; a subscript distinguishing the stages would reduce confusion.
  3. [Figure 2] Figure 2 caption does not specify the exact prompt templates used for localization and weight regression; including them would aid reproducibility.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. The comments highlight important aspects of our method and evaluation that we will address to strengthen the manuscript. We respond to each major comment below.

Point-by-point responses
  1. Referee: [§3] §3 (Method): The central claim that natural-language prompts applied to a monocular RGB image suffice to localize items and regress accurate weights is not accompanied by any analysis of the geometric and photometric ambiguity. Weight is volume times density; the manuscript provides no mechanism, calibration step, or auxiliary loss that recovers 3D shape or food-specific density from 2D appearance alone. Because downstream consumption is computed from these per-item estimates, any systematic bias in the first stage directly undermines the reported gains from the before-and-after formulation.

    Authors: We acknowledge that monocular RGB images inherently contain geometric and photometric ambiguities, and our framework does not include an explicit 3D reconstruction module or food-specific density estimation. Instead, the vision-language model learns implicit mappings from 2D appearance to weight via supervised training on datasets with ground-truth weights. The before-and-after formulation is intended to reduce some biases by emphasizing differences rather than absolute values. We agree that a dedicated analysis of these limitations is warranted. We will add a new subsection discussing potential error sources from viewpoint, lighting, and density variations, along with qualitative examples of failure cases.
    revision: partial

  2. Referee: [§4] §4 (Experiments): No ablation isolates the contribution of the paired-image difference prediction from the single-image weight estimator. Without such controls, it is impossible to attribute the claimed improvements on the three datasets to the before-and-after design rather than to a stronger base VLM or better prompting. In addition, the results tables lack per-item weight error metrics, error propagation analysis, or statistical significance tests against the strongest single-image baselines.

    Authors: We agree that an explicit ablation is necessary to isolate the benefit of the paired-image stage. We will add experiments comparing the full two-stage model against a single-image weight estimator baseline (using the same VLM backbone and prompts) on all three datasets. We will also report per-item mean absolute percentage error (MAPE) for weight estimation, include a basic error propagation discussion for the difference computation, and add statistical significance tests (e.g., paired t-tests) against the strongest single-image baselines.
    revision: yes

  3. Referee: [§4.2] §4.2 (Datasets and metrics): The evaluation uses three public datasets yet reports only aggregate improvements without breaking down performance by food category, occlusion level, or lighting variation. This makes it difficult to assess whether the method generalizes or merely exploits dataset-specific biases in the before-and-after pairs.

    Authors: We will expand the experimental section with per-food-category breakdowns (e.g., for common categories like fruits, proteins, and grains) in the main paper or supplementary material. For occlusion and lighting, we will analyze performance on dataset subsets where such variations are annotated or can be inferred, and add a short discussion on how the before-and-after pairs help mitigate certain biases. This will better demonstrate the method's robustness.
    revision: partial
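
The metrics promised in responses 2 and 3 (per-item MAPE, per-category breakdowns, and a paired t-test) are simple to compute. The following is a minimal sketch using only the Python standard library. The function names, the record layout `(category, true_grams, predicted_grams)`, and the numbers in the usage lines are hypothetical, and nothing here reflects the authors' actual evaluation code.

```python
import math
from collections import defaultdict
from statistics import mean, stdev


def mape(true, pred):
    """Mean absolute percentage error in percent; true values must be nonzero."""
    return 100.0 * mean(abs(p - t) / abs(t) for t, p in zip(true, pred))


def mape_by_category(records):
    """records: iterable of (category, true_grams, predicted_grams) tuples."""
    groups = defaultdict(lambda: ([], []))
    for category, t, p in records:
        groups[category][0].append(t)
        groups[category][1].append(p)
    return {c: mape(ts, ps) for c, (ts, ps) in groups.items()}


def paired_t_statistic(errors_a, errors_b):
    """t statistic of the paired differences errors_a - errors_b (n - 1 d.o.f.)."""
    diffs = [a - b for a, b in zip(errors_a, errors_b)]
    return mean(diffs) / (stdev(diffs) / math.sqrt(len(diffs)))


# Hypothetical records and per-sample absolute errors, for illustration only.
records = [("fruit", 100.0, 110.0), ("fruit", 200.0, 180.0), ("grain", 50.0, 60.0)]
print(mape_by_category(records))
print(paired_t_statistic([3.0, 4.0, 5.0, 6.0], [1.0, 1.0, 1.0, 1.0]))
```

A negative t statistic here would mean the first method's errors are smaller on average. Converting the statistic to a p-value requires the Student t distribution with n - 1 degrees of freedom, which the standard library does not provide.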

Circularity Check

0 steps flagged

No circularity detected in the proposed vision-language dietary assessment framework

Full rationale

The paper describes a standard applied ML pipeline: a vision-language model that uses natural language prompts on single RGB images to localize food items and regress weights, followed by differencing on paired before-and-after images via a two-stage training procedure. Evaluation consists of quantitative comparison against prior methods on three external public datasets. No mathematical derivations, equations, fitted-parameter renamings, uniqueness theorems, or self-citation chains appear that would reduce the reported improvements to a tautology or construction from the inputs themselves. The central claims therefore remain independent of the evaluation data and are not forced by definition or self-reference.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that current vision-language models can perform accurate food localization and weight regression from monocular RGB images when guided by text prompts; no free parameters or invented entities are described in the abstract.

axioms (1)
  • domain assumption: Vision-language models can localize specific food items and regress their weights directly from single RGB images using natural language prompts.
    This premise is required for the localization and weight-estimation steps described in the abstract.

pith-pipeline@v0.9.0 · 5451 in / 1323 out tokens · 68087 ms · 2026-05-10T18:28:56.330493+00:00 · methodology


