FoodSense: A Multisensory Food Dataset and Benchmark for Predicting Taste, Smell, Texture, and Sound from Images
Pith reviewed 2026-05-10 13:01 UTC · model grok-4.3
The pith
A new dataset of human ratings lets vision-language models predict taste, smell, texture and sound from food images along with visual explanations.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We introduce FoodSense, a human-annotated dataset for cross-sensory inference containing 66,842 participant-image pairs across 2,987 unique food images. Each pair includes numeric ratings (1-5) and free-text descriptors for four sensory dimensions: taste, smell, texture, and sound. To enable models to both predict and explain sensory expectations, we expand short human annotations into image-grounded reasoning traces generated by a large language model conditioned on the image, ratings, and descriptors. Using these annotations, we train FoodSense-VL, a vision-language benchmark model, to produce both multisensory ratings and grounded explanations directly from food images.
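To make the shape of the data concrete, here is a minimal sketch of how one participant-image pair might be represented and aggregated per image. The class name, field names, and the `mean_ratings_per_image` helper are hypothetical and not taken from the paper; only the four senses, the 1-5 scale, and the two reported counts come from the claim above.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean

SENSES = ("taste", "smell", "texture", "sound")


@dataclass
class Annotation:
    """One participant-image pair (hypothetical field names)."""
    image_id: str
    participant_id: str
    ratings: dict      # sense -> integer rating on the 1-5 scale
    descriptors: dict  # sense -> short free-text descriptor


def validate(annotation):
    """Raise if any of the four ratings falls outside the 1-5 scale."""
    for sense in SENSES:
        value = annotation.ratings[sense]
        if not 1 <= value <= 5:
            raise ValueError(f"{sense} rating {value} is outside 1-5")


def mean_ratings_per_image(annotations):
    """Average the ratings of all participants who saw the same image."""
    grouped = defaultdict(lambda: defaultdict(list))
    for a in annotations:
        validate(a)
        for sense in SENSES:
            grouped[a.image_id][sense].append(a.ratings[sense])
    return {
        image_id: {sense: mean(values) for sense, values in by_sense.items()}
        for image_id, by_sense in grouped.items()
    }


# Counts reported in the abstract.
PAIRS, IMAGES = 66_842, 2_987
avg_annotations_per_image = PAIRS / IMAGES  # roughly 22 participants per image
```

The last line is simple arithmetic on the reported counts: on average each image was rated by roughly 22 participants, which is what makes per-image means a plausible prediction target.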
What carries the argument
The FoodSense dataset of image-paired numeric ratings and free-text descriptors, expanded into LLM-generated image-grounded reasoning traces that serve as training targets for a vision-language model.
If this is right
- Vision-language models can be fine-tuned to output numeric sensory ratings together with image-grounded textual justifications for unseen food photographs.
- Standard metrics common in vision-language evaluation are shown to be inadequate for judging success on multisensory prediction tasks.
- The approach demonstrates a scalable way to convert limited human sensory annotations into larger training signals for multimodal models.
- The resulting models connect cognitive-science findings on cross-sensory perception directly to instruction-tuned vision-language systems.
Where Pith is reading between the lines
- Models trained this way could support food-recommendation systems that anticipate how a dish will be perceived sensorially before it is prepared or ordered.
- The dataset opens a route to study systematic differences in visual-to-sensory mappings across demographic groups or cultural food traditions.
- Future tests could examine whether the same training recipe improves performance on related tasks such as predicting nutritional appeal or safety from appearance alone.
Load-bearing premise
The human ratings and descriptions accurately capture genuine cross-sensory expectations, and the language-model-generated reasoning traces remain faithful to the original images and annotations without introducing artifacts.
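One crude way to probe the second half of this premise is to check whether a generated trace actually mentions the human descriptors it was conditioned on. The sketch below is an illustrative check, not the authors' procedure: the function names are hypothetical, matching is plain lowercase word overlap, and the example trace is invented for the demonstration. The descriptors are the representative responses quoted in the paper's annotation protocol.

```python
import re


def tokens(text):
    """Lowercase alphabetic word set of a string."""
    return set(re.findall(r"[a-z]+", text.lower()))


def descriptor_coverage(trace, descriptors):
    """Fraction of human descriptors whose words all appear in the trace.

    Assumption: exact word overlap stands in for faithfulness. It misses
    paraphrases and says nothing about grounding in the image itself.
    """
    trace_tokens = tokens(trace)
    usable = [d for d in descriptors if tokens(d)]
    if not usable:
        return 1.0  # nothing to cover
    hits = sum(1 for d in usable if tokens(d) <= trace_tokens)
    return hits / len(usable)


# Invented trace, used only to exercise the function.
trace = ("The golden edges and blistered crust suggest a crispy bite "
         "with a smoky aroma.")
coverage = descriptor_coverage(trace, ["crispy", "golden edges", "smoky", "silent"])
```

A low coverage score would flag traces that drifted from the annotations. A high score does not establish faithfulness, so a check like this could only complement human evaluation.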
What would settle it
Collect fresh human ratings and descriptions for a new set of food images excluded from the original dataset, then measure whether FoodSense-VL predictions agree with those new ratings clearly better than chance and better than non-specialized baseline models.
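The comparison described here could be sketched as follows, assuming per-image mean human ratings as the target for one sense and a constant predictor (the training-set mean) as the trivial baseline. The metric choices, mean absolute error and Lin's concordance correlation coefficient, are illustrative, and all function names are hypothetical.

```python
from statistics import mean


def mae(pred, target):
    """Mean absolute error between predictions and fresh human means."""
    return mean(abs(p - t) for p, t in zip(pred, target))


def concordance_cc(pred, target):
    """Lin's concordance correlation coefficient (population moments).

    Unlike Pearson's r, it penalizes shifts in mean and scale, so a model
    that ranks images correctly but is offset by a point scores lower.
    """
    mp, mt = mean(pred), mean(target)
    var_p = mean((p - mp) ** 2 for p in pred)
    var_t = mean((t - mt) ** 2 for t in target)
    cov = mean((p - mp) * (t - mt) for p, t in zip(pred, target))
    denom = var_p + var_t + (mp - mt) ** 2
    return 2 * cov / denom if denom else 0.0


def beats_constant_baseline(model_pred, fresh_human_means, train_mean):
    """True if the model has lower MAE than always predicting train_mean."""
    baseline = [train_mean] * len(fresh_human_means)
    return mae(model_pred, fresh_human_means) < mae(baseline, fresh_human_means)


# Toy numbers, invented for illustration only.
fresh = [1.0, 2.0, 3.0, 4.0, 5.0]
model = [1.5, 2.0, 3.0, 4.0, 4.5]
result = beats_constant_baseline(model, fresh, train_mean=3.0)
```

A constant predictor gets a concordance of zero by construction, which is why it is a useful floor: any model that has learned something image-specific should clear it on both metrics.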
Original abstract
Humans routinely infer taste, smell, texture, and even sound from food images, a phenomenon well studied in cognitive science. However, prior vision-language research on food has focused primarily on recognition tasks such as meal identification, ingredient detection, and nutrition estimation. Image-based prediction of multisensory experience remains largely unexplored. We introduce FoodSense, a human-annotated dataset for cross-sensory inference containing 66,842 participant-image pairs across 2,987 unique food images. Each pair includes numeric ratings (1-5) and free-text descriptors for four sensory dimensions: taste, smell, texture, and sound. To enable models to both predict and explain sensory expectations, we expand short human annotations into image-grounded reasoning traces. A large language model generates visual justifications conditioned on the image, ratings, and descriptors. Using these annotations, we train FoodSense-VL, a vision-language benchmark model, to produce both multisensory ratings and grounded explanations directly from food images. This work connects cognitive science findings on cross-sensory perception with modern instruction tuning for multimodal models and shows that many popular evaluation metrics are insufficient for visually sensory inference.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces FoodSense, a human-annotated dataset containing 66,842 participant-image pairs across 2,987 unique food images, with numeric ratings (1-5) and free-text descriptors for taste, smell, texture, and sound. It describes expanding these annotations into image-grounded reasoning traces via LLM conditioning on the image plus human data, then training FoodSense-VL, a vision-language model, to output both multisensory ratings and explanations directly from food images, while noting connections to cognitive science and limitations of existing VL evaluation metrics.
Significance. If the dataset annotations prove reliable and the model achieves strong performance with faithful explanations, this could establish a valuable benchmark bridging cognitive science findings on cross-sensory perception with modern multimodal instruction tuning. The dataset scale and dual focus on prediction plus grounded explanation represent a clear contribution over prior food-related VL work limited to recognition tasks.
major comments (2)
- [Abstract / reasoning trace generation] Abstract and methods description of reasoning trace generation: the pipeline expands human ratings/descriptors into LLM-generated visual justifications conditioned on the image and annotations, yet no human evaluation, hallucination checks, faithfulness metrics, or alignment scores are reported for these traces. This is load-bearing for the central claim that FoodSense-VL produces 'grounded explanations,' as unvalidated traces risk introducing artifacts that undermine the model's outputs.
- [Abstract / results] Abstract and results sections: the manuscript supplies no quantitative results, inter-annotator agreement statistics, validation procedures, baseline comparisons, or performance numbers for either the dataset quality or FoodSense-VL predictions. Without these, the empirical support for the dataset's utility and the model's effectiveness cannot be assessed.
minor comments (1)
- [Abstract] Abstract: the statement that 'many popular evaluation metrics are insufficient for visually sensory inference' is asserted without naming the metrics or providing supporting evidence, which reduces clarity.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback, which identifies key areas where additional validation and quantitative reporting will strengthen the manuscript. We address each major comment below and commit to revisions that directly respond to the concerns raised.
- Referee: [Abstract / reasoning trace generation] Abstract and methods description of reasoning trace generation: the pipeline expands human ratings/descriptors into LLM-generated visual justifications conditioned on the image and annotations, yet no human evaluation, hallucination checks, faithfulness metrics, or alignment scores are reported for these traces. This is load-bearing for the central claim that FoodSense-VL produces 'grounded explanations,' as unvalidated traces risk introducing artifacts that undermine the model's outputs.
  Authors: We agree that explicit validation of the reasoning traces is necessary to support the claim of grounded explanations. While the generation process conditions the LLM on both the input image and the original human ratings/descriptors to promote grounding, the current manuscript does not report human evaluations, hallucination checks, or quantitative faithfulness/alignment metrics. In the revised version we will add a dedicated evaluation subsection that includes: (i) human ratings of faithfulness on a held-out sample of traces, (ii) hallucination detection results, and (iii) alignment scores (e.g., semantic similarity and descriptor overlap) between the generated traces and the source human annotations.
  Revision: yes
- Referee: [Abstract / results] Abstract and results sections: the manuscript supplies no quantitative results, inter-annotator agreement statistics, validation procedures, baseline comparisons, or performance numbers for either the dataset quality or FoodSense-VL predictions. Without these, the empirical support for the dataset's utility and the model's effectiveness cannot be assessed.
  Authors: We acknowledge that the abstract and the high-level results summary currently lack explicit numerical results, inter-annotator agreement (IAA) statistics, validation procedures, baseline comparisons, and performance numbers. The revised manuscript will expand both the abstract and the results section to include: (i) IAA metrics (e.g., Krippendorff’s alpha) for the numeric ratings and descriptor annotations, (ii) details of the annotation validation protocol, (iii) baseline comparisons for FoodSense-VL, and (iv) quantitative performance figures for rating prediction and explanation generation. These additions will make the empirical contributions immediately assessable.
  Revision: yes
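For readers unfamiliar with the agreement statistic named in the rebuttal, the following is a minimal sketch of Krippendorff's alpha with the interval (squared-difference) metric, applied to the ratings each image received for a single sense. It is not the authors' evaluation code: the function name and input layout are assumptions, and the pairwise loops are quadratic in the number of ratings, so this form suits a sample rather than all 66,842 pairs.

```python
from itertools import permutations


def krippendorff_alpha_interval(units):
    """Krippendorff's alpha = 1 - D_o / D_e with squared-difference metric.

    `units` is a list of lists: the ratings one image received for one
    sense. Units with fewer than two ratings cannot be paired and are
    skipped, which is how missing ratings are handled.
    """
    pairable = [u for u in units if len(u) >= 2]
    values = [v for u in pairable for v in u]
    n = len(values)
    if n < 2:
        raise ValueError("need at least one unit with two or more ratings")

    # Observed disagreement: squared differences within each unit.
    d_o = sum(
        sum((a - b) ** 2 for a, b in permutations(u, 2)) / (len(u) - 1)
        for u in pairable
    ) / n

    # Expected disagreement: squared differences across all pooled ratings.
    d_e = sum((a - b) ** 2 for a, b in permutations(values, 2)) / (n * (n - 1))
    if d_e == 0:
        raise ValueError("all ratings are identical; alpha is undefined")
    return 1.0 - d_o / d_e


# Toy ratings, invented for illustration: three images, perfect agreement.
alpha_perfect = krippendorff_alpha_interval([[1, 1], [3, 3], [5, 5]])
```

Values near 1 indicate that participants agree far more than pooled ratings would by chance, values near 0 indicate chance-level agreement, and negative values indicate systematic disagreement.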
Circularity Check
No circularity: empirical dataset creation and model training
The paper introduces a new human-annotated dataset (FoodSense) with numeric ratings and free-text descriptors for multisensory properties, expands annotations into LLM-generated reasoning traces, and trains a vision-language model (FoodSense-VL) on the resulting data. No mathematical derivations, equations, fitted parameters renamed as predictions, or self-citations that bear the central claim are present. The contribution is data collection, annotation expansion, and benchmarking rather than a closed theoretical chain that reduces to its own inputs by construction.
A. FoodSense Annotation Protocol
A.1. Task Design
Participants were shown one food image at a time and asked to evaluate four sensory dimensions: taste, smell, texture, and sound. For each dimension, participants completed two sub-tasks seq...
whenever the image provided insufficient visual cues.
Qualitative descriptor. After rating, participants were asked: "What do you think this food would sound like, taste like, smell like, and feel like (texture)? Please write one or two words for each sense." Representative responses include crispy, golden edges, smoky, and silent. This dual-format design captu...