Not Your Stereo-Typical Estimator: Combining Vision and Language for Volume Perception
Pith reviewed 2026-05-10 17:18 UTC · model grok-4.3
The pith
Combining stereo image features with text prompts containing object class and approximate volume improves estimation accuracy over vision-only methods.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The method extracts deep features from a stereo image pair and from a descriptive text prompt that includes the object's class and an approximate volume, projects these features into a unified multi-modal representation via a simple projection layer, and regresses volume from the combined output. The central claim is that this text-guided estimator significantly outperforms vision-only baselines on public datasets, showing that textual priors effectively direct the estimation task.
What carries the argument
A projection layer that integrates deep features from a stereo image pair with features from a text prompt into a single multi-modal representation used for volume regression.
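The abstract commits only to "a simple yet effective projection layer" over the two feature streams, so here is a minimal PyTorch sketch of one way that design could look. The module name `StereoTextFusion`, the feature dimensions, and the MLP regression head are illustrative assumptions, not the authors' released implementation (the GitLab link in the abstract below has the authoritative code).

```python
import torch
import torch.nn as nn

class StereoTextFusion(nn.Module):
    """Hypothetical fusion head: concatenate stereo and text features,
    project into a shared space, and regress a scalar volume."""

    def __init__(self, stereo_dim=1024, text_dim=768, fused_dim=512):
        super().__init__()
        # The 'simple projection layer': one learned linear map over
        # the concatenated multi-modal feature vector.
        self.projection = nn.Linear(stereo_dim + text_dim, fused_dim)
        self.regressor = nn.Sequential(
            nn.ReLU(),
            nn.Linear(fused_dim, fused_dim // 2),
            nn.ReLU(),
            nn.Linear(fused_dim // 2, 1),  # predicted volume (scalar)
        )

    def forward(self, stereo_feat, text_feat):
        # stereo_feat: (B, stereo_dim), text_feat: (B, text_dim)
        fused = self.projection(torch.cat([stereo_feat, text_feat], dim=-1))
        return self.regressor(fused).squeeze(-1)  # (B,)
```

The encoders are deliberately left out of the sketch: CLIP-style image features [30] and Sentence-BERT-style text embeddings [31] from the reference graph would be natural choices, but the abstract does not pin them down.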
If this is right
- The text-guided approach significantly outperforms vision-only baselines on public datasets.
- Simple textual priors can effectively guide the volume estimation task without complex 3D pipelines.
- The method supports development of more context-aware visual measurement systems.
- Applications in robotics, logistics, and smart health gain from reduced ambiguity in stereo or single-view data.
Where Pith is reading between the lines
- The same projection-layer fusion could extend to estimating other scalar properties such as mass or density if matching language priors are supplied.
- If text prompts are produced automatically by a separate object classifier, the system could run with little human input.
- Evaluating performance when text priors contain moderate noise would show how robust the fusion remains in realistic settings.
Load-bearing premise
The text prompt supplies an approximate volume that is accurate enough to act as a useful prior, and the fusion integrates it without letting the text dominate the visual features or reduce the output to a trivial refinement of the prompt's volume value.
What would settle it
Replace the volume value in the text prompt with a deliberately inaccurate number on the same test set and measure whether estimation error rises above the vision-only baseline; if error stays the same or drops, the claim that text acts as a guiding prior is falsified.
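A minimal sketch of that falsification run, assuming a trained fused `model` (as sketched above), an `encode_text` function for the prompt encoder, and a `test_set` yielding per-sample stereo features, object class, prompt volume, and ground-truth volume; every name and the prompt wording are hypothetical.

```python
import torch

def prior_corruption_test(model, encode_text, test_set, scale=5.0):
    """Re-run evaluation with the volume value in each prompt replaced by a
    deliberately wrong number (scaled by `scale`), and compare mean absolute
    error against the clean-prompt run."""
    errors_clean, errors_corrupt = [], []
    model.eval()
    with torch.no_grad():
        for stereo_feat, obj_class, approx_vol, true_vol in test_set:
            # stereo_feat: (1, D) feature tensor for a single test sample.
            clean = encode_text(f"a {obj_class} with volume about {approx_vol:.0f} mL")
            wrong = encode_text(f"a {obj_class} with volume about {approx_vol * scale:.0f} mL")
            errors_clean.append(abs(model(stereo_feat, clean).item() - true_vol))
            errors_corrupt.append(abs(model(stereo_feat, wrong).item() - true_vol))
    return (sum(errors_clean) / len(errors_clean),
            sum(errors_corrupt) / len(errors_corrupt))
```

If the corrupted-prompt error fails to rise above the vision-only baseline, the guiding-prior claim falls; if it balloons far past it, the model may be leaning on the prior more than on vision, which is the referee's trivial-refinement worry from the other direction.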
Original abstract
Accurate volume estimation of objects from visual data is a long-standing challenge in computer vision with significant applications in robotics, logistics, and smart health. Existing methods often rely on complex 3D reconstruction pipelines or struggle with the ambiguity inherent in single-view images. To address these limitations, we introduce a new method that fuses implicit 3D cues from stereo vision with explicit prior knowledge from natural language text. Our approach extracts deep features from a stereo image pair and a descriptive text prompt that contains the object's class and an approximate volume, then integrates them using a simple yet effective projection layer into a unified, multi-modal representation for regression. We conduct extensive experiments on public datasets demonstrating that our text-guided approach significantly outperforms vision-only baselines. Our findings show that leveraging even simple textual priors can effectively guide the volume estimation task, paving the way for more context-aware visual measurement systems. Code: https://gitlab.com/viper-purdue/stereo-typical-estimator.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes fusing deep features extracted from stereo image pairs with embeddings from natural language text prompts (containing object class and an approximate volume) via a projection layer to produce a multimodal representation for direct volume regression. It claims this text-guided approach significantly outperforms vision-only baselines on public datasets and demonstrates that even simple textual priors can effectively guide volume estimation.
Significance. If the central claim is supported by appropriate controls showing that the text prior is only approximate and that the fusion adds value beyond the prior alone, the work would provide evidence for practical multimodal guidance in visual measurement tasks, with potential impact in robotics and logistics. The availability of a public code repository strengthens the contribution by enabling reproducibility.
Major comments (3)
- Abstract: the claim of 'significant outperformance' on public datasets is presented without any quantitative metrics, ablation results, or error analysis, leaving the headline result without visible empirical grounding in the manuscript text.
- Method section (projection layer description): the integration is characterized only as 'a simple yet effective projection layer' with no equations or architectural details showing how stereo features remain influential when the text prompt already encodes an approximate volume; this leaves open the possibility that the model reduces to a trivial refinement of the text prior.
- Experiments section: to substantiate that textual priors 'guide' rather than replace vision, the paper requires (a) a text-only ablation whose performance is substantially worse than the fused model and (b) explicit reporting of how approximate the volume values in the prompts actually are; neither control is described.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed comments, which highlight important areas for improving the clarity and rigor of our presentation. We address each major comment below and commit to revisions that will strengthen the empirical support and methodological transparency without altering the core contributions.
Point-by-point responses
- Referee: Abstract: the claim of 'significant outperformance' on public datasets is presented without any quantitative metrics, ablation results, or error analysis, leaving the headline result without visible empirical grounding in the manuscript text.
  Authors: We agree that the abstract would benefit from more concrete grounding. In the revised version we will insert the key quantitative results (e.g., relative reductions in mean absolute error and root-mean-square error versus the strongest vision-only baseline) together with a brief reference to the ablation studies reported in Section 4. Revision: yes.
- Referee: Method section (projection layer description): the integration is characterized only as 'a simple yet effective projection layer' with no equations or architectural details showing how stereo features remain influential when the text prompt already encodes an approximate volume; this leaves open the possibility that the model reduces to a trivial refinement of the text prior.
  Authors: We acknowledge the description is currently high-level. The projection layer is a learned linear transformation followed by a non-linear fusion that preserves the contribution of the stereo feature vector; we will add the explicit equations for the concatenation, projection matrix, and subsequent regression head so that readers can verify the stereo features are not overridden by the text prior. Revision: yes.
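Formalizing the rebuttal's description in explicit notation (our reading, not the paper's published equations): with stereo feature vector f_s, text embedding f_t, projection matrix W, bias b, and non-linearity sigma, a plausible form is

```latex
\mathbf{z} = W\,[\mathbf{f}_s ; \mathbf{f}_t] + \mathbf{b}, \qquad
\hat{v} = \mathrm{MLP}\bigl(\sigma(\mathbf{z})\bigr)
```

where [ · ; · ] denotes concatenation and v-hat is the regressed volume. Whether stereo features stay influential then reduces to inspecting the learned columns of W that act on f_s, which is exactly what the promised equations would let readers check.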
- Referee: Experiments section: to substantiate that textual priors 'guide' rather than replace vision, the paper requires (a) a text-only ablation whose performance is substantially worse than the fused model and (b) explicit reporting of how approximate the volume values in the prompts actually are; neither control is described.
  Authors: We agree these controls are necessary. We will add a text-only baseline (using only the language embedding) to Table 2 and report its error relative to the fused model. We will also include a new paragraph and supplementary table quantifying the approximation-error distribution of the volume values supplied in the prompts (mean and standard deviation of the difference from ground-truth volumes) to demonstrate that they function as coarse priors. Revision: yes.
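A sketch of the second control, assuming paired arrays of prompt-supplied volumes and ground-truth volumes are available from the dataset annotations; the function name and fields are illustrative, not from the paper's code.

```python
import numpy as np

def prior_error_stats(prompt_vols, true_vols):
    """Mean and standard deviation of the prompt volumes' error against
    ground truth (the statistics the rebuttal commits to reporting),
    plus relative error for a scale-free reading."""
    prompt_vols = np.asarray(prompt_vols, dtype=float)
    true_vols = np.asarray(true_vols, dtype=float)
    diff = prompt_vols - true_vols  # signed error in volume units
    rel = diff / true_vols          # relative (fractional) error
    return {
        "mean_diff": diff.mean(),
        "std_diff": diff.std(),
        "mean_rel_err": rel.mean(),
        "std_rel_err": rel.std(),
    }
```

If the relative-error spread turns out to be tiny, the "approximate" prior is close to an answer key, and the text-only ablation becomes the decisive comparison.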
Circularity Check
No circularity: multimodal fusion remains independent of input priors
Full rationale
The paper describes a fusion architecture that ingests stereo image pairs and an independent text prompt (containing class and approximate volume) and produces a regressed volume via a projection layer. No equations or steps in the abstract or described method reduce the output to a direct function of the text prior by construction, nor do any self-citations serve as load-bearing uniqueness theorems. Experiments on public datasets are presented as external validation, keeping the central claim self-contained against the inputs.
Axiom & Free-Parameter Ledger
Free parameters (1)
- Projection layer parameters
Axioms (2)
- Domain assumption: Stereo image pairs encode usable implicit 3D cues for volume tasks.
- Domain assumption: Natural-language text can supply reliable approximate priors for visual regression.
Reference graph
Works this paper leans on
- [1] Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al. Flamingo: A visual language model for few-shot learning. Advances in Neural Information Processing Systems, 35:23716–23736, 2022.
- [2] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877–1901, 2020.
- [3] Yuhao Chen, Jiangpeng He, Chris Czarnecki, Gautham Vinod, Talha Ibn Mahmud, Siddeshwar Raghavan, Jinge Ma, Dayou Mao, Saeejith Nair, Pengcheng Xi, et al. MetaFood3D: Large 3D food object dataset with nutrition values. arXiv preprint arXiv:2409.01966, 2024.
- [4] Joachim Dehais, Marios Anthimopoulos, Sergey Shevchik, and Stavroula Mougiakakou. Two-view 3D reconstruction for food volume estimation. IEEE Transactions on Multimedia, 19(5):1090–1099, 2017.
- [5] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, 2019.
- [6] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. Proceedings of the International Conference on Learning Representations, 2021.
- [7] Jakob Engel, Kiran Somasundaram, Michael Goesele, Albert Sun, Alexander Gamino, Andrew Turner, Arjang Talattof, Arnie Yuan, Bilal Souti, Brighid Meredith, et al. Project Aria: A new tool for egocentric multi-modal AI research. arXiv preprint arXiv:2308.13561, 2023.
- [8] Shaobo Fang, Fengqing Zhu, Chufan Jiang, Song Zhang, Carol J. Boushey, and Edward J. Delp. A comparison of food portion size estimation using geometric models and depth images. Proceedings of the 2016 IEEE International Conference on Image Processing (ICIP), pages 26–30, 2016.
- [9] Andreas Geiger, Philip Lenz, and Raquel Urtasun. Are we ready for autonomous driving? The KITTI vision benchmark suite. Proceedings of the 2012 IEEE Conference on Computer Vision and Pattern Recognition, pages 3354–3361, 2012.
- [10] Yuzhe Han, Qimin Cheng, Wenjin Wu, and Ziyang Huang. DPF-Nutrition: Food nutrition estimation via depth prediction and fusion. Foods, 12(23):4293, 2023.
- [11] Jiangpeng He, Yuhao Chen, Gautham Vinod, Talha Ibn Mahmud, Fengqing Zhu, Edward Delp, Alexander Wong, Pengcheng Xi, Ahmad AlMughrabi, Umair Haroon, et al. MetaFood CVPR 2024 challenge on physically informed 3D food reconstruction: Methods and results. arXiv preprint arXiv:2407.09285, 2024.
- [12] Mu Hu, Wei Yin, Chi Zhang, Zhipeng Cai, Xiaoxiao Long, Hao Chen, Kaixuan Wang, Gang Yu, Chunhua Shen, and Shaojie Shen. Metric3D v2: A versatile monocular geometric foundation model for zero-shot metric depth and surface normal estimation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024.
- [13] Bernhard Kerbl, Georgios Kopanas, Thomas Leimkühler, and George Drettakis. 3D Gaussian splatting for real-time radiance field rendering. ACM Transactions on Graphics, 42(4), Article 139, 2023.
- [14] Fotis Konstantakopoulos, Eleni I. Georga, and Dimitrios I. Fotiadis. 3D reconstruction and volume estimation of food using stereo vision techniques. 2021 IEEE 21st International Conference on Bioinformatics and Bioengineering (BIBE), pages 1–4, 2021.
- [15] Zhengyi Kwan, Wei Zhang, Zhengkui Wang, Aik Beng Ng, and Simon See. Nutrition estimation for dietary management: A transformer approach with depth sensing. IEEE Transactions on Multimedia, pages 1–13, 2025.
- [16] Philipp Lindenberger, Paul-Edouard Sarlin, and Marc Pollefeys. LightGlue: Local feature matching at light speed. Proceedings of the 2023 International Conference on Computer Vision (ICCV), 2023.
- [17] Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. Advances in Neural Information Processing Systems, 36:34892–34916, 2023.
- [18] Jianguo Liu, Jayaram K. Udupa, Dewey Odhner, David Hackney, and Gul Moonis. A system for brain tumor volume estimation via MR imaging and fuzzy connectedness. Computerized Medical Imaging and Graphics, 29(1):21–34, 2005.
- [19] David G. Lowe. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60(2):91–110, 2004.
- [20] Zhaoyang Lv, Nicholas Charron, Pierre Moulon, Alexander Gamino, Cheng Peng, Chris Sweeney, Edward Miller, Huixuan Tang, Jeff Meissner, Jing Dong, et al. Aria Everyday Activities dataset. arXiv preprint arXiv:2402.13349, 2024.
- [21] Jinge Ma, Xiaoyan Zhang, Gautham Vinod, Siddeshwar Raghavan, Jiangpeng He, and Fengqing Zhu. MFP3D: Monocular food portion estimation leveraging 3D point clouds. arXiv preprint arXiv:2411.10492, 2024.
- [22] Ben Mildenhall, Pratul P. Srinivasan, et al. NeRF: Representing scenes as neural radiance fields for view synthesis. Proceedings of the 2020 European Conference on Computer Vision, pages 405–421, 2020.
- [23] Janice B. Montville, Jaspreet K. C. Ahuja, Carrie L. Martin, Kaushalya Y. He, Grace Omolewa-Tomobi, Lois C. Steinfeldt, Jaswinder Anand, Meghan E. Adler, Randy P. LaComb, and Alanna Moshfegh. USDA Food and Nutrient Database for Dietary Studies (FNDDS), 5.0. Procedia Food Science, 2:99–112, 2013.
- [24] Andreas Nüchter. 3D Robotic Mapping: The Simultaneous Localization and Mapping Problem with Six Degrees of Freedom. Springer, 2008.
- [25] OpenAI. GPT API, 2023. Accessed: Sep. 13, 2024.
- [26] OpenAI. GPT-5 system card. System card, OpenAI, 2025. Published August 7, 2025.
- [27] Xiaqing Pan, Nicholas Charron, Yongqian Yang, Scott Peters, Thomas Whelan, Chen Kong, Omkar Parkhi, Richard Newcombe, and Yuheng Carl Ren. Aria Digital Twin: A new benchmark dataset for egocentric 3D machine perception. Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 20133–20143, 2023.
- [28] Toby Perrett, Ahmad Darkhalil, Saptarshi Sinha, Omar Emara, Sam Pollard, Kranti Kumar Parida, Kaiting Liu, Prajwal Gatti, Siddhant Bansal, Kevin Flanagan, et al. HD-EPIC: A highly-detailed egocentric video dataset. Proceedings of the Computer Vision and Pattern Recognition Conference, pages 23901–23913, 2025.
- [29] Manika Puri, Zhiwei Zhu, Qian Yu, Ajay Divakaran, and Harpreet Sawhney. Recognition and volume estimation of food intake using a mobile device. Proceedings of the 2009 Workshop on Applications of Computer Vision (WACV), pages 1–8, 2009.
- [30] Alec Radford, Jong Wook Kim, et al. Learning transferable visual models from natural language supervision. Proceedings of the 2021 International Conference on Machine Learning, 2021.
- [31] Nils Reimers and Iryna Gurevych. Sentence-BERT: Sentence embeddings using Siamese BERT-networks. Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing, 2019.
- [32] Ethan Rublee, Vincent Rabaud, Kurt Konolige, and Gary Bradski. ORB: An efficient alternative to SIFT or SURF. 2011 International Conference on Computer Vision, pages 2564–2571, 2011.
- [33] Johannes Lutz Schönberger and Jan-Michael Frahm. Structure-from-motion revisited. Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
- [34] Johannes Lutz Schönberger, Enliang Zheng, Marc Pollefeys, and Jan-Michael Frahm. Pixelwise view selection for unstructured multi-view stereo. European Conference on Computer Vision (ECCV), 2016.
- [35] Kaitao Song, Xu Tan, Tao Qin, Jianfeng Lu, and Tie-Yan Liu. MPNet: Masked and permuted pre-training for language understanding. Advances in Neural Information Processing Systems, 33:16857–16867, 2020.
- [36] Quin Thames, Arjun Karpur, Wade Norris, Fangting Xia, Liviu Panait, Tobias Weyand, and Jack Sim. Nutrition5k: Towards automatic nutritional understanding of generic food. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8903–8911, 2021.
- [37] Thai Thanh Tuan, Matiur Rahman Minar, Heejune Ahn, and John Wainwright. Multiple pose virtual try-on based on 3D clothing reconstruction. IEEE Access, 9:114367–114380, 2021.
- [38] Gautham Vinod and Fengqing Zhu. Food portion estimation: From pixels to calories. arXiv preprint arXiv:2602.05078, 2026.
- [39] Gautham Vinod, Zeman Shao, and Fengqing Zhu. Image based food energy estimation with depth domain adaptation. Proceedings of the 2022 IEEE 5th International Conference on Multimedia Information Processing and Retrieval (MIPR), pages 262–267, 2022.
- [40] Gautham Vinod, Jiangpeng He, Zeman Shao, and Fengqing Zhu. Food portion estimation via 3D object scaling. Proceedings of the 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pages 3741–3749, 2024.
- [41] Gautham Vinod, Bruce Coburn, Siddeshwar Raghavan, Jiangpeng He, and Fengqing Zhu. Size matters: Reconstructing real-scale 3D models from monocular images for food portion estimation. arXiv preprint arXiv:2601.20051, 2026.
- [42] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Eric Chi, Quoc Le, and Denny Zhou. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35:24824–24837, 2022.
- [43] Tong Wu, Jiarui Zhang, Xiao Fu, Yuxin Wang, Jiawei Ren, Liang Pan, Wayne Wu, Lei Yang, Jiaqi Wang, Chen Qian, et al. OmniObject3D: Large-vocabulary 3D object dataset for realistic perception, reconstruction and generation. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 803–814, 2023.
- [44] Chang Xu, Ye He, Nitin Khanna, Carol J. Boushey, and Edward J. Delp. Model-based food volume estimation using 3D pose. Proceedings of the 2013 IEEE International Conference on Image Processing, pages 2534–2538, 2013.
- [45] Yun Zi, Qi Wang, Zijun Gao, Xiaohan Cheng, and Taiyuan Mei. Research on the application of deep learning in medical image segmentation and 3D reconstruction. Academic Journal of Science and Technology, 10(2):8–12, 2024.
Supplementary material excerpts
GPT-5 experiment prompts. For our experiments with GPT-5, the prompts are designed to ensure clarity in how the model is instructed and what data it receives. We used two different prompting structures depending on whether monocular (single image) or stereo (two images) inputs were used for volume estimation. Single Image Prompt: The single-image prompt asks the mo...
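The excerpt truncates before the prompts themselves. Purely to illustrate the two structures it describes, hypothetical single-image and stereo templates might read as follows; this is not the authors' actual wording.

```python
# Hypothetical prompt templates -- illustrative only, not the paper's prompts.
SINGLE_IMAGE_PROMPT = (
    "You are given one RGB image of a {object_class}. "
    "Estimate the object's volume in milliliters and answer with a single number."
)

STEREO_PROMPT = (
    "You are given left and right images of the same {object_class} from a "
    "calibrated stereo pair. Using the disparity between the views, estimate "
    "the object's volume in milliliters and answer with a single number."
)
```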
Generalizability. To evaluate our model's ability to generalize to unseen object categories, we performed experiments on the MetaFood3D dataset [3]. We used a random train-test split, resulting in 415 training and 104 testing samples, which ensures that the test set contains categories absent from training. Given the dataset's limited size, strong ge...
Additional error visualizations. In Figure 6, we perform the same error analysis for OmniObject as in Figure 5 of the main paper. We similarly observe that our method outperforms all other monocular methods, with more favorable absolute error and absolute percentage error distributions. Figure 6: Error distribution of volume estimation meth...