Fake or Real, Can Robots Tell? Evaluating VLM Robustness to Domain Shift in Single-View Robotic Scene Understanding
Pith reviewed 2026-05-19 08:00 UTC · model grok-4.3
The pith
VLMs describe real objects well but degrade on 3D-printed replicas in robot scenes.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
While VLMs describe common real-world objects effectively, performance degrades markedly on 3D-printed items despite their structurally familiar forms. Some standard evaluation metrics fail to detect the domain shift entirely, and others reward fluent but factually incorrect captions.
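To make the second failure mode concrete, here is a toy sketch of how a word-overlap score can prefer a fluent caption that names the wrong object over a correct paraphrase. It is not the paper's evaluation code: the function `clipped_unigram_precision` is a hypothetical, simplified stand-in for n-gram metrics, and the three captions are invented for illustration.

```python
from collections import Counter


def clipped_unigram_precision(candidate: str, reference: str) -> float:
    """Fraction of candidate tokens that also appear in the reference.

    Each reference token can be matched at most as often as it occurs
    (clipping), in the spirit of unigram precision in n-gram metrics.
    """
    cand_tokens = candidate.lower().split()
    if not cand_tokens:
        return 0.0
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(cand_tokens)
    overlap = sum(min(count, ref_counts[tok]) for tok, count in cand_counts.items())
    return overlap / len(cand_tokens)


# Invented captions (assumption): the reference describes a 3D-printed wrench.
reference = "a red plastic wrench lying on a wooden table"
fluent_wrong = "a red plastic hammer lying on a wooden table"    # wrong object
correct_paraphrase = "wrench made of red plastic sitting on the desk"  # right object

score_wrong = clipped_unigram_precision(fluent_wrong, reference)       # 8 of 9 tokens match
score_right = clipped_unigram_precision(correct_paraphrase, reference)  # 4 of 9 tokens match

print(f"fluent but wrong:   {score_wrong:.2f}")
print(f"correct paraphrase: {score_right:.2f}")
```

In this toy case the caption with the wrong object scores higher, because only one token differs from the reference. A score that checks the object identity directly would rank the two captions the other way round.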
What carries the argument
Controlled physical domain shift using real-world tools paired with geometrically similar 3D-printed counterparts that differ in texture, colour, and material.
If this is right
- Robotic systems using VLMs for scene description must add safeguards against changes in object surface properties.
- Evaluation protocols for robotic VLMs should include controlled physical domain shifts to remain reliable.
- More robust model architectures are needed for embodied agents that encounter varied real-world materials.
Where Pith is reading between the lines
- Robots in homes or factories may encounter similar captioning problems with manufactured or altered items.
- Training VLMs on mixed real and synthetic surface data could reduce the observed drops.
- Current benchmarks may overestimate readiness for physical deployment until metrics improve.
Load-bearing premise
The specific texture, colour, and material differences between real objects and 3D-printed replicas are representative of the domain shifts that robots meet in practical applications.
What would settle it
A follow-up test on new object sets or additional VLMs would challenge the degradation claim if it found equal caption quality for real and 3D-printed items, and would challenge the metric claim if the standard metrics reliably flagged the shift.
Original abstract
Robotic scene understanding increasingly relies on Vision-Language Models (VLMs) to generate natural language descriptions of the environment. In this work, we systematically evaluate single-view object captioning for tabletop scenes captured by a robotic manipulator, introducing a controlled physical domain shift that contrasts real-world tools with geometrically similar 3D-printed counterparts that differ in texture, colour, and material. We benchmark a suite of state-of-the-art, locally deployable VLMs across multiple metrics to assess semantic alignment and factual grounding. Our results demonstrate that while VLMs describe common real-world objects effectively, performance degrades markedly on 3D-printed items despite their structurally familiar forms. We further expose critical vulnerabilities in standard evaluation metrics, showing that some fail to detect domain shifts entirely or reward fluent but factually incorrect captions. These findings highlight the limitations of deploying foundation models for embodied agents and the need for more robust architectures and evaluation protocols in physical robotic applications.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript evaluates the robustness of locally deployable Vision-Language Models (VLMs) for single-view object captioning in robotic tabletop scenes. It introduces a controlled physical domain shift by comparing real-world objects to geometrically similar 3D-printed counterparts that differ only in texture, colour, and material, then benchmarks multiple state-of-the-art VLMs across standard metrics for semantic alignment and factual grounding. The central claims are that VLMs perform well on real objects but degrade markedly on the 3D-printed versions, and that several common metrics fail to detect the domain shift or reward fluent but factually incorrect captions.
Significance. If the quantitative results hold, the work is significant for the robotics and embodied AI community. It provides a clean, physically grounded testbed that isolates appearance-based domain shift while preserving geometry, directly relevant to manipulator-based scene understanding. The demonstration of metric vulnerabilities is a useful contribution that could inform better evaluation protocols. The study also supplies concrete evidence that current foundation models may require additional robustness techniques before reliable deployment on physical robots.
major comments (2)
- [Discussion] The central implication that the observed degradation indicates limitations for 'embodied agents' and 'real robotic applications' rests on the assumption that a texture/colour/material shift with fixed geometry is representative of the domain shifts robots actually encounter. Real deployments typically combine this factor with simultaneous changes in illumination, viewpoint, partial occlusion, and sensor noise (see §1 and §5). The paper should either expand the experimental design to include at least one additional shift factor or qualify the deployment claims in the discussion and conclusion.
- [Results] Results section: the headline finding of 'marked degradation' on 3D-printed items is load-bearing for the robustness claim, yet the abstract and summary provide no numerical scores, model names, dataset sizes, or statistical tests. Please ensure the results tables report per-metric deltas (e.g., CIDEr or SPICE drop from real to printed) with confidence intervals or significance tests so readers can judge effect size.
minor comments (2)
- The abstract would be strengthened by including at least one concrete quantitative result (e.g., average metric drop or number of models/datasets) to allow readers to gauge the magnitude of the reported degradation and metric failures.
- Figure 1 or the experimental setup figure: add a side-by-side caption or inset that explicitly labels the real vs. 3D-printed versions of the same object so the controlled nature of the shift is immediately visible.
Simulated Author's Rebuttal
We thank the referee for their constructive comments, which help clarify the scope and presentation of our work. We respond to each major comment below and indicate the revisions we will make.
Point-by-point responses
- Referee: [Discussion] The central implication that the observed degradation indicates limitations for 'embodied agents' and 'real robotic applications' rests on the assumption that a texture/colour/material shift with fixed geometry is representative of the domain shifts robots actually encounter. Real deployments typically combine this factor with simultaneous changes in illumination, viewpoint, partial occlusion, and sensor noise (see §1 and §5). The paper should either expand the experimental design to include at least one additional shift factor or qualify the deployment claims in the discussion and conclusion.
  Authors: We agree that real robotic deployments involve compound domain shifts. Our study deliberately isolates the appearance-based shift (texture, colour, material) while holding geometry constant to create a clean, physically grounded testbed. This controlled design allows us to attribute performance drops specifically to visual domain shift rather than confounding geometric or viewpoint changes. In the revised manuscript we will qualify the deployment claims in both the discussion and conclusion, explicitly stating that the observed degradation demonstrates vulnerability to appearance shifts and that additional robustness measures will likely be required when such shifts co-occur with illumination, viewpoint, or occlusion changes typical of physical robot operation.
  Revision: yes
- Referee: [Results] Results section: the headline finding of 'marked degradation' on 3D-printed items is load-bearing for the robustness claim, yet the abstract and summary provide no numerical scores, model names, dataset sizes, or statistical tests. Please ensure the results tables report per-metric deltas (e.g., CIDEr or SPICE drop from real to printed) with confidence intervals or significance tests so readers can judge effect size.
  Authors: The results section and accompanying tables already list per-model scores on both real and 3D-printed objects for all metrics (CIDEr, SPICE, etc.) together with the number of scenes and objects evaluated. To improve readability and effect-size assessment we will add explicit per-metric delta columns (real minus printed) and include 95% confidence intervals computed via bootstrap resampling. We will also report paired statistical significance tests (Wilcoxon signed-rank) between real and printed conditions for each metric and model. These additions will be placed in the main results tables and referenced in the text.
  Revision: yes
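The per-metric delta and bootstrap interval described in the response above could be computed roughly as follows. This is a minimal sketch, not the authors' code: the function name `paired_bootstrap_ci`, the percentile method, the resample count, the seed, and the example scores are all assumptions made for illustration.

```python
import random
import statistics


def paired_bootstrap_ci(real_scores, printed_scores,
                        n_resamples=2000, alpha=0.05, seed=0):
    """Mean paired delta (real minus printed) with a percentile bootstrap CI.

    Scores are paired per object or scene; pairs are resampled with
    replacement so the pairing between conditions is preserved.
    """
    if len(real_scores) != len(printed_scores):
        raise ValueError("real and printed scores must be paired")
    if not real_scores:
        raise ValueError("need at least one pair")

    deltas = [r - p for r, p in zip(real_scores, printed_scores)]
    rng = random.Random(seed)
    n = len(deltas)

    # Resample the paired deltas and record the mean of each resample.
    resampled_means = []
    for _ in range(n_resamples):
        sample = [deltas[rng.randrange(n)] for _ in range(n)]
        resampled_means.append(statistics.fmean(sample))
    resampled_means.sort()

    # Percentile interval: cut alpha/2 of the resamples from each tail.
    lower = resampled_means[int((alpha / 2) * n_resamples)]
    upper = resampled_means[int((1 - alpha / 2) * n_resamples) - 1]
    return statistics.fmean(deltas), lower, upper


# Made-up per-object metric scores for one model (assumption, not paper data).
real = [0.9, 0.8, 0.85, 0.95, 0.7]
printed = [0.5, 0.6, 0.45, 0.65, 0.4]

mean_delta, ci_low, ci_high = paired_bootstrap_ci(real, printed)
print(f"delta = {mean_delta:.3f}, 95% CI = [{ci_low:.3f}, {ci_high:.3f}]")
```

An interval that excludes zero suggests the drop is unlikely to be resampling noise. For the paired significance test the authors name, SciPy's `scipy.stats.wilcoxon` could be applied to the same paired score lists.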
Circularity Check
Empirical benchmarking study with no derivations, fitted parameters, or self-referential predictions.
full rationale
The paper is a controlled empirical evaluation of off-the-shelf VLMs on a custom dataset of real vs. geometrically matched 3D-printed objects. It reports observed captioning performance and metric behaviors without any derivation chain, parameter fitting, or 'prediction' steps that reduce to the inputs by construction. Central claims rest on direct model outputs and standard metrics applied to the collected scenes. No load-bearing self-citations, uniqueness theorems, or ansatzes are invoked. The representativeness of the texture/material shift for broader robotic domain shifts is a question of external validity, not an internal circularity in any claimed derivation.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Standard VLM evaluation metrics can measure semantic alignment and factual grounding in object captions.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "controlled physical domain shift that contrasts real-world tools with geometrically similar 3D-printed counterparts that differ in texture, colour, and material"
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "performance degrades markedly on 3D-printed items despite their structurally familiar forms"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.