arXiv: 2506.19579 · v3 · submitted 2025-06-24 · 💻 cs.RO · cs.AI · cs.CL · cs.CV · cs.LG

Fake or Real, Can Robots Tell? Evaluating VLM Robustness to Domain Shift in Single-View Robotic Scene Understanding

Pith reviewed 2026-05-19 08:00 UTC · model grok-4.3

classification 💻 cs.RO · cs.AI · cs.CL · cs.CV · cs.LG
keywords vision-language models · domain shift · robotic scene understanding · 3D-printed objects · object captioning · evaluation metrics · embodied agents

The pith

VLMs describe real objects well but degrade on 3D-printed replicas in robot scenes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper evaluates vision-language models on single-view object captioning for tabletop scenes captured by a robotic arm. It creates a test using real tools next to 3D-printed versions that match in shape but differ in texture, colour and material. Models handle common real items competently yet produce less accurate captions for the printed ones. Standard automatic metrics often miss the performance drop or assign high scores to fluent but wrong descriptions. The results point to limits in deploying current models for physical robot tasks where object appearances vary.

Core claim

While VLMs describe common real-world objects effectively, performance degrades markedly on 3D-printed items despite their structurally familiar forms. Standard evaluation metrics fail to detect domain shifts entirely or reward fluent but factually incorrect captions.
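
The second half of this claim can be illustrated with a toy example. The sketch below uses a plain unigram-overlap F1 score as a stand-in for surface-overlap metrics; it is not a metric taken from the paper, and the captions are invented for illustration. It shows how a fluent caption that names the wrong object can outscore a terse caption that names the right one.

```python
from collections import Counter


def unigram_f1(candidate: str, reference: str) -> float:
    """Toy unigram-overlap F1 between two captions (illustrative only)."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    # Clipped overlap: each token counts at most as often as it appears in both.
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


# Hypothetical captions, not taken from the paper's dataset.
reference = "a grey 3d-printed wrench lying on the table"
fluent_but_wrong = "a grey 3d-printed hammer lying on the table"
terse_but_right = "wrench"

wrong_score = unigram_f1(fluent_but_wrong, reference)
right_score = unigram_f1(terse_but_right, reference)

print(f"fluent but wrong: {wrong_score:.3f}")
print(f"terse but right:  {right_score:.3f}")
```

Because seven of the eight tokens match, the caption with the wrong object scores far higher than the correct one-word answer, which is the kind of failure the core claim describes.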

What carries the argument

Controlled physical domain shift using real-world tools paired with geometrically similar 3D-printed counterparts that differ only in texture, colour, and material.

If this is right

  • Robotic systems using VLMs for scene description must add safeguards against changes in object surface properties.
  • Evaluation protocols for robotic VLMs should include controlled physical domain shifts to remain reliable.
  • More robust model architectures are needed for embodied agents that encounter varied real-world materials.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Robots in homes or factories may encounter similar captioning problems with manufactured or altered items.
  • Training VLMs on mixed real and synthetic surface data could reduce the observed drops.
  • Current benchmarks may overestimate readiness for physical deployment until metrics improve.

Load-bearing premise

The specific texture, colour, and material differences between real objects and 3D-printed replicas are representative of the domain shifts that robots meet in practical applications.

What would settle it

A follow-up test on new object sets or additional VLMs that found equal caption quality for real and 3D-printed items would challenge the degradation claim; one showing that standard metrics reliably flag the shift would challenge the claim about metric failures.

Figures

Figures reproduced from arXiv: 2506.19579 by Amber Drinkwater, Angelo Cangelosi, Federico Tavella.

Figure 1: Our experimental setup: the Franka Emika Research 3 is equipped with …
Figure 2: Given a set of input images depicting each several objects from multiple …
Figure 3: Our test objects. Each set is composed of 10 different objects.
Figure 4: Distribution of scores for each VLM. Dots with no transparency and black …
Figure 5: For each object on the real set and for each VLM, we calculated the …
Figure 6: For each object on the 3D printed set and for each VLM, we calculated the …
Original abstract

Robotic scene understanding increasingly relies on Vision-Language Models (VLMs) to generate natural language descriptions of the environment. In this work, we systematically evaluate single-view object captioning for tabletop scenes captured by a robotic manipulator, introducing a controlled physical domain shift that contrasts real-world tools with geometrically similar 3D-printed counterparts that differ in texture, colour, and material. We benchmark a suite of state-of-the-art, locally deployable VLMs across multiple metrics to assess semantic alignment and factual grounding. Our results demonstrate that while VLMs describe common real-world objects effectively, performance degrades markedly on 3D-printed items despite their structurally familiar forms. We further expose critical vulnerabilities in standard evaluation metrics, showing that some fail to detect domain shifts entirely or reward fluent but factually incorrect captions. These findings highlight the limitations of deploying foundation models for embodied agents and the need for more robust architectures and evaluation protocols in physical robotic applications.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript evaluates the robustness of locally deployable Vision-Language Models (VLMs) for single-view object captioning in robotic tabletop scenes. It introduces a controlled physical domain shift by comparing real-world objects to geometrically similar 3D-printed counterparts that differ only in texture, colour, and material, then benchmarks multiple state-of-the-art VLMs across standard metrics for semantic alignment and factual grounding. The central claims are that VLMs perform well on real objects but degrade markedly on the 3D-printed versions, and that several common metrics fail to detect the domain shift or reward fluent but factually incorrect captions.

Significance. If the quantitative results hold, the work is significant for the robotics and embodied AI community. It provides a clean, physically grounded testbed that isolates appearance-based domain shift while preserving geometry, directly relevant to manipulator-based scene understanding. The demonstration of metric vulnerabilities is a useful contribution that could inform better evaluation protocols. The study also supplies concrete evidence that current foundation models may require additional robustness techniques before reliable deployment on physical robots.

major comments (2)
  1. [Discussion] The central implication that the observed degradation indicates limitations for 'embodied agents' and 'real robotic applications' rests on the assumption that a texture/colour/material shift with fixed geometry is representative of the domain shifts robots actually encounter. Real deployments typically combine this factor with simultaneous changes in illumination, viewpoint, partial occlusion, and sensor noise (see §1 and §5). The paper should either expand the experimental design to include at least one additional shift factor or qualify the deployment claims in the discussion and conclusion.
  2. [Results] Results section: the headline finding of 'marked degradation' on 3D-printed items is load-bearing for the robustness claim, yet the abstract and summary provide no numerical scores, model names, dataset sizes, or statistical tests. Please ensure the results tables report per-metric deltas (e.g., CIDEr or SPICE drop from real to printed) with confidence intervals or significance tests so readers can judge effect size.
minor comments (2)
  1. The abstract would be strengthened by including at least one concrete quantitative result (e.g., average metric drop or number of models/datasets) to allow readers to gauge the magnitude of the reported degradation and metric failures.
  2. Figure 1 or the experimental setup figure: add a side-by-side caption or inset that explicitly labels the real vs. 3D-printed versions of the same object so the controlled nature of the shift is immediately visible.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments, which help clarify the scope and presentation of our work. We respond to each major comment below and indicate the revisions we will make.

Point-by-point responses
  1. Referee: [Discussion] The central implication that the observed degradation indicates limitations for 'embodied agents' and 'real robotic applications' rests on the assumption that a texture/colour/material shift with fixed geometry is representative of the domain shifts robots actually encounter. Real deployments typically combine this factor with simultaneous changes in illumination, viewpoint, partial occlusion, and sensor noise (see §1 and §5). The paper should either expand the experimental design to include at least one additional shift factor or qualify the deployment claims in the discussion and conclusion.

    Authors: We agree that real robotic deployments involve compound domain shifts. Our study deliberately isolates the appearance-based shift (texture, colour, material) while holding geometry constant to create a clean, physically grounded testbed. This controlled design allows us to attribute performance drops specifically to visual domain shift rather than confounding geometric or viewpoint changes. In the revised manuscript we will qualify the deployment claims in both the discussion and conclusion, explicitly stating that the observed degradation demonstrates vulnerability to appearance shifts and that additional robustness measures will likely be required when such shifts co-occur with illumination, viewpoint, or occlusion changes typical of physical robot operation. revision: yes

  2. Referee: [Results] Results section: the headline finding of 'marked degradation' on 3D-printed items is load-bearing for the robustness claim, yet the abstract and summary provide no numerical scores, model names, dataset sizes, or statistical tests. Please ensure the results tables report per-metric deltas (e.g., CIDEr or SPICE drop from real to printed) with confidence intervals or significance tests so readers can judge effect size.

    Authors: The results section and accompanying tables already list per-model scores on both real and 3D-printed objects for all metrics (CIDEr, SPICE, etc.) together with the number of scenes and objects evaluated. To improve readability and effect-size assessment we will add explicit per-metric delta columns (real minus printed) and include 95% confidence intervals computed via bootstrap resampling. We will also report paired statistical significance tests (Wilcoxon signed-rank) between real and printed conditions for each metric and model. These additions will be placed in the main results tables and referenced in the text. revision: yes
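
The delta-with-confidence-interval reporting promised in this response can be sketched as follows. This is a minimal percentile-bootstrap illustration under stated assumptions, not the authors' analysis code: the function name `paired_bootstrap_ci` is hypothetical, and the ten scores per condition are invented placeholders (ten matches the per-set object count in Figure 3, but the values are not from the paper).

```python
import random
import statistics


def paired_bootstrap_ci(real, printed, n_resamples=5000, alpha=0.05, seed=0):
    """Mean paired delta (real minus printed) with a percentile bootstrap CI."""
    if len(real) != len(printed):
        raise ValueError("real and printed scores must be paired one-to-one")
    deltas = [r - p for r, p in zip(real, printed)]
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    means = []
    for _ in range(n_resamples):
        # Resample the paired deltas with replacement, keeping pairs intact.
        sample = rng.choices(deltas, k=len(deltas))
        means.append(statistics.fmean(sample))
    means.sort()
    low = means[int((alpha / 2) * n_resamples)]
    high = means[int((1 - alpha / 2) * n_resamples) - 1]
    return statistics.fmean(deltas), low, high


# Invented per-object scores for one model and one metric (illustrative only).
real_scores = [0.82, 0.75, 0.91, 0.68, 0.79, 0.88, 0.73, 0.85, 0.77, 0.80]
printed_scores = [0.61, 0.70, 0.66, 0.59, 0.72, 0.64, 0.69, 0.60, 0.71, 0.58]

mean_delta, ci_low, ci_high = paired_bootstrap_ci(real_scores, printed_scores)
print(f"mean delta = {mean_delta:.3f}, 95% CI = [{ci_low:.3f}, {ci_high:.3f}]")
```

An interval that excludes zero would indicate a drop from real to printed objects for that model and metric. The Wilcoxon signed-rank test the authors mention would be run on the same paired deltas; it is left out here to keep the sketch free of third-party dependencies.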

Circularity Check

0 steps flagged

Empirical benchmarking study with no derivations, fitted parameters, or self-referential predictions.

The paper is a controlled empirical evaluation of off-the-shelf VLMs on a custom dataset of real vs. geometrically matched 3D-printed objects. It reports observed captioning performance and metric behaviors without any derivation chain, parameter fitting, or 'prediction' steps that reduce to the inputs by construction. Central claims rest on direct model outputs and standard metrics applied to the collected scenes. No load-bearing self-citations, uniqueness theorems, or ansatzes are invoked. The representativeness of the texture/material shift for broader robotic domain shifts is a question of external validity, not an internal circularity in any claimed derivation.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper is an empirical evaluation study that relies on standard assumptions from machine learning benchmarking rather than introducing new mathematical constructs or fitted parameters.

axioms (1)
  • Domain assumption: standard VLM evaluation metrics can measure semantic alignment and factual grounding in object captions.
    Invoked when benchmarking performance across real and printed objects.
