pith. machine review for the scientific record.

arxiv: 2605.03927 · v1 · submitted 2026-05-05 · 💻 cs.CV

Recognition: unknown

StateVLM: A State-Aware Vision-Language Model for Robotic Affordance Reasoning

Authors on Pith · no claims yet

Pith reviewed 2026-05-07 17:37 UTC · model grok-4.3

classification 💻 cs.CV
keywords vision-language models · auxiliary regression loss · robotic affordance reasoning · object state localization · OSAR benchmark · numerical reasoning · bounding box decoder

The pith

An auxiliary regression loss computed from box decoder outputs during fine-tuning improves vision-language models' numerical reasoning for precise object and state localization in robotic affordance tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that standard vision-language models struggle with numerical tasks like object detection and state localization needed for robotics. It introduces a training method that adds an auxiliary regression loss derived from box decoder outputs to strengthen these abilities, while leaving the model's normal text generation unchanged at inference time. StateVLM applies this to learn detailed object positions, states, and graspable regions. The authors also release the OSAR benchmark containing over 1,100 scenes with thousands of annotated objects to test affordance reasoning. Experiments report consistent gains from the loss, averaging 1.6 percent on referring expression datasets and 5.2 percent on the new benchmark, with added benefits for output consistency in complex scenarios.

Core claim

StateVLM adapts vision-language models by computing an Auxiliary Regression Loss from box decoder outputs during fine-tuning to improve numerical reasoning in object detection, object-state localization, and graspable region identification, while retaining standard sequence prediction at inference. On adapted RefCOCO benchmarks this yields an average 1.6 percent gain, and on the introduced OSAR benchmark of 1,172 scenes the model with the loss outperforms versions without it by an average of 5.2 percent, with particular gains in consistency for affordance reasoning.

What carries the argument

Auxiliary Regression Loss (ARL) computed from box decoder outputs during fine-tuning, serving as an extra training signal that strengthens numerical accuracy for localization without changing the sequence prediction used at test time.
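For readers who want the mechanics concrete, here is a minimal sketch of how such an auxiliary term could be wired into fine-tuning, assuming a PyTorch-style setup: lm_loss stands in for the VLM's causal-LM objective, box_hidden for backbone hidden states at box-token positions, and the MLP head, sigmoid output range, L1 formulation, and arl_weight default are illustrative assumptions rather than the paper's exact design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BoxDecoderHead(nn.Module):
    """Illustrative box decoder: maps hidden states at box-token positions
    to normalized (cx, cy, w, h) coordinates for the auxiliary loss."""
    def __init__(self, hidden_dim: int):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim), nn.GELU(),
            nn.Linear(hidden_dim, 4),
        )

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        return self.mlp(hidden_states).sigmoid()  # keep boxes in [0, 1]

def training_loss(lm_loss: torch.Tensor,
                  box_hidden: torch.Tensor,   # (num_boxes, hidden_dim)
                  gt_boxes: torch.Tensor,     # (num_boxes, 4), normalized
                  box_head: BoxDecoderHead,
                  arl_weight: float = 1.0) -> torch.Tensor:
    """Combined objective used only during fine-tuning; inference still
    runs plain next-token prediction, so the head can be discarded."""
    pred_boxes = box_head(box_hidden)
    arl = F.l1_loss(pred_boxes, gt_boxes)      # regression term (assumed L1)
    return lm_loss + arl_weight * arl
```

At inference the head is simply dropped: coordinates are still emitted as ordinary text tokens, which is what the review means by retaining standard sequence prediction.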

If this is right

  • Models trained with ARL achieve higher accuracy on referring expression comprehension tasks such as RefCOCO, RefCOCO+, and RefCOCOg.
  • The performance lift is larger on the OSAR benchmark, particularly for the complex affordance reasoning task.
  • ARL improves the consistency of model outputs when reasoning about object states and graspable regions.
  • StateVLM learns fine-grained representations that support both localization and graspability for robotic use.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The training approach could extend to other vision-language models to support more quantitative instructions in robotic planning without requiring new architectures.
  • The OSAR benchmark offers a public test set that future models can use to measure progress on state-aware affordance tasks.
  • Combining regression signals with language objectives may help address similar numerical weaknesses in other multimodal settings.

Load-bearing premise

That adding the auxiliary regression loss during training will integrate without causing instabilities or degrading the model's behavior on non-regression tasks.

What would settle it

A training run in which StateVLM with the auxiliary loss shows lower accuracy than the version without it on the OSAR benchmark, or exhibits unstable convergence, would disprove the claimed benefit.
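A minimal sketch of that check, assuming paired predicted and ground-truth boxes in xyxy format and using torchvision's IoU helper; preds_arl, preds_clm, and gts are hypothetical tensors standing in for the two models' OSAR outputs.

```python
import torch
from torchvision.ops import box_iou

def acc_at_50(pred_boxes: torch.Tensor, gt_boxes: torch.Tensor) -> float:
    """Fraction of predictions whose IoU with the paired ground-truth box
    exceeds 0.5 (both tensors are (N, 4) in xyxy format, matched by index)."""
    ious = box_iou(pred_boxes, gt_boxes).diagonal()
    return (ious > 0.5).float().mean().item()

# Hypothetical comparison; the claimed benefit fails if the ARL model scores lower:
# acc_at_50(preds_arl, gts) < acc_at_50(preds_clm, gts)
```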

Figures

Figures reproduced from arXiv: 2605.03927 by Matthias Kerzel, Mengdi Li, Paul Striker, Stefan Wermter, Xiaowen Sun, Xufeng Zhao.

Figure 1
Figure 1: The overall framework of StateVLM. During the training phase (a), a specially designed box decoder converts sequence predictions into embedding predictions for calculating an auxiliary regression loss. In the inference phase (b), however, the model continues to rely on standard sequence-based text and number predictions, because the LLM backbone of the VLM is originally trained on sequence prediction.
Figure 2
Figure 2: The decomposition of robotic manipulation into macro- and micro-level tasks.
Figure 3
Figure 3: Example scenes in OSAR: (a) Simple scenes are defined as those with only one object from each category. (b) and (c) Complex scenes are defined as those with multiple objects from each category in various states. The red box marks the ideal grasp region; this is an example of how an object's state affects its affordance, as the area covered with food should be avoided when grasping.
Figure 4
Figure 4: Object statistics in OSAR. Semi-solid foods usually include sauces, pasta, soup, …
Figure 5
Figure 5: StateVLM loss curves: (a) StateVLM (L_CLM and L_CLM+ARL) training loss progression over steps, (b) validation loss for StateVLM (L_CLM), and (c) validation loss for StateVLM (L_CLM+ARL).
Figure 6
Figure 6: StateVLM (L_CLM+ARL) performance improves significantly from 5,000 to 15,000 steps and is better than StateVLM (L_CLM) after 10,000 steps. StateVLM (L_CLM) performance, however, remains stable or even decreases over training, which is consistent with the validation loss curve in Fig. 5b. (Axes: model step, 5,000–15,000, vs. Acc@0.5; panel shown: RefCOCO.)
Figure 7
Figure 7: StateVLM (L_CLM) performance changes over training steps.
Figure 8
Figure 8: StateVLM (L_CLM+ARL) performance changes over training steps.
Figure 9
Figure 9: Comprehensive performance comparison on OSAR.
Figure 10
Figure 10: Qualitative comparison of StateVLM (L_CLM) and StateVLM (L_CLM+ARL) on grounded object detection and affordance reasoning examples. Ground-truth boxes are shown in green, while model predictions are shown in orange (StateVLM (L_CLM)) and teal (StateVLM (L_CLM+ARL)).
read the original abstract

Vision-language models (VLMs) have shown remarkable performance in various robotic tasks, as they can perceive visual information and understand natural language instructions. However, when applied to robotics, VLMs remain subject to a fundamental limitation inherent in large language models (LLMs): they struggle with numerical reasoning, particularly in object detection and object-state localization. To explore numerical reasoning as a regression task in VLMs, we propose a novel training strategy to adapt VLMs for object detection and object-state localization. This approach leverages box decoder outputs to compute an Auxiliary Regression Loss (ARL) during fine-tuning, while preserving standard sequence prediction at inference. We leverage this training strategy to develop StateVLM (State-aware Vision-Language Model), a novel model designed to perceive and learn fine-grained object representations, including precise localization of objects and their states, as well as graspable regions. Due to the lack of a benchmark for object-state affordance reasoning, we introduce an open-source benchmark, Object State Affordance Reasoning (OSAR), which contains 1,172 scenes with 7,746 individual objects and corresponding bounding boxes. Comparative experiments on adapted benchmarks (RefCOCO, RefCOCO+, and RefCOCOg) demonstrate that ARL improves model performance by an average of 1.6% compared to models without ARL. Experiments on the OSAR benchmark further support this finding, showing that StateVLM with ARL achieves an average of 5.2% higher performance than models without ARL. In particular, ARL is also important for the complex task of affordance reasoning in OSAR, where it enhances the consistency of model outputs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper proposes StateVLM, a vision-language model for robotic affordance reasoning that incorporates an Auxiliary Regression Loss (ARL) computed from box decoder outputs during fine-tuning to improve numerical reasoning for object detection, state localization, and graspable regions. Standard next-token prediction is retained at inference. The authors introduce the OSAR benchmark (1,172 scenes, 7,746 objects) and report that ARL yields average gains of 1.6% on adapted RefCOCO/RefCOCO+/RefCOCOg and 5.2% on OSAR relative to baselines without ARL, with particular benefit for consistency in complex affordance tasks.

Significance. If the ARL approach can be shown not to degrade general VLM capabilities, it would provide a lightweight, inference-preserving method for injecting regression-style numerical reasoning into VLMs, addressing a known limitation in robotic applications. The OSAR benchmark is a useful contribution for evaluating object-state affordance. The current empirical support is moderate, resting on gains whose reliability is not fully established by the reported details.

major comments (3)
  1. [Experiments] Experiments section (comparative results on RefCOCO variants and OSAR): the reported 1.6% and 5.2% average improvements lack any mention of statistical significance testing, standard deviation across multiple runs, or exact baseline implementation details (e.g., whether baselines were re-trained with identical hyperparameters). This undermines confidence that the gains are robust rather than sensitive to post-hoc choices.
  2. [Method] Training strategy description (ARL integration): no results are provided on held-out general VLM tasks such as VQA or image captioning after ARL fine-tuning. This is load-bearing for the central claim that ARL can be added without introducing distribution shift that degrades core sequence-prediction capabilities.
  3. [Method] ARL formulation (loss weighting): the single free hyperparameter (ARL loss weight) is introduced without ablation or sensitivity analysis, leaving open whether the reported gains depend on a narrow choice of this parameter.
minor comments (2)
  1. [Abstract] The abstract and method sections use 'average of 5.2% higher performance' without clarifying the exact metric aggregation (e.g., mean over which sub-tasks or scenes).
  2. [Benchmark] OSAR benchmark description would benefit from explicit discussion of how scenes were collected and annotated to allow assessment of potential biases.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the thoughtful and constructive comments on our paper. We have carefully considered each point and provide detailed responses below. We agree that additional experiments and clarifications will strengthen the manuscript and plan to incorporate them in the revised version.

read point-by-point responses
  1. Referee: [Experiments] Experiments section (comparative results on RefCOCO variants and OSAR): the reported 1.6% and 5.2% average improvements lack any mention of statistical significance testing, standard deviation across multiple runs, or exact baseline implementation details (e.g., whether baselines were re-trained with identical hyperparameters). This undermines confidence that the gains are robust rather than sensitive to post-hoc choices.

    Authors: We acknowledge the importance of statistical rigor in reporting results. In the revised manuscript, we will conduct additional experiments with multiple random seeds (e.g., 3-5 runs) and report mean performance along with standard deviations for both the RefCOCO variants and OSAR benchmark. We will also explicitly detail the baseline implementations, confirming that they were re-trained using the same hyperparameters, training data splits, and optimization settings as our StateVLM model to ensure fair comparison. This will provide stronger evidence for the robustness of the observed gains. revision: yes

  2. Referee: [Method] Training strategy description (ARL integration): no results are provided on held-out general VLM tasks such as VQA or image captioning after ARL fine-tuning. This is load-bearing for the central claim that ARL can be added without introducing distribution shift that degrades core sequence-prediction capabilities.

    Authors: We agree that preserving general VLM capabilities is crucial for the claim. In the revision, we will add evaluations on held-out tasks including VQA (e.g., VQAv2) and image captioning (e.g., COCO Captions) using the fine-tuned StateVLM model. We expect minimal degradation since ARL is an auxiliary loss applied only during fine-tuning on the box decoder outputs, and inference remains unchanged as standard next-token prediction. These results will demonstrate that the core sequence-prediction abilities are retained. revision: yes

  3. Referee: [Method] ARL formulation (loss weighting): the single free hyperparameter (ARL loss weight) is introduced without ablation or sensitivity analysis, leaving open whether the reported gains depend on a narrow choice of this parameter.

    Authors: We will include a sensitivity analysis and ablation study on the ARL loss weight in the revised manuscript. Specifically, we will test a range of weight values (e.g., 0.1, 0.5, 1.0, 2.0) and report performance on both RefCOCO and OSAR to show how the gains vary and to justify our chosen default value. This will address concerns about hyperparameter sensitivity. revision: yes
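As a rough illustration of the protocols promised in responses 1 and 3, a seed-averaged sweep over the ARL weight might look like the sketch below; train_and_eval is a hypothetical stand-in for the authors' fine-tuning plus OSAR evaluation pipeline, and the weight and seed values simply echo those proposed above.

```python
import statistics

ARL_WEIGHTS = [0.1, 0.5, 1.0, 2.0]  # values proposed in response 3
SEEDS = [0, 1, 2]                   # 3-5 runs per configuration (response 1)

def sweep(train_and_eval):
    """Return the mean and standard deviation of the OSAR score for each weight."""
    results = {}
    for w in ARL_WEIGHTS:
        scores = [train_and_eval(arl_weight=w, seed=s) for s in SEEDS]
        results[w] = (statistics.mean(scores), statistics.stdev(scores))
    return results
```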

Circularity Check

0 steps flagged

No circularity; empirical gains shown via direct held-out comparisons

full rationale

The paper's core contribution is an auxiliary regression loss (ARL) added only during fine-tuning of a VLM, with standard next-token prediction retained at inference. Performance is measured by direct comparison of models trained with vs. without ARL on the external RefCOCO family benchmarks and the newly introduced OSAR benchmark. No equations reduce a claimed prediction to a fitted parameter by construction, no load-bearing self-citations justify uniqueness or ansatzes, and no derivation chain collapses to renaming or self-definition. The reported 1.6% and 5.2% lifts are therefore independent empirical outcomes rather than tautological restatements of the training procedure.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the standard assumption that VLMs can be fine-tuned with an auxiliary loss term while leaving inference unchanged; the only free parameter introduced is the relative weight of the Auxiliary Regression Loss, which must be chosen or tuned.

free parameters (1)
  • ARL loss weight
    The scalar balancing the auxiliary regression term against the primary language-modeling loss is a tunable hyperparameter whose value is not reported in the abstract.
axioms (1)
  • domain assumption: Box decoder outputs can supply a regression training signal without altering the model's language generation behavior at inference time
    Invoked when the paper states that ARL is applied only during fine-tuning while preserving standard sequence prediction.

pith-pipeline@v0.9.0 · 5622 in / 1438 out tokens · 114655 ms · 2026-05-07T17:37:26.756662+00:00 · methodology

