StateVLM: A State-Aware Vision-Language Model for Robotic Affordance Reasoning
Pith reviewed 2026-05-07 17:37 UTC · model grok-4.3
The pith
An auxiliary regression loss computed from box decoder outputs during fine-tuning improves vision-language models' numerical reasoning for precise object and state localization in robotic affordance tasks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
StateVLM adapts vision-language models by computing an Auxiliary Regression Loss from box decoder outputs during fine-tuning to improve numerical reasoning in object detection, object-state localization, and graspable region identification, while retaining standard sequence prediction at inference. On adapted RefCOCO benchmarks this yields an average 1.6 percent gain, and on the introduced OSAR benchmark of 1,172 scenes the model with the loss outperforms versions without it by an average of 5.2 percent, with particular gains in consistency for affordance reasoning.
What carries the argument
Auxiliary Regression Loss (ARL) computed from box decoder outputs during fine-tuning, serving as an extra training signal that strengthens numerical accuracy for localization without changing the sequence prediction used at test time.
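The review does not reproduce the loss itself; purely as illustration, an objective of the shape the abstract describes might pair the usual next-token loss with an L1 + GIoU penalty on the box decoder's coordinates. Everything below (the box parameterization, the L1/GIoU mix, and the `lambda_arl` weight) is an assumption for the sketch, not the paper's formulation:

```python
def giou(a, b):
    """Generalized IoU for axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    # Smallest enclosing box: the GIoU term penalizes even non-overlapping
    # predictions, which plain IoU cannot.
    cx1, cy1 = min(a[0], b[0]), min(a[1], b[1])
    cx2, cy2 = max(a[2], b[2]), max(a[3], b[3])
    area_c = (cx2 - cx1) * (cy2 - cy1)
    return inter / union - (area_c - union) / area_c

def training_loss(lm_loss, pred_box, gt_box, lambda_arl=1.0):
    """Hypothetical total fine-tuning objective: LM loss plus weighted ARL."""
    l1 = sum(abs(p - g) for p, g in zip(pred_box, gt_box)) / 4.0
    arl = l1 + (1.0 - giou(pred_box, gt_box))
    return lm_loss + lambda_arl * arl
```

At inference the box decoder and this extra term would simply be dropped, leaving standard sequence prediction untouched, which is the property the core claim rests on.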
If this is right
- Models trained with ARL achieve higher accuracy on referring expression comprehension tasks such as RefCOCO, RefCOCO+, and RefCOCOg.
- The performance lift is larger on the OSAR benchmark, particularly for the complex affordance reasoning task.
- ARL improves the consistency of model outputs when reasoning about object states and graspable regions.
- StateVLM learns fine-grained representations that support both localization and graspability for robotic use.
Where Pith is reading between the lines
- The training approach could extend to other vision-language models to support more quantitative instructions in robotic planning without requiring new architectures.
- The OSAR benchmark offers a public test set that future models can use to measure progress on state-aware affordance tasks.
- Combining regression signals with language objectives may help address similar numerical weaknesses in other multimodal settings.
Load-bearing premise
That the auxiliary regression loss can be added during training without causing instabilities or degrading the model's behavior on non-regression tasks.
What would settle it
A training run in which StateVLM with the auxiliary loss shows lower accuracy than the version without it on the OSAR benchmark, or exhibits unstable convergence, would disprove the claimed benefit.
Original abstract
Vision-language models (VLMs) have shown remarkable performance in various robotic tasks, as they can perceive visual information and understand natural language instructions. However, when applied to robotics, VLMs remain subject to a fundamental limitation inherent in large language models (LLMs): they struggle with numerical reasoning, particularly in object detection and object-state localization. To explore numerical reasoning as a regression task in VLMs, we propose a novel training strategy to adapt VLMs for object detection and object-state localization. This approach leverages box decoder outputs to compute an Auxiliary Regression Loss (ARL) during fine-tuning, while preserving standard sequence prediction at inference. We leverage this training strategy to develop StateVLM (State-aware Vision-Language Model), a novel model designed to perceive and learn fine-grained object representations, including precise localization of objects and their states, as well as graspable regions. Due to the lack of a benchmark for object-state affordance reasoning, we introduce an open-source benchmark, Object State Affordance Reasoning (OSAR), which contains 1,172 scenes with 7,746 individual objects and corresponding bounding boxes. Comparative experiments on adapted benchmarks (RefCOCO, RefCOCO+, and RefCOCOg) demonstrate that ARL improves model performance by an average of 1.6% compared to models without ARL. Experiments on the OSAR benchmark further support this finding, showing that StateVLM with ARL achieves an average of 5.2% higher performance than models without ARL. In particular, ARL is also important for the complex task of affordance reasoning in OSAR, where it enhances the consistency of model outputs.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes StateVLM, a vision-language model for robotic affordance reasoning that incorporates an Auxiliary Regression Loss (ARL) computed from box decoder outputs during fine-tuning to improve numerical reasoning for object detection, state localization, and graspable regions. Standard next-token prediction is retained at inference. The authors introduce the OSAR benchmark (1,172 scenes, 7,746 objects) and report that ARL yields average gains of 1.6% on adapted RefCOCO/RefCOCO+/RefCOCOg and 5.2% on OSAR relative to baselines without ARL, with particular benefit for consistency in complex affordance tasks.
Significance. If the ARL approach can be shown not to degrade general VLM capabilities, it would provide a lightweight, inference-preserving method for injecting regression-style numerical reasoning into VLMs, addressing a known limitation in robotic applications. The OSAR benchmark is a useful contribution for evaluating object-state affordance. The current empirical support is moderate, resting on gains whose reliability is not fully established by the reported details.
major comments (3)
- [Experiments] Experiments section (comparative results on RefCOCO variants and OSAR): the reported 1.6% and 5.2% average improvements lack any mention of statistical significance testing, standard deviation across multiple runs, or exact baseline implementation details (e.g., whether baselines were re-trained with identical hyperparameters). This undermines confidence that the gains are robust rather than sensitive to post-hoc choices.
- [Method] Training strategy description (ARL integration): no results are provided on held-out general VLM tasks such as VQA or image captioning after ARL fine-tuning. This is load-bearing for the central claim that ARL can be added without introducing distribution shift that degrades core sequence-prediction capabilities.
- [Method] ARL formulation (loss weighting): the single free hyperparameter (ARL loss weight) is introduced without ablation or sensitivity analysis, leaving open whether the reported gains depend on a narrow choice of this parameter.
minor comments (2)
- [Abstract] The abstract and method sections use 'average of 5.2% higher performance' without clarifying the exact metric aggregation (e.g., mean over which sub-tasks or scenes).
- [Benchmark] OSAR benchmark description would benefit from explicit discussion of how scenes were collected and annotated to allow assessment of potential biases.
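The aggregation ambiguity flagged above is easy to make concrete: macro-averaging over sub-tasks and micro-averaging over pooled scenes disagree whenever sub-task sizes differ. The sub-task names and counts below are invented for illustration, not taken from the paper:

```python
def macro_avg(subtasks):
    """Mean of per-sub-task accuracies; every sub-task weighted equally."""
    return sum(c / t for c, t in subtasks.values()) / len(subtasks)

def micro_avg(subtasks):
    """Pooled accuracy over all scenes; large sub-tasks dominate."""
    correct = sum(c for c, _ in subtasks.values())
    total = sum(t for _, t in subtasks.values())
    return correct / total

# Hypothetical split with a small, hard affordance-reasoning sub-task:
results = {"detection": (900, 1000),
           "state_localization": (450, 500),
           "affordance_reasoning": (60, 120)}
```

Here the macro average is pulled down by the small hard sub-task while the micro average is not, so a reported "average of 5.2%" can mean materially different things depending on the convention.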
Simulated Author's Rebuttal
We thank the referee for the thoughtful and constructive comments on our paper. We have carefully considered each point and provide detailed responses below. We agree that additional experiments and clarifications will strengthen the manuscript and plan to incorporate them in the revised version.
Point-by-point responses
Referee: [Experiments] Experiments section (comparative results on RefCOCO variants and OSAR): the reported 1.6% and 5.2% average improvements lack any mention of statistical significance testing, standard deviation across multiple runs, or exact baseline implementation details (e.g., whether baselines were re-trained with identical hyperparameters). This undermines confidence that the gains are robust rather than sensitive to post-hoc choices.
Authors: We acknowledge the importance of statistical rigor in reporting results. In the revised manuscript, we will conduct additional experiments with multiple random seeds (e.g., 3-5 runs) and report mean performance along with standard deviations for both the RefCOCO variants and OSAR benchmark. We will also explicitly detail the baseline implementations, confirming that they were re-trained using the same hyperparameters, training data splits, and optimization settings as our StateVLM model to ensure fair comparison. This will provide stronger evidence for the robustness of the observed gains. revision: yes
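A minimal version of the promised multi-seed reporting could look like the following sketch (the seed scores are placeholders, and a paired bootstrap is one common choice rather than anything the authors commit to):

```python
import random
import statistics

def summarize(scores):
    """Mean and sample standard deviation over per-seed scores."""
    return statistics.mean(scores), statistics.stdev(scores)

def bootstrap_delta_ci(with_arl, without_arl, n_boot=10_000, seed=0):
    """Bootstrap a 95% confidence interval for the paired per-seed gain,
    resampling the per-seed (with - without) differences with replacement."""
    rng = random.Random(seed)
    deltas = [a - b for a, b in zip(with_arl, without_arl)]
    boots = sorted(
        statistics.mean(rng.choices(deltas, k=len(deltas)))
        for _ in range(n_boot)
    )
    return boots[int(0.025 * n_boot)], boots[int(0.975 * n_boot)]
```

A 95% interval for the per-seed gain that excludes zero would make the reported 5.2% figure much harder to attribute to seed noise.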
Referee: [Method] Training strategy description (ARL integration): no results are provided on held-out general VLM tasks such as VQA or image captioning after ARL fine-tuning. This is load-bearing for the central claim that ARL can be added without introducing distribution shift that degrades core sequence-prediction capabilities.
Authors: We agree that preserving general VLM capabilities is crucial for the claim. In the revision, we will add evaluations on held-out tasks including VQA (e.g., VQAv2) and image captioning (e.g., COCO Captions) using the fine-tuned StateVLM model. We expect minimal degradation since ARL is an auxiliary loss applied only during fine-tuning on the box decoder outputs, and inference remains unchanged as standard next-token prediction. These results will demonstrate that the core sequence-prediction abilities are retained. revision: yes
Referee: [Method] ARL formulation (loss weighting): the single free hyperparameter (ARL loss weight) is introduced without ablation or sensitivity analysis, leaving open whether the reported gains depend on a narrow choice of this parameter.
Authors: We will include a sensitivity analysis and ablation study on the ARL loss weight in the revised manuscript. Specifically, we will test a range of weight values (e.g., 0.1, 0.5, 1.0, 2.0) and report performance on both RefCOCO and OSAR to show how the gains vary and to justify our chosen default value. This will address concerns about hyperparameter sensitivity. revision: yes
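The promised sweep is a one-dimensional grid; its bookkeeping might look like this, where `evaluate` is a placeholder for a full fine-tune-and-score run at a given ARL weight (nothing here reflects the authors' actual tooling):

```python
def sweep_arl_weight(evaluate, weights=(0.1, 0.5, 1.0, 2.0)):
    """Evaluate each candidate ARL loss weight and return the best weight
    together with the full weight -> score table, so reviewers can see
    whether gains hold across the grid or only at one setting."""
    scores = {w: evaluate(w) for w in weights}
    best = max(scores, key=scores.get)
    return best, scores
```

Reporting the whole table, not just the best point, is what addresses the referee's concern about a narrow hyperparameter choice.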
Circularity Check
No circularity; empirical gains shown via direct held-out comparisons
full rationale
The paper's core contribution is an auxiliary regression loss (ARL) added only during fine-tuning of a VLM, with standard next-token prediction retained at inference. Performance is measured by direct comparison of models trained with vs. without ARL on the external RefCOCO family benchmarks and the newly introduced OSAR benchmark. No equations reduce a claimed prediction to a fitted parameter by construction, no load-bearing self-citations justify uniqueness or ansatzes, and no derivation chain collapses to renaming or self-definition. The reported 1.6% and 5.2% lifts are therefore independent empirical outcomes rather than tautological restatements of the training procedure.
Axiom & Free-Parameter Ledger
free parameters (1)
- ARL loss weight
axioms (1)
- Domain assumption: box decoder outputs can be used for regression supervision without altering the model's language generation behavior at inference time