Bounding Boxes as Goals: Language-Conditioned Grasping via Neuro-Symbolic Planning

Allison Andreyev; Landon Eum; Nestor Tiglao; Romel Gomez

arxiv: 2606.12910 · v1 · pith:HW6XOPHInew · submitted 2026-06-11 · 💻 cs.RO · cs.AI · cs.CV · cs.SY · eess.SY

Pith reviewed 2026-06-27 06:45 UTC · model grok-4.3

keywords: language-conditioned grasping · neuro-symbolic planning · vision-language models · bounding boxes · tabletop manipulation · zero-shot generalization · robot task planning

The pith

GRASP maps natural-language queries to bounding-box goals for robot grasping without task-specific training

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces GRASP, a framework that converts natural-language prompts into neuro-symbolic goal states represented as bounding boxes using a pretrained vision-language model. These boxes are grounded in the physical scene through an object detection pipeline, allowing the robot to handle abstract spatial relations without fixed color lists, hardcoded coordinates, or any fine-tuning. A sympathetic reader would care because current methods are either computationally heavy or demand thousands of demonstrations, while this approach aims for lightweight, real-time adaptation in household or industrial settings. Real-robot experiments across three difficulty levels provide the supporting evidence.

Core claim

GRASP translates natural-language queries into bounding-box goal states via a pretrained VLM, grounds them physically through detection, and executes them with neuro-symbolic planning, achieving 73.3 percent success across 90 real-robot trials at three difficulty levels with no task-specific training.

What carries the argument

The bounding-box detection pipeline that converts VLM language outputs into grounded physical goal states for symbolic planning
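To make the interface concrete, here is a minimal Python sketch of what a bounding-box goal state could look like once the language model has produced it. The JSON schema, the field names (`instruction`, `objects`, `label`, `goal_box`), the normalized `[x1, y1, x2, y2]` convention, and the helper names are all hypothetical; the page says only that an LLM emits a JSON goal state plus a list of objects of interest for the detector.

```python
import json
from typing import Dict, List, Tuple

# Normalized image coordinates (x1, y1, x2, y2); this convention is an assumption.
BoxXYXY = Tuple[float, float, float, float]

# Hypothetical goal-state JSON. The schema and the numbers are illustrative only.
EXAMPLE_GOAL_JSON = """
{
  "instruction": "put the red block on the top shelf",
  "objects": [
    {"label": "red block", "goal_box": [0.40, 0.05, 0.60, 0.25]}
  ]
}
"""


def parse_goal_state(raw: str) -> Dict[str, BoxXYXY]:
    """Parse the JSON text into a label -> goal box mapping, rejecting malformed boxes."""
    data = json.loads(raw)
    goal: Dict[str, BoxXYXY] = {}
    for obj in data["objects"]:
        x1, y1, x2, y2 = (float(v) for v in obj["goal_box"])
        # A usable box lies inside the image and has positive width and height.
        if not (0.0 <= x1 < x2 <= 1.0 and 0.0 <= y1 < y2 <= 1.0):
            raise ValueError(f"invalid goal box for {obj['label']!r}")
        goal[obj["label"]] = (x1, y1, x2, y2)
    return goal


def detection_candidates(goal: Dict[str, BoxXYXY]) -> List[str]:
    """Labels that would be handed to the open-vocabulary detector as text queries."""
    return sorted(goal)
```

Validating the boxes at parse time is one way to catch bad language-model output before it reaches the planner, which is the failure path the load-bearing premise below depends on.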

If this is right

Robots interpret abstract spatial concepts such as top shelf without task-specific engineering.
The system generalizes zero-shot to new objects and instructions.
Planning stays lightweight compared with end-to-end trained models that require large demonstration sets.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Extending the pipeline to mobile bases or 3D scenes would test whether bounding-box goals remain sufficient outside tabletop settings.
Swapping the VLM or detection model could quantify robustness without retraining the planner.
Sequencing multiple language commands would require adding temporal logic on top of the current single-goal mechanism.

Load-bearing premise

A pretrained VLM can reliably translate natural-language queries into accurate bounding-box goal states that align with objects detected in the physical world.

What would settle it

Trials in which ambiguous language or detection failures produce incorrect bounding boxes and drive the overall success rate well below 73 percent.

Figures

Figures reproduced from arXiv: 2606.12910 by Allison Andreyev, Landon Eum, Nestor Tiglao, Romel Gomez.

**Figure 1.** Key Parameters. Given a natural language instruction, GRASP extracts objects of interest which are detected via a pretrained VLM (GroundingDINO). The resulting goal state serves as both a symbolic representation of the desired configuration and a long-term task completion signal for closed-loop execution.

**Figure 2.** User input and system prompt pipeline. An LLM parses user input into i) a JSON file representing a goal state, ii) objects of interest used as GroundingDINO candidates, and iii) a bounding box visualization for the goal state.

**Figure 3.** GRASP Pipeline. At each timestep, shelf and end-effector camera frames are processed by GroundingDINO to produce labeled bounding box detections, serialized into a JSON scene state. The goal similarity module compares the current scene to the LLM-generated goal state via IoU and center distance. If similarity exceeds a threshold or no objects are detected for several consecutive frames, the task terminates…

**Figure 4.** Experimental Setup. We conduct 9 total grasping experiments across 3 levels of difficulty, with each difficulty level having 3 distinct tasks.

**Figure 5.** Likert scale evaluation results from the user study. The table represents the percentage distribution of ratings (1-5), or "Strongly Disagree" to "Strongly Agree," across 4 categories: a) Two-Group Cross Constraints, b) Triple-Attribute Filtering, c) Three-Way Spatial Partition, and d) Overlapping Constraints.
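The Figure 3 caption describes a goal similarity check built on IoU and center distance, together with a termination rule. A minimal Python sketch of such a check follows. The `Box` type, the function names, the threshold values (`iou_min`, `dist_max`, `max_empty_frames`), and the choice to require both criteria per object are assumptions made for illustration; the page does not say how the paper sets or combines them.

```python
import math
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class Box:
    """Axis-aligned box in pixels: (x1, y1) top-left, (x2, y2) bottom-right."""
    x1: float
    y1: float
    x2: float
    y2: float

    @property
    def area(self) -> float:
        return max(0.0, self.x2 - self.x1) * max(0.0, self.y2 - self.y1)

    @property
    def center(self) -> Tuple[float, float]:
        return ((self.x1 + self.x2) / 2.0, (self.y1 + self.y2) / 2.0)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two boxes; 0.0 when they do not overlap."""
    iw = max(0.0, min(a.x2, b.x2) - max(a.x1, b.x1))
    ih = max(0.0, min(a.y2, b.y2) - max(a.y1, b.y1))
    inter = iw * ih
    union = a.area + b.area - inter
    return inter / union if union > 0 else 0.0


def center_distance(a: Box, b: Box) -> float:
    """Euclidean distance between box centers, in pixels."""
    (ax, ay), (bx, by) = a.center, b.center
    return math.hypot(ax - bx, ay - by)


def goal_reached(current: Dict[str, Box], goal: Dict[str, Box],
                 iou_min: float = 0.5, dist_max: float = 20.0) -> bool:
    """True when every goal label is detected and close enough to its goal box.

    The thresholds are placeholder values, not numbers from the paper.
    """
    for label, goal_box in goal.items():
        cur = current.get(label)
        if cur is None:
            return False
        if iou(cur, goal_box) < iou_min or center_distance(cur, goal_box) > dist_max:
            return False
    return True


def should_terminate(current: Dict[str, Box], goal: Dict[str, Box],
                     empty_frames: int, max_empty_frames: int = 5) -> bool:
    """Stop on goal match, or after several consecutive frames with no detections."""
    return goal_reached(current, goal) or empty_frames >= max_empty_frames
```

In a control loop, the caller would reset `empty_frames` to zero whenever the detector returns at least one box and increment it otherwise, so that a lost view of the scene ends the episode instead of stalling it.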

Original abstract

For robotics to be effectively integrated into household or industrial environments, machines must adapt to natural-language prompts in real time. Although Vision-Language Models (VLMs) have enabled zero-shot generalization in robot task and motion planning (TAMP), current state-of-the-art approaches often remain computationally "heavyweight" or require extensive training on thousands of demonstrations. We present GRASP (Grounded Reasoning and Symbolic Planning), a framework designed as a step toward open-vocabulary tabletop manipulation. Our approach leverages a pretrained VLM to translate natural-language queries into neuro-symbolic goal states, grounded in the physical world via a bounding-box detection pipeline. Unlike methods that rely on fixed color lists or hard-coded coordinates, GRASP enables robots to interpret abstract spatial concepts such as "top shelf" and execute tasks without additional fine-tuning. We achieve 73.3% overall success across 90 real-robot trials at three difficulty levels, requiring no task-specific training.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

GRASP frames VLM outputs as bounding-box goals for neuro-symbolic planning and reports 73% success on 90 real-robot trials with no task-specific training, but the abstract leaves the grounding pipeline and baselines underspecified.

The main thing here is a clean framing: take a pretrained VLM, have it output bounding boxes from language, then feed those boxes as explicit goals into a neuro-symbolic planner for tabletop grasping. That avoids both heavy end-to-end training and hand-coded color lists, and the abstract positions it as a step toward open-vocabulary manipulation. The 73.3% success across three difficulty levels on 90 real-robot trials is the concrete number they lead with, and they emphasize zero task-specific training.

What stands out is the explicit use of bounding boxes as the interface between the VLM and the planner. That seems more structured than just prompting a VLM for actions directly, and it lets them handle spatial phrases like "top shelf" without extra fine-tuning. The real-robot evaluation is also a plus; many similar papers stay in simulation.

The soft spots are mostly around missing detail. The abstract gives an overall success rate but no breakdown by difficulty, no baselines, no error bars, and no description of failure modes or how the bounding-box detection is actually grounded in the scene. The claim that the VLM reliably produces usable goal states rests on the pipeline working as described, yet we get no numbers on detection accuracy or how often the planner receives bad boxes. Without those, it's hard to judge whether the 73% reflects the method or just favorable test conditions.

This is for people working on language-conditioned TAMP who want a lighter alternative to large demonstration datasets. If the full paper supplies the missing experimental protocol and a clear comparison to prior neuro-symbolic or VLM baselines, it would be worth a serious referee. Based on the abstract alone, it is not yet clear whether the central empirical claim holds up under scrutiny.

Referee Report

2 major / 1 minor

Summary. The paper presents the GRASP framework for language-conditioned tabletop grasping. A pretrained VLM translates natural-language queries into bounding-box goal states that are grounded via a detection pipeline; these goals then drive a neuro-symbolic planner. The central empirical claim is a 73.3% success rate across 90 real-robot trials spanning three difficulty levels, achieved with no task-specific training or fine-tuning.

Significance. If the experimental results can be substantiated, the work would be significant for open-vocabulary robotic manipulation. It offers a lightweight alternative to end-to-end learned policies by combining VLMs with symbolic planning to handle abstract spatial language without large demonstration datasets or fixed vocabularies.

major comments (2)

[Abstract] The reported 73.3% success rate over 90 trials is stated without any description of the experimental protocol, choice of baselines, error bars, statistical tests, or failure-mode breakdown. This information is required to evaluate the central claim of reliable zero-shot performance.
[Abstract] The assertion that the system requires 'no task-specific training' rests on the unexamined assumption that a pretrained VLM produces accurate, physically grounded bounding-box goals; no quantitative validation, ablation, or error analysis of the VLM-to-bounding-box step is supplied to support this load-bearing premise.
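To illustrate the first comment's point about error bars, the sketch below computes a Wilson score interval for the headline result. The count of 66 successes is inferred here, not reported on this page: it is the only whole number of successes out of 90 that rounds to 73.3%. The function name and the choice of a 95% Wilson interval are likewise assumptions for illustration, not the paper's analysis.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 gives about 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1.0 + z * z / trials
    centre = (p + z * z / (2.0 * trials)) / denom
    half = z * math.sqrt(p * (1.0 - p) / trials + z * z / (4.0 * trials * trials)) / denom
    return centre - half, centre + half


# 66 of 90 is inferred from the reported 73.3%; it is not a figure stated on this page.
low, high = wilson_interval(66, 90)
print(f"success rate {66 / 90:.1%}, 95% Wilson interval {low:.1%} to {high:.1%}")
```

Under these assumptions the interval spans roughly 63% to 81%, which shows how much uncertainty a single pooled rate over 90 trials leaves before any breakdown by difficulty level.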

minor comments (1)

[Abstract] The expansion of the acronym GRASP is not given, which reduces immediate readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed feedback on the abstract. We address each major comment below, clarifying what is already in the full manuscript and indicating where revisions will strengthen the presentation.

Referee: [Abstract] The reported 73.3% success rate over 90 trials is stated without any description of the experimental protocol, choice of baselines, error bars, statistical tests, or failure-mode breakdown. This information is required to evaluate the central claim of reliable zero-shot performance.

Authors: The abstract is intentionally concise. The full manuscript details the experimental protocol (90 trials across three difficulty levels), failure-mode breakdown, and success rates in the Experiments section. No baselines are included because the work presents a zero-shot framework rather than a comparative study; no error bars or statistical tests are reported as the evaluation is a feasibility demonstration on a real robot. We will revise the abstract to briefly reference the evaluation setup and note the zero-shot nature of the approach. Revision: yes.

Referee: [Abstract] The assertion that the system requires 'no task-specific training' rests on the unexamined assumption that a pretrained VLM produces accurate, physically grounded bounding-box goals; no quantitative validation, ablation, or error analysis of the VLM-to-bounding-box step is supplied to support this load-bearing premise.

Authors: The 'no task-specific training' claim means the VLM and neuro-symbolic planner are used off-the-shelf without fine-tuning on robot demonstrations or task data. The bounding-box grounding occurs via a separate detection pipeline. While the manuscript does not contain an isolated quantitative ablation of VLM bounding-box accuracy, the end-to-end real-robot success rate provides supporting evidence for the overall pipeline. We will add a short discussion of VLM error modes and their contribution to failures in the revised version. Revision: partial.

Circularity Check

0 steps flagged

No significant circularity; empirical result stands alone

The paper presents an empirical robotics framework (GRASP) that maps language queries to bounding-box goals via a pretrained VLM and then executes via neuro-symbolic planning. The central claim is a measured 73.3% success rate over 90 real-robot trials with no task-specific training. No equations, fitted parameters, or derived quantities appear in the provided text; the success rate is reported as a direct experimental outcome rather than a prediction or theorem. No self-citation chains, uniqueness theorems, or ansatzes are invoked to justify any derivation. The pipeline is therefore self-contained against external benchmarks and receives the default non-circularity finding.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

Review based on abstract only; limited visibility into full set of assumptions or parameters.

axioms (1)

domain assumption: Pretrained VLMs generalize to abstract spatial concepts such as 'top shelf' without fine-tuning on robot data.
Invoked as the mechanism enabling zero-shot performance in the abstract.

invented entities (1)

GRASP framework (no independent evidence)
purpose: Integrate VLM translation with bounding-box grounding and symbolic planning for language-conditioned grasping.
New named system presented in the abstract; no independent evidence provided beyond the reported trials.

pith-pipeline@v0.9.1-grok · 5713 in / 1203 out tokens · 20857 ms · 2026-06-27T06:45:05.697893+00:00 · methodology

Reference graph

Works this paper leans on

29 extracted references · 3 canonical work pages

[1] V. Cohen, J. X. Liu, R. Mooney, S. Tellex, and D. Watkins. A survey of robotic language grounding: Tradeoffs between symbols and embeddings. arXiv preprint arXiv:2405.13245, 2024.

[2] C. Cui, C. Zhu, C. Oh, and A. Cavallaro. Improving generalization of language-conditioned robot manipulation. In 2025 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 13178–13184. IEEE, 2025.

[3] M. U. Din, W. Akram, L. S. Saoud, J. Rosell, and I. Hussain. Vision language action models in robotic manipulation: A systematic review. arXiv preprint arXiv:2507.10672, 2025.

[4] R. Gong, X. Gao, Q. Gao, S. Shakiah, G. Thattai, and G. S. Sukhatme. Lemma: Learning language-conditioned multi-robot manipulation. IEEE Robotics and Automation Letters, 8(10):6835–6842, 2023. doi: 10.1109/LRA.2023.3313058.

[5] H. Guo, F. Wu, Y. Qin, R. Li, K. Li, and K. Li. Recent trends in task and motion planning for robotics: A survey. ACM Computing Surveys, 55(13s):1–36, 2023.

[6] J. Herzog, J. Liu, and Y. Wang. Domain-conditioned scene graphs for state-grounded task planning. arXiv preprint arXiv:2504.06661, 2025.

[7] H. Huang, X. Chen, Y. Chen, H. Li, X. Han, Z. Wang, T. Wang, J. Pang, and Z. Zhao. Roboground: Robotic manipulation with grounded vision-language priors. In Proceedings of the Computer Vision and Pattern Recognition Conference, pp. 22540–22550, 2025.

[8] J. Huang, A. Sethi, M. Kuo, M. Keoliya, N. Velingker, J. Jung, S.-N. Lim, Z. Li, and M. Naik. Esca: Contextualizing embodied agents via scene-graph generation. arXiv preprint arXiv:2510.15963, 2025.

[9] W. Huang, C. Wang, R. Zhang, Y. Li, J. Wu, and L. Fei-Fei. Voxposer: Composable 3d value maps for robotic manipulation with language models, 2023.

[10] M. Jia, H. Huang, Z. Zhang, C. Wang, L. Zhao, D. Wang, J. X. Liu, R. Walters, R. Platt, and S. Tellex. Learning efficient and robust language-conditioned manipulation using textual-visual relevancy and equivariant language mapping, 2025.

[11] Q. Li, Y. Liang, Z. Wang, L. Luo, X. Chen, M. Liao, F. Wei, Y. Deng, S. Xu, Y. Zhang, et al. Cogact: A foundational vision-language-action model for synergizing cognition and action in robotic manipulation. arXiv preprint arXiv:2411.19650, 2024.

[12] R. Likert. A technique for the measurement of attitudes. Archives of Psychology, 1932.

[13] J. Liu, M. Liu, Z. Wang, P. An, X. Li, K. Zhou, S. Yang, R. Zhang, Y. Guo, and S. Zhang. Robomamba: Efficient vision-language-action model for robotic reasoning and manipulation. Advances in Neural Information Processing Systems, 37:40085–40110, 2024.

[14] S. Liu, Z. Zeng, T. Ren, F. Li, H. Zhang, J. Yang, Q. Jiang, C. Li, J. Yang, H. Su, J. Zhu, and L. Zhang. Grounding dino: Marrying dino with grounded pre-training for open-set object detection, 2024.

[15] O. Mees, L. Hermann, E. Rosete-Beas, and W. Burgard. Calvin: A benchmark for language-conditioned policy learning for long-horizon robot manipulation tasks. IEEE Robotics and Automation Letters, 7(3):7327–7334, 2022. doi: 10.1109/LRA.2022.3180108.

[16] A. Ray, J. Arkin, H. Biggie, C. Fan, L. Carlone, and N. Roy. Structured interfaces for automated reasoning with 3d scene graphs. arXiv preprint arXiv:2510.16643, 2025.

[17] R. Shao, W. Li, L. Zhang, R. Zhang, Z. Liu, R. Chen, and L. Nie. Large vlm-based vision-language-action models for robotic manipulation: A survey. arXiv preprint arXiv:2508.13073, 2025.

[18] A. Singh, A. Fry, A. Perelman, A. Tart, A. Ganesh, A. El-Kishky, A. McLaughlin, A. Low, A. Ostrow, A. Ananthram, et al. Openai gpt-5 system card. arXiv preprint arXiv:2601.03267, 2025.

[19] A. Stone, T. Xiao, Y. Lu, K. Gopalakrishnan, K.-H. Lee, Q. Vuong, P. Wohlhart, S. Kirmani, B. Zitkovich, F. Xia, et al. Open-world object manipulation using pre-trained vision-language models. arXiv preprint arXiv:2303.00905, 2023.

[20] S. Tan, D. Zhou, X. Shao, J. Wang, and G. Sun. Language-conditioned open-vocabulary mobile manipulation with pretrained models. arXiv preprint arXiv:2507.17379, 2025.

[21] H. Wang, F. Shahriar, A. Azimi, G. Vasan, R. Mahmood, and C. Bellinger. Versatile and generalizable manipulation via goal-conditioned reinforcement learning with grounded object detection. arXiv preprint arXiv:2507.10814, 2025.

[22] S. Wang, D. Kim, A. Taalimi, C. Sun, and W. Kuo. Learning visual grounding from generative vision and language model. In 2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pp. 8057–8067, 2025. doi: 10.1109/WACV61041.2025.00782.

[23] W. Yan, Q. Yang, S. Huang, Y. Wang, S. Punwani, M. Emberton, V. Stavrinides, Y. Hu, and D. Barratt. Tell2reg: Establishing spatial correspondence between images by the same language prompts, 2025.

[24] X.-W. Yang, J.-J. Shao, L.-Z. Guo, B.-W. Zhang, Z. Zhou, L.-H. Jia, W.-Z. Dai, and Y.-F. Li. Neuro-symbolic artificial intelligence: Towards improving the reasoning abilities of large language models. arXiv preprint arXiv:2508.13678, 2025.

[25] F. Zeng, W. Gan, Y. Wang, N. Liu, and P. S. Yu. Large language models for robotics: A survey. arXiv preprint arXiv:2311.07226, 2023.

[26] S. Zhang, P. Wicke, L. K. Şenel, L. Figueredo, A. Naceri, S. Haddadin, B. Plank, and H. Schütze. Lohoravens: A long-horizon language-conditioned benchmark for robotic tabletop manipulation. arXiv preprint arXiv:2310.12020, 2023.

[27] S. Zhang, Z. Xu, P. Liu, X. Yu, Y. Li, Q. Gao, Z. Fei, Z. Yin, Z. Wu, Y.-G. Jiang, et al. Vlabench: A large-scale benchmark for language-conditioned robotics manipulation with long-horizon reasoning tasks. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 11142–11152, 2025.

[28] Z. Zhao, S. Cheng, Y. Ding, Z. Zhou, S. Zhang, D. Xu, and Y. Zhao. A survey of optimization-based task and motion planning: From classical to learning approaches. IEEE/ASME Transactions on Mechatronics, 30(4):2799–2825, 2025. doi: 10.1109/TMECH.2024.3452509.

[29] M. Zhu, Y. Zhu, J. Li, J. Wen, Z. Xu, Z. Che, C. Shen, Y. Peng, D. Liu, F. Feng, and J. Tang. Language-conditioned robotic manipulation with fast and slow thinking. In 2024 IEEE International Conference on Robotics and Automation (ICRA), pp. 4333–4339, 2024. doi: 10.1109/ICRA57147.2024.10611525.
