pith. sign in

arxiv: 2411.10446 · v3 · submitted 2024-11-15 · 💻 cs.RO · cs.AI

VeriGraph: Scene Graphs for Execution Verifiable Robot Planning

Pith reviewed 2026-05-23 17:08 UTC · model grok-4.3

classification 💻 cs.RO cs.AI
keywords scene graphsrobot planningvision-language modelsplan verificationmanipulation tasksaction feasibilitytask completion
0
0 comments X

The pith

VeriGraph extracts scene graphs from images to verify and iteratively correct LLM-generated robot plans.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents VeriGraph as a framework that pairs vision-language models with scene graphs to make robot task planning more reliable. Scene graphs serve as an intermediate representation of objects and spatial relations extracted from input images, allowing the system to check whether actions proposed by an LLM planner are feasible and to refine them if they violate constraints. The authors demonstrate the method on language-based tasks, tangram puzzles, and image-based manipulation, reporting gains over baselines. A sympathetic reader would care because current language-model planners often output sequences that cannot be executed in the real world.

Core claim

VeriGraph generates a scene graph from input images and uses it to iteratively check and correct action sequences generated by an LLM-based task planner, ensuring constraints are respected and actions are executable. The approach significantly enhances task completion rates across diverse manipulation scenarios, outperforming baseline methods by 58% on language-based tasks, 56% on tangram puzzle tasks, and 30% on image-based tasks.

What carries the argument

Scene graph as an intermediate representation that captures key objects and spatial relationships from images for plan verification and refinement.

If this is right

  • Plans become executable by direct comparison against extracted spatial relations rather than simulation alone.
  • Iterative refinement raises completion rates by 58 percent on language tasks, 56 percent on tangram puzzles, and 30 percent on image tasks.
  • The same verification loop applies across language, puzzle, and visual input formats without retraining the base planner.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If scene-graph extraction can be made dynamic, the same verifier could adjust plans mid-execution when the environment changes.
  • The approach might combine with improved prompting of the underlying language model to further reduce the number of refinement steps required.
  • Similar relation-based checks could be tested in navigation or assembly domains where spatial constraints also dominate feasibility.

Load-bearing premise

Scene graphs extracted from input images contain every object and spatial relation needed to judge whether a proposed action is feasible.

What would settle it

A trial in which the extracted scene graph omits a critical object or relation, causing the verifier to accept an action sequence that cannot actually be performed.

Figures

Figures reproduced from arXiv: 2411.10446 by Abhinav Shrivastava, Archana Swaminathan, Chuong Huynh, Daniel Ekpo, Mara Levy, Saksham Suri.

Figure 1
Figure 1. Figure 1: VeriGraph is able to utilize an initial scene image and a reference image which may or may not be from the same [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Overview of VeriGraph. Two images are input: the start scene (current state) and the goal scene (desired state). A [PITH_FULL_IMAGE:figures/full_fig_p003_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: An example of how the scene graphs are structured for individual images. First, nodes are created for each object in [PITH_FULL_IMAGE:figures/full_fig_p004_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Iterative planning: The planner suggests the first action. [PITH_FULL_IMAGE:figures/full_fig_p004_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Example scenes from the evaluation dataset; (top) [PITH_FULL_IMAGE:figures/full_fig_p005_5.png] view at source ↗
read the original abstract

Recent progress in vision-language models (VLMs) has opened new possibilities for robot task planning, but these models often produce incorrect action sequences. To address these limitations, we propose VeriGraph, a novel framework that integrates VLMs for robotic planning while verifying action feasibility. VeriGraph uses scene graphs as an intermediate representation to capture key objects and spatial relationships, enabling more reliable plan verification and refinement. The system generates a scene graph from input images and uses it to iteratively check and correct action sequences generated by an LLM-based task planner, ensuring constraints are respected and actions are executable. Our approach significantly enhances task completion rates across diverse manipulation scenarios, outperforming baseline methods by 58% on language-based tasks, 56% on tangram puzzle tasks, and 30% on image-based tasks. Qualitative results and code can be found at https://verigraph-agent.github.io.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes VeriGraph, a framework that extracts scene graphs from input images to iteratively verify and refine action sequences generated by LLM-based planners for robotic manipulation tasks. It claims this approach improves task completion rates by 58% on language-based tasks, 56% on tangram puzzle tasks, and 30% on image-based tasks relative to baselines, by enforcing object and spatial constraints during planning.

Significance. If the central results hold after addressing evaluation gaps, the work offers a concrete, intermediate-representation approach to grounding VLM/LLM planning in verifiable scene structure, which could improve reliability in manipulation domains. The public release of qualitative results and code supports reproducibility.

major comments (2)
  1. [§4] §4 (Experiments): The reported percentage improvements (58%/56%/30%) are presented without error bars, number of trials, statistical tests, or details on baseline implementations and dataset sizes. This makes it impossible to determine whether the gains are robust or statistically meaningful.
  2. [§3.2 and §4.3] §3.2 (Scene Graph Extraction) and §4.3 (Ablations): No precision/recall metrics are provided for the VLM-based scene graph extraction on the evaluation scenes, nor is there an ablation that injects realistic perception errors (missing objects or mislabeled relations) and measures downstream task success. This leaves the causal link between scene-graph verification and the claimed gains untested, as the skeptic concern notes.
minor comments (2)
  1. [Figure 2] Figure 2: The diagram of the iterative check-and-correct loop would benefit from explicit arrows or numbering to clarify the flow between LLM planner, scene graph verifier, and refinement steps.
  2. [§2] §2 (Related Work): The comparison to prior scene-graph methods in robotics could include a brief table summarizing key differences in verification mechanisms.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on experimental rigor and validation of the scene-graph component. We address each point below and will revise the manuscript to strengthen the presentation of results and add the requested analyses.

read point-by-point responses
  1. Referee: [§4] §4 (Experiments): The reported percentage improvements (58%/56%/30%) are presented without error bars, number of trials, statistical tests, or details on baseline implementations and dataset sizes. This makes it impossible to determine whether the gains are robust or statistically meaningful.

    Authors: We agree that these details are necessary for assessing robustness. In the revised manuscript we will report the number of trials (50 independent runs per task category), include error bars as standard error of the mean, provide explicit baseline implementation details (identical LLM planner without scene-graph verification), and state dataset sizes (20 language-based tasks, 15 tangram puzzles, 30 image-based scenes). We will also add statistical significance tests (paired t-tests with p-values) to support the reported gains. revision: yes

  2. Referee: [§3.2 and §4.3] §3.2 (Scene Graph Extraction) and §4.3 (Ablations): No precision/recall metrics are provided for the VLM-based scene graph extraction on the evaluation scenes, nor is there an ablation that injects realistic perception errors (missing objects or mislabeled relations) and measures downstream task success. This leaves the causal link between scene-graph verification and the claimed gains untested, as the skeptic concern notes.

    Authors: We acknowledge the importance of quantifying extraction accuracy and testing robustness to perception noise. The revised version will add precision/recall metrics for object and relation extraction on the evaluation scenes in §3.2. In §4.3 we will include a new ablation that injects controlled perception errors (10–20 % missing objects or flipped relations) and reports the resulting task success rates, thereby directly testing the contribution of verification under imperfect perception. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical system with external task metrics

full rationale

The paper describes an empirical framework (VLM + scene-graph verification loop) evaluated on task completion rates. No equations, fitted parameters, or first-principles derivations are present that could reduce to inputs by construction. Performance numbers (58%/56%/30%) are reported against external baselines on language, tangram, and image tasks; the central claim does not rely on self-citation chains or renamed known results. The assumption that extracted scene graphs are complete is a modeling choice whose validity is tested by downstream success, not presupposed by the method itself.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The framework rests on the domain assumption that scene graphs are sufficient for feasibility checking; no free parameters or invented entities are mentioned in the abstract.

axioms (1)
  • domain assumption Scene graphs extracted from images accurately capture objects and spatial relations needed for action verification
    Invoked when the system uses the graph to iteratively check and correct plans.

pith-pipeline@v0.9.0 · 5692 in / 1077 out tokens · 35808 ms · 2026-05-23T17:08:58.624947+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

54 extracted references · 54 canonical work pages · 11 internal anchors

  1. [1]

    Language models as zero-shot planners: Extracting actionable knowledge for embodied agents,

    W. Huang, P. Abbeel, D. Pathak, and I. Mordatch, “Language models as zero-shot planners: Extracting actionable knowledge for embodied agents,” in Interna- tional conference on machine learning . PMLR, 2022, pp. 9118–9147

  2. [2]

    Grounded decoding: Guiding text generation with grounded models for robot control,

    W. Huang, F. Xia, D. Shah, D. Driess, A. Zeng, Y . Lu, P. Florence, I. Mordatch, S. Levine, K. Hausman et al. , “Grounded decoding: Guiding text generation with grounded models for robot control,” arXiv preprint arXiv:2303.00855, 2023

  3. [3]

    A natural language model for generating pddl,

    N. Simon and C. Muise, “A natural language model for generating pddl,” in ICAPS KEPS workshop , 2021

  4. [4]

    Grammar prompting for domain-specific lan- guage generation with large language models,

    B. Wang, Z. Wang, X. Wang, Y . Cao, R. A Saurous, and Y . Kim, “Grammar prompting for domain-specific lan- guage generation with large language models,” Advances in Neural Information Processing Systems, vol. 36, 2024

  5. [5]

    Large language models as planning domain generators,

    J. Oswald, K. Srinivas, H. Kokel, J. Lee, M. Katz, and S. Sohrabi, “Large language models as planning domain generators,” Proceedings of the International Conference on Automated Planning and Scheduling , vol. 34, pp. 423–431, May 2024. [Online]. Available: https: //ojs.aaai.org/index.php/ICAPS/article/view/31502

  6. [6]

    Challenges in modelling and solving plotting with pddl,

    J. Espasa, I. Miguel, P. Nightingale, A. Z. Salamon, and M. Villaret, “Challenges in modelling and solving plotting with pddl,” 2023. [Online]. Available: https://arxiv.org/abs/2310.01470

  7. [7]

    Dkprompt: Domain knowledge prompting vision-language models for open-world planning,

    X. Zhang, Z. Altaweel, Y . Hayamizu, Y . Ding, S. Amiri, H. Yang, A. Kaminski, C. Esselink, and S. Zhang, “Dkprompt: Domain knowledge prompting vision-language models for open-world planning,” 2024. [Online]. Available: https://arxiv.org/abs/2406.17659

  8. [8]

    Do as i can, not as i say: Grounding language in robotic affordances,

    M. Ahn, A. Brohan, N. Brown, Y . Chebotar, O. Cortes, B. David, C. Finn, C. Fu, K. Gopalakrishnan, K. Haus- man, A. Herzog, D. Ho, J. Hsu, J. Ibarz, B. Ichter, A. Irpan, E. Jang, R. J. Ruano, K. Jeffrey, S. Jesmonth, N. J. Joshi, R. Julian, D. Kalashnikov, Y . Kuang, K.-H. Lee, S. Levine, Y . Lu, L. Luu, C. Parada, P. Pastor, J. Quiambao, K. Rao, J. Retti...

  9. [9]

    Look before you leap: Unveiling the power of gpt-4v in robotic vision- language planning,

    Y . Hu, F. Lin, T. Zhang, L. Yi, and Y . Gao, “Look before you leap: Unveiling the power of gpt-4v in robotic vision- language planning,” arXiv preprint arXiv:2311.17842 , 2023

  10. [10]

    Interactive task planning with language models,

    B. Li, P. Wu, P. Abbeel, and J. Malik, “Interactive task planning with language models,” 2023

  11. [11]

    Hier- archical planning for long-horizon manipulation with geometric and symbolic scene graphs,

    Y . Zhu, J. Tremblay, S. Birchfield, and Y . Zhu, “Hier- archical planning for long-horizon manipulation with geometric and symbolic scene graphs,” 2021

  12. [12]

    Conceptgraphs: Open-vocabulary 3d scene graphs for perception and planning,

    Q. Gu, A. Kuwajerwala, S. Morin, K. M. Jatavallabhula, B. Sen, A. Agarwal, C. Rivera, W. Paul, K. Ellis, R. Chellappa, C. Gan, C. M. de Melo, J. B. Tenenbaum, A. Torralba, F. Shkurti, and L. Paull, “Conceptgraphs: Open-vocabulary 3d scene graphs for perception and planning,” 2023

  13. [13]

    Reasoning with scene graphs for robot planning under partial observability,

    S. Amiri, K. Chandan, and S. Zhang, “Reasoning with scene graphs for robot planning under partial observability,” 2022

  14. [14]

    Sayplan: Grounding large language models using 3d scene graphs for scalable task planning,

    K. Rana, J. Haviland, S. Garg, J. Abou-Chakra, I. Reid, and N. Suenderhauf, “Sayplan: Grounding large language models using 3d scene graphs for scalable task planning,” in 7th Annual Conference on Robot Learning , 2023. [Online]. Available: https: //openreview.net/forum?id=wMpOMO0Ss7a

  15. [15]

    On the planning abilities of large language models-a critical investigation,

    K. Valmeekam, M. Marquez, S. Sreedharan, and S. Kambhampati, “On the planning abilities of large language models-a critical investigation,” Advances in Neural Information Processing Systems , vol. 36, pp. 75 993–76 005, 2023

  16. [16]

    Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations

    R. Krishna, Y . Zhu, O. Groth, J. Johnson, K. Hata, J. Kravitz, S. Chen, Y . Kalantidis, L.-J. Li, D. A. Shamma, M. S. Bernstein, and F.-F. Li, “Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations,” Feb. 2016. [Online]. Available: http://arxiv.org/abs/1602.07332

  17. [17]

    Image generation from scene graphs,

    J. Johnson, A. Gupta, and L. Fei-Fei, “Image generation from scene graphs,” in 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition , 2018, pp. 1219–1228

  18. [18]

    High-quality image generation from scene graphs with transformer,

    X. Zhao, L. Wu, X. Chen, and B. Gong, “High-quality image generation from scene graphs with transformer,” in 2022 IEEE International Conference on Multimedia and Expo (ICME) , 2022, pp. 1–6

  19. [19]

    Interactive Image Generation Using Scene Graphs

    G. Mittal, S. Agrawal, A. Agarwal, S. Mehta, and T. Marwah, “Interactive image generation using scene graphs,” arXiv preprint arXiv:1905.03743 , 2019

  20. [20]

    Using Scene Graph Context to Improve Image Generation

    S. Tripathi, A. Bhiwandiwalla, A. Bastidas, and H. Tang, “Using scene graph context to improve image generation,” arXiv preprint arXiv:1901.03762 , 2019

  21. [21]

    Image gen- eration from layout,

    B. Zhao, L. Meng, W. Yin, and L. Sigal, “Image gen- eration from layout,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 8584–8593

  22. [22]

    Diffusion-based scene graph to image generation with masked contrastive pre-training,

    L. Yang, Z. Huang, Y . Song, S. Hong, G. Li, W. Zhang, B. Cui, B. Ghanem, and M.-H. Yang, “Diffusion-based scene graph to image generation with masked contrastive pre-training,” arXiv preprint arXiv:2211.11138 , 2022

  23. [23]

    Image captioning with scene-graph based semantic concepts,

    L. Gao, B. Wang, and W. Wang, “Image captioning with scene-graph based semantic concepts,” in Proceedings of the 2018 10th international conference on machine learning and computing , 2018, pp. 225–229

  24. [24]

    Comprehensive image captioning via scene graph de- composition,

    Y . Zhong, L. Wang, J. Chen, D. Yu, and Y . Li, “Comprehensive image captioning via scene graph de- composition,” in Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XIV 16 . Springer, 2020, pp. 211–229

  25. [25]

    Transforming visual scene graphs to image captions,

    X. Yang, J. Peng, Z. Wang, H. Xu, Q. Ye, C. Li, M. Yan, F. Huang, Z. Li, and Y . Zhang, “Transforming visual scene graphs to image captions,” arXiv preprint arXiv:2305.02177, 2023

  26. [26]

    An empirical study on leveraging scene graphs for visual question answering,

    C. Zhang, W.-L. Chao, and D. Xuan, “An empirical study on leveraging scene graphs for visual question answering,” arXiv preprint arXiv:1907.12133 , 2019

  27. [27]

    Visual question answering over scene graph,

    S. Lee, J.-W. Kim, Y . Oh, and J. H. Jeon, “Visual question answering over scene graph,” in 2019 First International Conference on Graph Computing (GC) , 2019, pp. 45–50

  28. [28]

    Understanding the role of scene graphs in visual question answering,

    V . Damodaran, S. Chakravarthy, A. Kumar, A. Umapa- thy, T. Mitamura, Y . Nakashima, N. Garcia, and C. Chu, “Understanding the role of scene graphs in visual question answering,” arXiv preprint arXiv:2101.05479 , 2021

  29. [29]

    A deep learn- ing based behavioral approach to indoor autonomous navigation,

    G. Sepulveda, J. C. Niebles, and A. Soto, “A deep learn- ing based behavioral approach to indoor autonomous navigation,” in 2018 IEEE International Conference on Robotics and Automation (ICRA) , 2018, pp. 4646–4653

  30. [30]

    Hierarchical representations and explicit memory: Learning effective navigation policies on 3d scene graphs using graph neural networks,

    Z. Ravichandran, L. Peng, N. Hughes, J. D. Griffith, and L. Carlone, “Hierarchical representations and explicit memory: Learning effective navigation policies on 3d scene graphs using graph neural networks,” in 2022 International Conference on Robotics and Automation (ICRA), 2022, pp. 9272–9279

  31. [31]

    Grid: Scene-graph-based instruction- driven robotic task planning,

    Z. Ni, X.-X. Deng, C. Tai, X.-Y . Zhu, X. Wu, Y .-J. Liu, and L. Zeng, “Grid: Scene-graph-based instruction- driven robotic task planning,” 2023

  32. [32]

    Scene graph for embodied exploration in cluttered scenario,

    Y . Deng, Q. Sima, D. Guo, H. Liu, Y . Wang, and F. Sun, “Scene graph for embodied exploration in cluttered scenario,” 2023

  33. [33]

    Sg-bot: Object rearrangement via coarse-to-fine robotic imagination on scene graphs,

    G. Zhai, X. Cai, D. Huang, Y . Di, F. Manhardt, F. Tombari, N. Navab, and B. Busam, “Sg-bot: Object rearrangement via coarse-to-fine robotic imagination on scene graphs,” 2023

  34. [34]

    Sequential Manipulation Planning on Scene Graph,

    Z. Jiao, Y . Niu, Z. Zhang, S.-C. Zhu, Y . Zhu, and H. Liu, “Sequential Manipulation Planning on Scene Graph,” in 2022 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) . Kyoto, Japan: IEEE, Oct. 2022, pp. 8203–8210. [Online]. Available: https://ieeexplore.ieee.org/document/9981735/

  35. [35]

    LLaMA: Open and Efficient Foundation Language Models

    H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar et al., “Llama: Open and efficient foundation language models,” arXiv preprint arXiv:2302.13971 , 2023

  36. [36]

    Mistral 7B

    A. Q. Jiang, A. Sablayrolles, A. Mensch, C. Bamford, D. S. Chaplot, D. d. l. Casas, F. Bressand, G. Lengyel, G. Lample, L. Saulnier et al. , “Mistral 7b,” arXiv preprint arXiv:2310.06825, 2023

  37. [37]

    PaLM 2 Technical Report

    R. Anil, A. M. Dai, O. Firat, M. Johnson, D. Lep- ikhin, A. Passos, S. Shakeri, E. Taropa, P. Bailey, Z. Chen et al., “Palm 2 technical report,” arXiv preprint arXiv:2305.10403, 2023

  38. [38]

    Gpt-4 technical report,

    OpenAI, :, J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. Aleman, D. Almeida, J. Altenschmidt, S. Altman, S. Anadkat, R. Avila, I. Babuschkin, S. Balaji, V . Balcom, P. Baltescu, H. Bao, M. Bavarian, J. Belgum, I. Bello, J. Berdine, G. Bernadett-Shapiro, C. Berner, L. Bogdonoff, O. Boiko, M. Boyd, A.-L. Brakman, G. Brockman, T. Brooks, M. Bru...

  39. [39]

    Vipergpt: Visual inference via python execution for reasoning,

    D. Surís, S. Menon, and C. V ondrick, “Vipergpt: Visual inference via python execution for reasoning,” 2023

  40. [40]

    Improved Baselines with Visual Instruction Tuning

    H. Liu, C. Li, Y . Li, and Y . J. Lee, “Improved base- lines with visual instruction tuning,” arXiv preprint arXiv:2310.03744, 2023

  41. [41]

    Decomplex: Task planning from complex natural instructions by a collocating robot,

    P. Pramanick, H. B. Barua, and C. Sarkar, “Decomplex: Task planning from complex natural instructions by a collocating robot,” in 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) . IEEE, 2020, pp. 6894–6901

  42. [42]

    Translating natural language instructions to computer programs for robot manipulation,

    S. G. Venkatesh, R. Upadrashta, and B. Amrutur, “Translating natural language instructions to computer programs for robot manipulation,” in 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2021, pp. 1919–1926

  43. [43]

    Inner monologue: Embod- ied reasoning through planning with language models,

    W. Huang, F. Xia, T. Xiao, H. Chan, J. Liang, P. Flo- rence, A. Zeng, J. Tompson, I. Mordatch, Y . Chebotar, P. Sermanet, N. Brown, T. Jackson, L. Luu, S. Levine, K. Hausman, and B. Ichter, “Inner monologue: Embod- ied reasoning through planning with language models,” 2022

  44. [44]

    ReAct: Synergizing Reasoning and Acting in Language Models

    S. Yao, J. Zhao, D. Yu, N. Du, I. Shafran, K. Narasimhan, and Y . Cao, “React: Synergizing reasoning and acting in language models,” arXiv preprint arXiv:2210.03629 , 2022

  45. [45]

    RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control

    A. Brohan, N. Brown, J. Carbajal, Y . Chebotar, X. Chen, K. Choromanski, T. Ding, D. Driess, A. Dubey, C. Finn et al. , “Rt-2: Vision-language-action models transfer web knowledge to robotic control,” arXiv preprint arXiv:2307.15818, 2023

  46. [46]

    PaLM-E: An Embodied Multimodal Language Model

    D. Driess, F. Xia, M. S. Sajjadi, C. Lynch, A. Chowdhery, B. Ichter, A. Wahid, J. Tompson, Q. Vuong, T. Yu et al., “Palm-e: An embodied multimodal language model,” arXiv preprint arXiv:2303.03378 , 2023

  47. [47]

    Copal: Corrective planning of robot actions with large language models,

    F. Joublin, A. Ceravola, P. Smirnov, F. Ocker, J. Deigmoeller, A. Belardinelli, C. Wang, S. Hasler, D. Tanneberg, and M. Gienger, “Copal: Corrective planning of robot actions with large language models,” 2023

  48. [48]

    Is feedback all you need? leveraging natural language feedback in goal-conditioned reinforcement learning,

    S. McCallum, M. Taylor-Davies, S. V . Albrecht, and A. Suglia, “Is feedback all you need? leveraging natural language feedback in goal-conditioned reinforcement learning,” 2023

  49. [49]

    Reflect: Summarizing robot experiences for failure explanation and correction,

    Z. Liu, A. Bahety, and S. Song, “Reflect: Summarizing robot experiences for failure explanation and correction,” 2023

  50. [50]

    V oyager: An open-ended embodied agent with large language models,

    G. Wang, Y . Xie, Y . Jiang, A. Mandlekar, C. Xiao, Y . Zhu, L. Fan, and A. Anandkumar, “V oyager: An open-ended embodied agent with large language models,” 2023

  51. [51]

    Openai (2023),

    OpenAI, “Openai (2023),” 2023. [Online]. Available: https://openai.com/index/gpt-4v-system-card/

  52. [52]

    Gemini: A Family of Highly Capable Multimodal Models

    G. Team, R. Anil, S. Borgeaud, J.-B. Alayrac, J. Yu, R. Soricut, J. Schalkwyk, A. M. Dai, A. Hauth, K. Millican, D. Silver, M. Johnson, I. Antonoglou, J. Schrittwieser, A. Glaese, J. Chen, E. Pitler, T. Lillicrap, A. Lazaridou, O. Firat, J. Molloy, M. Isard, P. R. Barham, T. Hennigan, B. Lee, F. Viola, M. Reynolds, Y . Xu, R. Doherty, E. Collins, C. Meyer...

  53. [53]

    Visual instruction tuning,

    H. Liu, C. Li, Q. Wu, and Y . J. Lee, “Visual instruction tuning,” 2023

  54. [54]

    [Online]

    github, “Github,” 2024. [Online]. Available: https: //github.com/ollama/ollama