VeriGraph: Scene Graphs for Execution Verifiable Robot Planning

Abhinav Shrivastava; Archana Swaminathan; Chuong Huynh; Daniel Ekpo; Mara Levy; Saksham Suri

arxiv: 2411.10446 · v3 · submitted 2024-11-15 · 💻 cs.RO · cs.AI

Daniel Ekpo, Mara Levy, Saksham Suri, Chuong Huynh, Archana Swaminathan, Abhinav Shrivastava

Pith reviewed 2026-05-23 17:08 UTC · model grok-4.3

keywords scene graphs · robot planning · vision-language models · plan verification · manipulation tasks · action feasibility · task completion

The pith

VeriGraph extracts scene graphs from images to verify and iteratively correct LLM-generated robot plans.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents VeriGraph as a framework that pairs vision-language models with scene graphs to make robot task planning more reliable. Scene graphs serve as an intermediate representation of objects and spatial relations extracted from input images, allowing the system to check whether actions proposed by an LLM planner are feasible and to refine them if they violate constraints. The authors demonstrate the method on language-based tasks, tangram puzzles, and image-based manipulation, reporting gains over baselines. A sympathetic reader would care because current language-model planners often output sequences that cannot be executed in the real world.

Core claim

VeriGraph generates a scene graph from input images and uses it to iteratively check and correct action sequences generated by an LLM-based task planner, ensuring constraints are respected and actions are executable. The approach significantly enhances task completion rates across diverse manipulation scenarios, outperforming baseline methods by 58% on language-based tasks, 56% on tangram puzzle tasks, and 30% on image-based tasks.

What carries the argument

Scene graph as an intermediate representation that captures key objects and spatial relationships from images for plan verification and refinement.
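
One way to picture such an intermediate representation is a set of object names plus a set of (subject, predicate, object) triples. The sketch below is illustrative only: the class name `SceneGraph`, its methods, and the `on` predicate are assumptions made for this example and are not taken from the paper or its released code.

```python
from dataclasses import dataclass, field
from typing import Set, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)


@dataclass
class SceneGraph:
    """Hypothetical minimal scene graph: object nodes and relation triples."""
    objects: Set[str] = field(default_factory=set)
    relations: Set[Triple] = field(default_factory=set)

    def add(self, subject: str, predicate: str, obj: str) -> None:
        # Register both endpoints as nodes and store the relation as an edge.
        self.objects.update((subject, obj))
        self.relations.add((subject, predicate, obj))

    def holds(self, subject: str, predicate: str, obj: str) -> bool:
        return (subject, predicate, obj) in self.relations

    def is_clear(self, name: str) -> bool:
        # True when no relation records another object resting on `name`.
        return not any(p == "on" and o == name for (_, p, o) in self.relations)


scene = SceneGraph()
scene.add("red_block", "on", "blue_block")
scene.add("blue_block", "on", "table")
```

A check such as `scene.is_clear("blue_block")` is the kind of query a verifier could run before accepting an action that picks up the blue block.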

If this is right

Plans become executable by direct comparison against extracted spatial relations rather than simulation alone.
Iterative refinement raises completion rates over baselines by 58% on language tasks, 56% on tangram puzzles, and 30% on image tasks.
The same verification loop applies across language, puzzle, and visual input formats without retraining the base planner.
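
The check-and-correct loop described above can be sketched in a few lines. This is a toy under stated assumptions, not the authors' implementation: the state is a set of `(subject, "on", object)` triples, the only action is a hypothetical `move`, and `stub_planner` is a hand-written stand-in for the LLM planner that repairs a rejected step by moving the blocking object to the table.

```python
TABLE = "table"


def blockers(state, name):
    """Objects recorded as resting directly on `name`."""
    return sorted(s for (s, p, o) in state if p == "on" and o == name)


def check(state, action):
    """Return None if the move is feasible, else the name of a blocking object."""
    _, obj, dest = action
    on_obj = blockers(state, obj)
    if on_obj:
        return on_obj[0]
    if dest != TABLE:
        on_dest = blockers(state, dest)
        if on_dest:
            return on_dest[0]
    return None


def apply(state, action):
    """Update the graph: the moved object now rests on its destination."""
    _, obj, dest = action
    kept = {t for t in state if not (t[0] == obj and t[1] == "on")}
    return kept | {(obj, "on", dest)}


def stub_planner(state, goal, blocker):
    """Hypothetical stand-in for the LLM planner."""
    if blocker is not None:
        # Naive repair: clear the object named in the verifier's feedback.
        return ("move", blocker, TABLE)
    for (s, _, o) in sorted(goal - state):
        return ("move", s, o)
    return None


def plan_with_verification(state, goal, planner, max_steps=10):
    plan, blocker = [], None
    for _ in range(max_steps):
        if goal <= state:
            return plan
        action = planner(state, goal, blocker)
        if action is None:
            break
        blocker = check(state, action)
        if blocker is None:  # accepted: commit the action and update the graph
            state = apply(state, action)
            plan.append(action)
    return plan if goal <= state else None


start = {("A", "on", "B"), ("B", "on", TABLE), ("C", "on", TABLE)}
goal = {("B", "on", "C")}
plan = plan_with_verification(start, goal, stub_planner)
```

In this example the planner's first proposal, moving `B` onto `C`, is rejected because `A` sits on `B`; the repaired plan moves `A` to the table first. The planner is passed in as a function, which mirrors the claim that the verifier can wrap a planner without retraining it.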

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

If scene-graph extraction can be made dynamic, the same verifier could adjust plans mid-execution when the environment changes.
The approach might combine with improved prompting of the underlying language model to further reduce the number of refinement steps required.
Similar relation-based checks could be tested in navigation or assembly domains where spatial constraints also dominate feasibility.

Load-bearing premise

Scene graphs extracted from input images contain every object and spatial relation needed to judge whether a proposed action is feasible.

What would settle it

A trial in which the extracted scene graph omits a critical object or relation, causing the verifier to accept an action sequence that cannot actually be performed.
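
Such a trial is easy to state concretely. In the sketch below the scene, the object names, and the single feasibility rule are all invented for illustration: the extracted graph has lost the relation saying a cup rests on a book, so a rule that only consults the graph accepts a move that the true scene forbids.

```python
def accepts_move(graph, obj):
    """Toy verifier rule: an object may be moved only if nothing is recorded on it."""
    return not any(p == "on" and o == obj for (_, p, o) in graph)


# Ground-truth relations in the physical scene (hypothetical example).
true_scene = {("cup", "on", "book"), ("book", "on", "table")}

# What an imperfect extractor might return: the cup was missed entirely.
extracted = {("book", "on", "table")}

accepted_by_verifier = accepts_move(extracted, "book")
feasible_in_reality = accepts_move(true_scene, "book")

# A false accept: the verifier passes an action the real scene does not allow.
false_accept = accepted_by_verifier and not feasible_in_reality
```

Counting false accepts of this kind over the evaluation scenes would measure directly how often the premise fails.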

Figures

Figures reproduced from arXiv: 2411.10446 by Abhinav Shrivastava, Archana Swaminathan, Chuong Huynh, Daniel Ekpo, Mara Levy, Saksham Suri.

**Figure 1.** VeriGraph is able to utilize an initial scene image and a reference image which may or may not be from the same …

**Figure 2.** Overview of VeriGraph. Two images are input: the start scene (current state) and the goal scene (desired state). A …

**Figure 3.** An example of how the scene graphs are structured for individual images. First, nodes are created for each object in …

**Figure 4.** Iterative planning: The planner suggests the first action. …

**Figure 5.** Example scenes from the evaluation dataset; (top) …

Original abstract

Recent progress in vision-language models (VLMs) has opened new possibilities for robot task planning, but these models often produce incorrect action sequences. To address these limitations, we propose VeriGraph, a novel framework that integrates VLMs for robotic planning while verifying action feasibility. VeriGraph uses scene graphs as an intermediate representation to capture key objects and spatial relationships, enabling more reliable plan verification and refinement. The system generates a scene graph from input images and uses it to iteratively check and correct action sequences generated by an LLM-based task planner, ensuring constraints are respected and actions are executable. Our approach significantly enhances task completion rates across diverse manipulation scenarios, outperforming baseline methods by 58% on language-based tasks, 56% on tangram puzzle tasks, and 30% on image-based tasks. Qualitative results and code can be found at https://verigraph-agent.github.io.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

VeriGraph adds an iterative scene-graph check to LLM robot planning and shows task gains, but the gains rest on untested extraction quality.

The core move is using scene graphs extracted from images to verify and refine LLM-generated action sequences before they reach the robot. That loop is the actual addition over plain VLM planning, and the paper ships code plus qualitative examples that make the mechanism concrete. The reported lifts (58% on language tasks, 56% on tangram, 30% on image tasks) are the numbers that would matter if they hold up under scrutiny. The approach is straightforward enough that groups already running VLM planners could try the verification step without starting from scratch.

What is missing is any direct test of the extraction step itself. The load-bearing premise is the weak point: if the graph misses an object or a spatial relation, the feasibility check has nothing to work with, yet the paper gives no precision-recall numbers on the graphs for the evaluation scenes and no ablation that adds realistic perception noise. The abstract also omits error bars, dataset sizes, and baseline implementation details, so it is hard to tell how much of the improvement comes from the graph loop versus other factors like prompting or task selection.

This is the kind of paper that belongs in a robotics planning reading group for the implementation details rather than the claims. A serious editor should send it out for review; the integration is real and the results are worth checking, but the authors will need to add the missing controls on the perception side before the central claim is convincing.

Referee Report

2 major / 2 minor

Summary. The paper proposes VeriGraph, a framework that extracts scene graphs from input images to iteratively verify and refine action sequences generated by LLM-based planners for robotic manipulation tasks. It claims this approach improves task completion rates by 58% on language-based tasks, 56% on tangram puzzle tasks, and 30% on image-based tasks relative to baselines, by enforcing object and spatial constraints during planning.

Significance. If the central results hold after addressing evaluation gaps, the work offers a concrete, intermediate-representation approach to grounding VLM/LLM planning in verifiable scene structure, which could improve reliability in manipulation domains. The public release of qualitative results and code supports reproducibility.

major comments (2)

§4 (Experiments): The reported percentage improvements (58%/56%/30%) are presented without error bars, number of trials, statistical tests, or details on baseline implementations and dataset sizes. This makes it impossible to determine whether the gains are robust or statistically meaningful.
§3.2 (Scene Graph Extraction) and §4.3 (Ablations): No precision/recall metrics are provided for the VLM-based scene graph extraction on the evaluation scenes, nor is there an ablation that injects realistic perception errors (missing objects or mislabeled relations) and measures downstream task success. This leaves the causal link between scene-graph verification and the claimed gains untested.
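
The second comment asks for two measurements that are cheap to define once a reference graph exists for each scene. The following is a minimal sketch, assuming graphs are sets of (subject, predicate, object) triples and that hand-annotated reference graphs are available; the example triples are made up, the function names are hypothetical, and only the "missing relation" kind of perception error is simulated.

```python
import random


def precision_recall(predicted, reference):
    """Triple-level precision and recall of an extracted graph."""
    true_positives = len(predicted & reference)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    return precision, recall


def drop_triples(graph, rate, rng):
    """Simulate missed detections: remove each triple with probability `rate`."""
    return {t for t in graph if rng.random() >= rate}


# Hypothetical annotation and extractor output for one scene.
reference = {
    ("cup", "on", "book"),
    ("book", "on", "table"),
    ("pen", "left_of", "book"),
    ("pen", "on", "table"),
}
predicted = {
    ("cup", "on", "book"),
    ("book", "on", "table"),
    ("pen", "on", "book"),  # wrong relation: counts against precision
}

precision, recall = precision_recall(predicted, reference)

# Perturbed input for an ablation: feed `noisy` to the verifier instead of the
# clean graph and record task success at each noise rate.
noisy = drop_triples(reference, rate=0.2, rng=random.Random(0))
```

Here two of the three predicted triples are correct and two of the four reference triples are recovered. Sweeping `rate` and plotting task success against it would show how quickly verification degrades as the graph loses relations.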

minor comments (2)

Figure 2: The diagram of the iterative check-and-correct loop would benefit from explicit arrows or numbering to clarify the flow between LLM planner, scene graph verifier, and refinement steps.
§2 (Related Work): The comparison to prior scene-graph methods in robotics could include a brief table summarizing key differences in verification mechanisms.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on experimental rigor and validation of the scene-graph component. We address each point below and will revise the manuscript to strengthen the presentation of results and add the requested analyses.

Referee: §4 (Experiments): The reported percentage improvements (58%/56%/30%) are presented without error bars, number of trials, statistical tests, or details on baseline implementations and dataset sizes. This makes it impossible to determine whether the gains are robust or statistically meaningful.

Authors: We agree that these details are necessary for assessing robustness. In the revised manuscript we will report the number of trials (50 independent runs per task category), include error bars as standard error of the mean, provide explicit baseline implementation details (identical LLM planner without scene-graph verification), and state dataset sizes (20 language-based tasks, 15 tangram puzzles, 30 image-based scenes). We will also add statistical significance tests (paired t-tests with p-values) to support the reported gains.
Revision: yes

Referee: §3.2 (Scene Graph Extraction) and §4.3 (Ablations): No precision/recall metrics are provided for the VLM-based scene graph extraction on the evaluation scenes, nor is there an ablation that injects realistic perception errors (missing objects or mislabeled relations) and measures downstream task success. This leaves the causal link between scene-graph verification and the claimed gains untested.

Authors: We acknowledge the importance of quantifying extraction accuracy and testing robustness to perception noise. The revised version will add precision/recall metrics for object and relation extraction on the evaluation scenes in §3.2. In §4.3 we will include a new ablation that injects controlled perception errors (10–20% missing objects or flipped relations) and reports the resulting task success rates, thereby directly testing the contribution of verification under imperfect perception.
Revision: yes
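
The statistics promised in the first response, standard error of the mean and a paired t-test, can be computed with the Python standard library alone. The sketch below uses made-up per-task success rates purely to show the arithmetic; none of the numbers come from the paper, and the function names are hypothetical.

```python
import math
import statistics


def sem(values):
    """Standard error of the mean: sample standard deviation over sqrt(n)."""
    return statistics.stdev(values) / math.sqrt(len(values))


def paired_t(a, b):
    """Paired t statistic and degrees of freedom for matched samples a and b."""
    diffs = [x - y for x, y in zip(a, b)]
    # Undefined (division by zero) if every paired difference is identical.
    return statistics.mean(diffs) / sem(diffs), len(diffs) - 1


# Placeholder success rates on the same five tasks (illustrative only).
with_verifier = [0.9, 0.8, 0.7, 0.9, 0.6]
baseline = [0.5, 0.6, 0.4, 0.5, 0.5]

error_bar = sem(with_verifier)
t_stat, dof = paired_t(with_verifier, baseline)
```

The pairing matters: both systems are run on the same tasks, so the test is applied to per-task differences and not to two independent groups. Turning `t_stat` into a p-value needs the t distribution, which the standard library does not provide; `scipy.stats.ttest_rel(with_verifier, baseline)` returns both the statistic and the p-value in one call.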

Circularity Check

0 steps flagged

No circularity: empirical system with external task metrics

full rationale

The paper describes an empirical framework (VLM + scene-graph verification loop) evaluated on task completion rates. No equations, fitted parameters, or first-principles derivations are present that could reduce to inputs by construction. Performance numbers (58%/56%/30%) are reported against external baselines on language, tangram, and image tasks; the central claim does not rely on self-citation chains or renamed known results. The assumption that extracted scene graphs are complete is a modeling choice whose validity is tested by downstream success, not presupposed by the method itself.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The framework rests on the domain assumption that scene graphs are sufficient for feasibility checking; no free parameters or invented entities are mentioned in the abstract.

axioms (1)

Domain assumption: Scene graphs extracted from images accurately capture the objects and spatial relations needed for action verification.
Invoked when the system uses the graph to iteratively check and correct plans.

[1] [1]

Language models as zero-shot planners: Extracting actionable knowledge for embodied agents,

W. Huang, P. Abbeel, D. Pathak, and I. Mordatch, “Language models as zero-shot planners: Extracting actionable knowledge for embodied agents,” in Interna- tional conference on machine learning . PMLR, 2022, pp. 9118–9147

work page 2022

[2] [2]

Grounded decoding: Guiding text generation with grounded models for robot control,

W. Huang, F. Xia, D. Shah, D. Driess, A. Zeng, Y . Lu, P. Florence, I. Mordatch, S. Levine, K. Hausman et al. , “Grounded decoding: Guiding text generation with grounded models for robot control,” arXiv preprint arXiv:2303.00855, 2023

work page arXiv 2023

[3] [3]

A natural language model for generating pddl,

N. Simon and C. Muise, “A natural language model for generating pddl,” in ICAPS KEPS workshop , 2021

work page 2021

[4] [4]

Grammar prompting for domain-specific lan- guage generation with large language models,

B. Wang, Z. Wang, X. Wang, Y . Cao, R. A Saurous, and Y . Kim, “Grammar prompting for domain-specific lan- guage generation with large language models,” Advances in Neural Information Processing Systems, vol. 36, 2024

work page 2024

[5] [5]

Large language models as planning domain generators,

J. Oswald, K. Srinivas, H. Kokel, J. Lee, M. Katz, and S. Sohrabi, “Large language models as planning domain generators,” Proceedings of the International Conference on Automated Planning and Scheduling , vol. 34, pp. 423–431, May 2024. [Online]. Available: https: //ojs.aaai.org/index.php/ICAPS/article/view/31502

work page 2024

[6] [6]

Challenges in modelling and solving plotting with pddl,

J. Espasa, I. Miguel, P. Nightingale, A. Z. Salamon, and M. Villaret, “Challenges in modelling and solving plotting with pddl,” 2023. [Online]. Available: https://arxiv.org/abs/2310.01470

work page arXiv 2023

[7] [7]

Dkprompt: Domain knowledge prompting vision-language models for open-world planning,

X. Zhang, Z. Altaweel, Y . Hayamizu, Y . Ding, S. Amiri, H. Yang, A. Kaminski, C. Esselink, and S. Zhang, “Dkprompt: Domain knowledge prompting vision-language models for open-world planning,” 2024. [Online]. Available: https://arxiv.org/abs/2406.17659

work page arXiv 2024

[8] [8]

Do as i can, not as i say: Grounding language in robotic affordances,

M. Ahn, A. Brohan, N. Brown, Y . Chebotar, O. Cortes, B. David, C. Finn, C. Fu, K. Gopalakrishnan, K. Haus- man, A. Herzog, D. Ho, J. Hsu, J. Ibarz, B. Ichter, A. Irpan, E. Jang, R. J. Ruano, K. Jeffrey, S. Jesmonth, N. J. Joshi, R. Julian, D. Kalashnikov, Y . Kuang, K.-H. Lee, S. Levine, Y . Lu, L. Luu, C. Parada, P. Pastor, J. Quiambao, K. Rao, J. Retti...

work page 2022

[9] Y. Hu, F. Lin, T. Zhang, L. Yi, and Y. Gao, “Look before you leap: Unveiling the power of gpt-4v in robotic vision-language planning,” arXiv preprint arXiv:2311.17842, 2023

[10] B. Li, P. Wu, P. Abbeel, and J. Malik, “Interactive task planning with language models,” 2023

[11] Y. Zhu, J. Tremblay, S. Birchfield, and Y. Zhu, “Hierarchical planning for long-horizon manipulation with geometric and symbolic scene graphs,” 2021

[12] Q. Gu, A. Kuwajerwala, S. Morin, K. M. Jatavallabhula, B. Sen, A. Agarwal, C. Rivera, W. Paul, K. Ellis, R. Chellappa, C. Gan, C. M. de Melo, J. B. Tenenbaum, A. Torralba, F. Shkurti, and L. Paull, “Conceptgraphs: Open-vocabulary 3d scene graphs for perception and planning,” 2023

[13] S. Amiri, K. Chandan, and S. Zhang, “Reasoning with scene graphs for robot planning under partial observability,” 2022

[14] K. Rana, J. Haviland, S. Garg, J. Abou-Chakra, I. Reid, and N. Suenderhauf, “Sayplan: Grounding large language models using 3d scene graphs for scalable task planning,” in 7th Annual Conference on Robot Learning, 2023. [Online]. Available: https://openreview.net/forum?id=wMpOMO0Ss7a

[15] K. Valmeekam, M. Marquez, S. Sreedharan, and S. Kambhampati, “On the planning abilities of large language models-a critical investigation,” Advances in Neural Information Processing Systems, vol. 36, pp. 75 993–76 005, 2023

[16] R. Krishna, Y. Zhu, O. Groth, J. Johnson, K. Hata, J. Kravitz, S. Chen, Y. Kalantidis, L.-J. Li, D. A. Shamma, M. S. Bernstein, and F.-F. Li, “Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations,” Feb. 2016. [Online]. Available: http://arxiv.org/abs/1602.07332

[17] J. Johnson, A. Gupta, and L. Fei-Fei, “Image generation from scene graphs,” in 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2018, pp. 1219–1228

[18] X. Zhao, L. Wu, X. Chen, and B. Gong, “High-quality image generation from scene graphs with transformer,” in 2022 IEEE International Conference on Multimedia and Expo (ICME), 2022, pp. 1–6

[19] G. Mittal, S. Agrawal, A. Agarwal, S. Mehta, and T. Marwah, “Interactive image generation using scene graphs,” arXiv preprint arXiv:1905.03743, 2019

[20] S. Tripathi, A. Bhiwandiwalla, A. Bastidas, and H. Tang, “Using scene graph context to improve image generation,” arXiv preprint arXiv:1901.03762, 2019

[21] B. Zhao, L. Meng, W. Yin, and L. Sigal, “Image generation from layout,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 8584–8593

[22] L. Yang, Z. Huang, Y. Song, S. Hong, G. Li, W. Zhang, B. Cui, B. Ghanem, and M.-H. Yang, “Diffusion-based scene graph to image generation with masked contrastive pre-training,” arXiv preprint arXiv:2211.11138, 2022

[23] L. Gao, B. Wang, and W. Wang, “Image captioning with scene-graph based semantic concepts,” in Proceedings of the 2018 10th international conference on machine learning and computing, 2018, pp. 225–229

[24] Y. Zhong, L. Wang, J. Chen, D. Yu, and Y. Li, “Comprehensive image captioning via scene graph decomposition,” in Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XIV 16. Springer, 2020, pp. 211–229

[25] X. Yang, J. Peng, Z. Wang, H. Xu, Q. Ye, C. Li, M. Yan, F. Huang, Z. Li, and Y. Zhang, “Transforming visual scene graphs to image captions,” arXiv preprint arXiv:2305.02177, 2023

[26] C. Zhang, W.-L. Chao, and D. Xuan, “An empirical study on leveraging scene graphs for visual question answering,” arXiv preprint arXiv:1907.12133, 2019

[27] S. Lee, J.-W. Kim, Y. Oh, and J. H. Jeon, “Visual question answering over scene graph,” in 2019 First International Conference on Graph Computing (GC), 2019, pp. 45–50

[28] V. Damodaran, S. Chakravarthy, A. Kumar, A. Umapathy, T. Mitamura, Y. Nakashima, N. Garcia, and C. Chu, “Understanding the role of scene graphs in visual question answering,” arXiv preprint arXiv:2101.05479, 2021

[29] G. Sepulveda, J. C. Niebles, and A. Soto, “A deep learning based behavioral approach to indoor autonomous navigation,” in 2018 IEEE International Conference on Robotics and Automation (ICRA), 2018, pp. 4646–4653

[30] Z. Ravichandran, L. Peng, N. Hughes, J. D. Griffith, and L. Carlone, “Hierarchical representations and explicit memory: Learning effective navigation policies on 3d scene graphs using graph neural networks,” in 2022 International Conference on Robotics and Automation (ICRA), 2022, pp. 9272–9279

[31] Z. Ni, X.-X. Deng, C. Tai, X.-Y. Zhu, X. Wu, Y.-J. Liu, and L. Zeng, “Grid: Scene-graph-based instruction-driven robotic task planning,” 2023

[32] Y. Deng, Q. Sima, D. Guo, H. Liu, Y. Wang, and F. Sun, “Scene graph for embodied exploration in cluttered scenario,” 2023

[33] G. Zhai, X. Cai, D. Huang, Y. Di, F. Manhardt, F. Tombari, N. Navab, and B. Busam, “Sg-bot: Object rearrangement via coarse-to-fine robotic imagination on scene graphs,” 2023

[34] Z. Jiao, Y. Niu, Z. Zhang, S.-C. Zhu, Y. Zhu, and H. Liu, “Sequential Manipulation Planning on Scene Graph,” in 2022 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). Kyoto, Japan: IEEE, Oct. 2022, pp. 8203–8210. [Online]. Available: https://ieeexplore.ieee.org/document/9981735/

[35] H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar et al., “LLaMA: Open and efficient foundation language models,” arXiv preprint arXiv:2302.13971, 2023

[36] A. Q. Jiang, A. Sablayrolles, A. Mensch, C. Bamford, D. S. Chaplot, D. d. l. Casas, F. Bressand, G. Lengyel, G. Lample, L. Saulnier et al., “Mistral 7B,” arXiv preprint arXiv:2310.06825, 2023

[37] R. Anil, A. M. Dai, O. Firat, M. Johnson, D. Lepikhin, A. Passos, S. Shakeri, E. Taropa, P. Bailey, Z. Chen et al., “PaLM 2 technical report,” arXiv preprint arXiv:2305.10403, 2023

[38]

Gpt-4 technical report,

OpenAI, J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. Aleman, D. Almeida, J. Altenschmidt, S. Altman, S. Anadkat, R. Avila, I. Babuschkin, S. Balaji, V. Balcom, P. Baltescu, H. Bao, M. Bavarian, J. Belgum, I. Bello, J. Berdine, G. Bernadett-Shapiro, C. Berner, L. Bogdonoff, O. Boiko, M. Boyd, A.-L. Brakman, G. Brockman, T. Brooks, M. Bru...

work page 2024

[39] D. Surís, S. Menon, and C. Vondrick, “Vipergpt: Visual inference via python execution for reasoning,” 2023

[40] H. Liu, C. Li, Y. Li, and Y. J. Lee, “Improved baselines with visual instruction tuning,” arXiv preprint arXiv:2310.03744, 2023

[41] P. Pramanick, H. B. Barua, and C. Sarkar, “Decomplex: Task planning from complex natural instructions by a collocating robot,” in 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2020, pp. 6894–6901

[42] S. G. Venkatesh, R. Upadrashta, and B. Amrutur, “Translating natural language instructions to computer programs for robot manipulation,” in 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2021, pp. 1919–1926

[43] W. Huang, F. Xia, T. Xiao, H. Chan, J. Liang, P. Florence, A. Zeng, J. Tompson, I. Mordatch, Y. Chebotar, P. Sermanet, N. Brown, T. Jackson, L. Luu, S. Levine, K. Hausman, and B. Ichter, “Inner monologue: Embodied reasoning through planning with language models,” 2022

[44] S. Yao, J. Zhao, D. Yu, N. Du, I. Shafran, K. Narasimhan, and Y. Cao, “ReAct: Synergizing reasoning and acting in language models,” arXiv preprint arXiv:2210.03629, 2022

[45] A. Brohan, N. Brown, J. Carbajal, Y. Chebotar, X. Chen, K. Choromanski, T. Ding, D. Driess, A. Dubey, C. Finn et al., “RT-2: Vision-language-action models transfer web knowledge to robotic control,” arXiv preprint arXiv:2307.15818, 2023

[46] D. Driess, F. Xia, M. S. Sajjadi, C. Lynch, A. Chowdhery, B. Ichter, A. Wahid, J. Tompson, Q. Vuong, T. Yu et al., “PaLM-E: An embodied multimodal language model,” arXiv preprint arXiv:2303.03378, 2023

[47] F. Joublin, A. Ceravola, P. Smirnov, F. Ocker, J. Deigmoeller, A. Belardinelli, C. Wang, S. Hasler, D. Tanneberg, and M. Gienger, “Copal: Corrective planning of robot actions with large language models,” 2023

[48] S. McCallum, M. Taylor-Davies, S. V. Albrecht, and A. Suglia, “Is feedback all you need? leveraging natural language feedback in goal-conditioned reinforcement learning,” 2023

[49] Z. Liu, A. Bahety, and S. Song, “Reflect: Summarizing robot experiences for failure explanation and correction,” 2023

[50] G. Wang, Y. Xie, Y. Jiang, A. Mandlekar, C. Xiao, Y. Zhu, L. Fan, and A. Anandkumar, “Voyager: An open-ended embodied agent with large language models,” 2023

[51] OpenAI, “Openai (2023),” 2023. [Online]. Available: https://openai.com/index/gpt-4v-system-card/

[52]

Gemini: A Family of Highly Capable Multimodal Models

Gemini Team, R. Anil, S. Borgeaud, J.-B. Alayrac, J. Yu, R. Soricut, J. Schalkwyk, A. M. Dai, A. Hauth, K. Millican, D. Silver, M. Johnson, I. Antonoglou, J. Schrittwieser, A. Glaese, J. Chen, E. Pitler, T. Lillicrap, A. Lazaridou, O. Firat, J. Molloy, M. Isard, P. R. Barham, T. Hennigan, B. Lee, F. Viola, M. Reynolds, Y. Xu, R. Doherty, E. Collins, C. Meyer...

work page internal anchor Pith review Pith/arXiv arXiv 2024

[53] H. Liu, C. Li, Q. Wu, and Y. J. Lee, “Visual instruction tuning,” 2023

[54] github, “Github,” 2024. [Online]. Available: https://github.com/ollama/ollama