CoSPlan: Corrective Sequential Planning via Scene Graph Incremental Updates

Akash Kumar; Priyank Pathak; Shresth Grover; Yogesh S Rawat


Vision Language Models (VLMs) have shown promising planning capabilities, yet their success remains confined to the text domain, leaving visual decision-making relatively underexplored. To address this gap, we introduce the Corrective Sequential Planning (CoSPlan) benchmark, in which VLMs must plan a sequence of visual actions that leads from an initial scene to a target scene. CoSPlan evaluates a model's ability to imagine and execute a coherent set of visual steps required to reach the goal (Step Completion). To prevent shortcuts that simply describe the final scene, we introduce an erroneous action into the decision-making sequence, which the model must detect (Error Detection) and correct to reach the goal; this demands a deeper understanding of the task. CoSPlan spans four tasks: maze navigation, block re-arrangement, image reconstruction, and object re-organization. Despite using advanced reasoning strategies such as Chain-of-Thought and Scene Graphs, VLMs struggle on CoSPlan while still showing promising performance in the text domain. To close this gap, we propose Scene Graph Incremental updates (SGI), a novel training-free method that transforms images into 'textual' scene graphs, enabling step-by-step reasoning through iterative scene graph refinement. SGI yields an average improvement of about 4.4% on CoSPlan, with generalization to PlanBench and VQA. A link for solving the puzzles is available on the project page.
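
The idea of refining a textual scene graph one action at a time can be illustrated with a toy block re-arrangement example. The sketch below is not the authors' SGI implementation: the triple format, the `apply_action` and `rollout` helpers, and the `(block, destination)` action encoding are all assumptions made for illustration. It only shows how a graph might be updated incrementally after each action and compared against a target graph.

```python
from typing import FrozenSet, List, Tuple

# Hypothetical textual scene graph: a set of (subject, relation, object) triples.
Triple = Tuple[str, str, str]
SceneGraph = FrozenSet[Triple]
# Hypothetical action encoding: move `block` on top of `destination`.
Action = Tuple[str, str]


def apply_action(graph: SceneGraph, action: Action) -> SceneGraph:
    """Incremental update: only the 'on' triple of the moved block changes."""
    block, dest = action
    kept = {t for t in graph if not (t[0] == block and t[1] == "on")}
    kept.add((block, "on", dest))
    return frozenset(kept)


def rollout(initial: SceneGraph, actions: List[Action]) -> List[SceneGraph]:
    """Return the scene graph after every step, starting with the initial one."""
    graphs = [initial]
    for action in actions:
        graphs.append(apply_action(graphs[-1], action))
    return graphs


initial = frozenset({("A", "on", "table"), ("B", "on", "A"), ("C", "on", "table")})
goal = frozenset({("A", "on", "table"), ("B", "on", "C"), ("C", "on", "table")})

states = rollout(initial, [("B", "C")])
reached = states[-1] == goal  # compare the final graph with the target graph
```

Keeping every intermediate graph in `states` means a plan that fails to reach the goal can be inspected step by step, which is one plausible way to localize an erroneous action in this toy setting.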
