CoSPlan: Corrective Sequential Planning via Scene Graph Incremental Updates
Vision Language Models (VLMs) have shown promising planning capabilities, yet their success remains confined to the text domain, leaving visual decision-making relatively underexplored. To address this gap, we introduce the Corrective Sequential Planning (CoSPlan) benchmark, in which VLMs must plan a sequence of visual actions that leads from an initial scene to a target scene. CoSPlan evaluates a model's ability to imagine and execute the coherent set of visual steps required to reach the goal (Step Completion). To prevent shortcuts that simply describe the final scene, we insert an erroneous action into the decision sequence, which must be detected (Error Detection) and corrected to reach the goal; this demands a deeper understanding of the task. CoSPlan spans four tasks: maze navigation, block re-arrangement, image reconstruction, and object re-organization. Even with advanced reasoning strategies such as Chain-of-Thought and Scene Graphs, VLMs struggle on CoSPlan, although they still show promising performance in the text domain. To close this gap, we propose Scene Graph Incremental updates (SGI), a novel training-free method that transforms images into 'textual' scene graphs, enabling step-by-step reasoning through iterative scene graph refinement. SGI yields an average improvement of ~4.4% on CoSPlan, with generalization to PlanBench and VQA. A link for solving the puzzles is available on the project page.
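To make the idea of iterative scene graph refinement concrete, the following is a minimal sketch under stated assumptions, not the SGI method itself. It assumes a toy block re-arrangement scene encoded as `(subject, relation, object)` triples, a hypothetical `(block, source, destination)` action format, and a hand-written validity rule standing in for the model's error detection. In the benchmark setting the scene graph would be produced from images by a VLM; here it is written by hand.

```python
from typing import FrozenSet, List, Optional, Tuple

Triple = Tuple[str, str, str]      # (subject, relation, object)
SceneGraph = FrozenSet[Triple]
Move = Tuple[str, str, str]        # hypothetical action: (block, source, destination)


def is_clear(graph: SceneGraph, name: str) -> bool:
    """True if nothing sits on `name`; the table is always treated as clear."""
    if name == "table":
        return True
    return not any(rel == "on" and obj == name for _, rel, obj in graph)


def is_valid(graph: SceneGraph, move: Move) -> bool:
    """Assumed rule: the block is on its source, and block and destination are clear."""
    block, src, dst = move
    return (block, "on", src) in graph and is_clear(graph, block) and is_clear(graph, dst)


def apply_move(graph: SceneGraph, move: Move) -> SceneGraph:
    """Incremental update: drop the old 'on' edge and add the new one."""
    block, src, dst = move
    return (graph - {(block, "on", src)}) | {(block, "on", dst)}


def rollout(initial: SceneGraph, moves: List[Move]) -> Tuple[SceneGraph, Optional[int]]:
    """Apply moves one at a time; return the last graph and the index of the
    first invalid move (None if every move was applied)."""
    graph = initial
    for step, move in enumerate(moves):
        if not is_valid(graph, move):
            return graph, step
        graph = apply_move(graph, move)
    return graph, None


initial: SceneGraph = frozenset(
    {("A", "on", "table"), ("B", "on", "A"), ("C", "on", "table")}
)
good_plan = [("B", "A", "C"), ("A", "table", "B")]
bad_plan = [("A", "table", "C"), ("B", "A", "C")]   # A is still covered by B

final_graph, error_step = rollout(initial, good_plan)
_, bad_step = rollout(initial, bad_plan)
print(sorted(final_graph), error_step, bad_step)
```

Because each step rewrites only the edges an action touches, the intermediate graphs stay small and comparable, and the first step at which the graph and the action disagree marks a candidate erroneous action.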