arxiv: 2512.10342 · v3 · pith:QI642LN6new · submitted 2025-12-11 · 💻 cs.CV

CoSPlan: Corrective Sequential Planning via Scene Graph Incremental Updates

classification 💻 cs.CV
keywords scene · cosplan · graph · planning · visual · vlms · addressing · corrective

Vision Language Models (VLMs) have shown promising planning capabilities, yet their success remains confined to the text domain, leaving visual decision-making relatively underexplored. To address this gap, we introduce the Corrective Sequential Planning (CoSPlan) benchmark, in which VLMs must plan a sequence of visual actions that leads from an initial scene to a target scene. CoSPlan evaluates a model's ability to imagine and execute the coherent set of visual steps required to reach the goal (Step Completion). To prevent shortcuts that simply describe the final scene, we insert an erroneous action into the sequence, which the model must detect (Error Detection) and correct in order to reach the goal; this demands a deeper understanding of the task. CoSPlan spans four tasks: maze navigation, block re-arrangement, image reconstruction, and object re-organization. Despite using advanced reasoning strategies such as Chain-of-Thought and Scene Graphs, VLMs struggle on CoSPlan while still showing promising performance in the text domain. To close this gap, we propose Scene Graph Incremental updates (SGI), a novel training-free method that transforms images into 'textual' scene graphs and enables step-by-step reasoning through iterative scene graph refinement. SGI yields an average improvement of ~4.4% on CoSPlan and generalizes to PlanBench and VQA. A link for solving the puzzles is available on the project page.
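To make the idea of updating a textual scene graph one action at a time more concrete, here is a minimal Python sketch on a toy block re-arrangement example. It is not the paper's SGI implementation: in the paper a VLM derives the scene graphs from images, whereas here the graphs, the `Action` type, the `distance` measure, and the rule "flag the first step that moves the graph further from the goal" are all hypothetical choices made for illustration.

```python
from typing import FrozenSet, NamedTuple, Optional, Sequence, Tuple

# A scene graph is modelled as a set of (subject, relation, object) triples.
Triple = Tuple[str, str, str]
Graph = FrozenSet[Triple]


class Action(NamedTuple):
    """Hypothetical action format: triples to delete and triples to add."""
    name: str
    remove: FrozenSet[Triple]
    add: FrozenSet[Triple]


def apply_action(graph: Graph, action: Action) -> Graph:
    """Incrementally update the graph instead of rebuilding it from scratch."""
    return (graph - action.remove) | action.add


def distance(graph: Graph, goal: Graph) -> int:
    """Number of triples on which the two graphs disagree (assumed metric)."""
    return len(graph ^ goal)


def first_erroneous_step(initial: Graph, goal: Graph,
                         actions: Sequence[Action]) -> Optional[int]:
    """Return the index of the first action that increases the distance
    to the goal graph, or None if no action does (assumed heuristic)."""
    graph = initial
    for index, action in enumerate(actions):
        updated = apply_action(graph, action)
        if distance(updated, goal) > distance(graph, goal):
            return index
        graph = updated
    return None


# Toy example: stack A on B on C, starting with all blocks on the table.
initial: Graph = frozenset({("A", "on", "table"), ("B", "on", "table"),
                            ("C", "on", "table")})
goal: Graph = frozenset({("A", "on", "B"), ("B", "on", "C"),
                         ("C", "on", "table")})

plan = [
    Action("move B onto C",
           frozenset({("B", "on", "table")}), frozenset({("B", "on", "C")})),
    # Injected erroneous action: undoes the previous step.
    Action("move B back to table",
           frozenset({("B", "on", "C")}), frozenset({("B", "on", "table")})),
    Action("move A onto B",
           frozenset({("A", "on", "table")}), frozenset({("A", "on", "B")})),
]

error_index = first_erroneous_step(initial, goal, plan)
corrected_plan = [a for i, a in enumerate(plan) if i != error_index]
```

In this toy case `error_index` identifies the injected second action, and applying `corrected_plan` to `initial` reaches `goal`. A distance-based check like this is only a stand-in: valid plans can require steps that temporarily move away from the goal, which is one reason the benchmark relies on model reasoning over the graph and not a fixed rule.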
