VAP-TAMP combines action knowledge, vision-language models for active view selection, and scene-graph reasoning to let robots perceive and resolve unforeseen execution-time situations during task and motion planning.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Robots often fail when something unexpected happens, such as a door jamming or an object falling. VAP-TAMP tries to fix this by letting the robot actively choose what to look at next using a vision-language model prompted by its own action knowledge. It then builds a scene graph to reason about both the high-level task and the low-level motions needed to recover. The system was tested on service tasks both in simulation and on a real mobile manipulator.
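The loop described above can be sketched as a minimal, hypothetical program: on an action failure, use the failed action's precondition and effect symbols to pick the most relevant camera view, ask a VLM-style assessor what it sees there, and fold the answer into a scene graph as relation triples. All names here (`select_view`, `handle_failure`, the view and action dictionaries) are illustrative assumptions, not the paper's actual API; the "VLM" is a stub.

```python
from dataclasses import dataclass, field

@dataclass
class SceneGraph:
    """Objects as nodes, (subject, relation, object) triples as edges."""
    edges: set = field(default_factory=set)

    def add(self, subj, rel, obj):
        self.edges.add((subj, rel, obj))

    def holds(self, subj, rel, obj):
        return (subj, rel, obj) in self.edges

def select_view(action, views):
    # Action knowledge heuristic: look at the regions the failed
    # action's preconditions and effects mention.
    relevant = set(action["preconditions"]) | set(action["effects"])
    return max(views, key=lambda v: len(relevant & set(v["covers"])))

def handle_failure(action, views, assess, graph):
    """On action failure: pick an informative view, ask the assessor
    (standing in for a prompted VLM) what it sees there, and fold the
    answer into the scene graph for replanning."""
    view = select_view(action, views)
    for subj, rel, obj in assess(view):
        graph.add(subj, rel, obj)
    return view

# Stub "VLM" that reports a fallen cup in whatever view it is shown.
def fake_assess(view):
    return [("cup", "fallen_on", "floor")]

action = {"preconditions": ["cup"], "effects": ["table"]}
views = [{"name": "door_cam", "covers": ["door"]},
         {"name": "table_cam", "covers": ["table", "cup"]}]
graph = SceneGraph()
chosen = handle_failure(action, views, fake_assess, graph)
print(chosen["name"])                            # table_cam
print(graph.holds("cup", "fallen_on", "floor"))  # True
```

In this toy version the view with the largest overlap with the action's symbols wins, so the table camera is chosen over the door camera; the real system presumably scores views with the VLM itself rather than by set overlap.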
Core claim
We develop a planning and situation-handling framework, called VAP-TAMP, that enables robots to actively perceive and address unforeseen situations during plan execution. VAP-TAMP leverages action knowledge to strategically prompt vision-language models for active view selection and situation assessment, while constructing and reasoning over scene graphs for integrated task and motion planning.
Load-bearing premise
That prompting vision-language models with action knowledge yields reliable situation assessments, and that scene graphs can be constructed and reasoned over in real time without excessive error propagation during recovery.
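The error-propagation half of this premise can be made concrete with a toy example, not drawn from the paper: if recovery is selected from scene-graph edges, a single wrong VLM-derived edge changes the recovery plan outright. The relation names and recovery actions below are hypothetical.

```python
def choose_recovery(edges):
    """Toy recovery selector keyed on scene-graph edges; one wrong
    edge from a misread view flips the chosen recovery action."""
    if ("door", "state", "jammed") in edges:
        return "push_door"
    if ("cup", "fallen_on", "floor") in edges:
        return "pick_up_cup"
    return "continue"

# Correct assessment leads to the right recovery...
print(choose_recovery({("cup", "fallen_on", "floor")}))  # pick_up_cup
# ...while a misreading (door jammed instead of cup fallen) yields a
# different, possibly harmful one.
print(choose_recovery({("door", "state", "jammed")}))    # push_door
```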
Read the original abstract
Current robots are capable of computing plans to accomplish complex tasks. However, real-world environments are inherently open and dynamic, and unforeseen situations frequently arise during plan execution, such as jamming doors and fallen objects on the floor. These situations may result from the robot's own action failures or from external disturbances, such as human activities. Detecting and handling such execution-time situations remains a significant challenge, limiting those robots' ability to achieve long-term autonomy. In this paper, we develop a planning and situation-handling framework, called VAP-TAMP, that enables robots to actively perceive and address unforeseen situations during plan execution. VAP-TAMP leverages action knowledge to strategically prompt vision-language models for active view selection and situation assessment, while constructing and reasoning over scene graphs for integrated task and motion planning. We evaluated VAP-TAMP using service tasks in simulation and on a mobile manipulation platform.
Editorial analysis
A structured set of objections, weighed in public.
Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.
Abstract-only review; no explicit free parameters, axioms, or invented entities are stated. The framework implicitly assumes reliable VLM outputs and tractable scene-graph construction, but these are not formalized.
pith-pipeline@v0.9.0 · 5483 in / 1038 out tokens · 46977 ms · 2026-05-07T12:21:42.476133+00:00 · methodology