AERMANI-VLM: Structured Prompting and Reasoning for Aerial Manipulation with Vision Language Models

Avirup Das; Rishabh Dev Yadav; Saksham Gupta; Sarthak Mishra; Spandan Roy; Wei Pan

arxiv: 2511.01472 · v2 · pith:AJ7RNTK2new · submitted 2025-11-03 · 💻 cs.RO


Sarthak Mishra, Rishabh Dev Yadav, Avirup Das, Saksham Gupta, Wei Pan, Spandan Roy

classification 💻 cs.RO

keywords: reasoning, language, aerial, aermani-vlm, framework, natural, action, commands


The rapid progress of vision-language models (VLMs) has sparked growing interest in robotic control, where natural language can express the operation goals while visual feedback links perception to action. However, directly deploying VLM-driven policies on aerial manipulators remains unsafe and unreliable since the generated actions are often inconsistent, hallucination-prone, and dynamically infeasible for flight. In this work, we present AERMANI-VLM, the first framework to adapt pretrained VLMs for aerial manipulation by separating high-level reasoning from low-level control, without any task-specific fine-tuning. Our framework encodes natural language instructions, task context, and safety constraints into a structured prompt that guides the model to generate a step-by-step reasoning trace in natural language. This reasoning output is used to select from a predefined library of discrete, flight-safe skills, ensuring interpretable and temporally consistent execution. By decoupling symbolic reasoning from physical action, AERMANI-VLM mitigates hallucinated commands and prevents unsafe behavior, enabling robust task completion. We validate the framework in both simulation and hardware on diverse multi-step pick-and-place tasks, demonstrating strong generalization to previously unseen commands, objects, and environments.
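
The pipeline described in the abstract (structured prompt, natural-language reasoning trace, selection from a fixed skill library) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the skill names, the prompt layout, the `REASONING:`/`SKILL:` output format, and the functions `build_prompt` and `select_skill` are all assumptions made for illustration, and no real VLM is called. The point it shows is that the model's free-form output never reaches the controller directly; only a name that matches the predefined library is accepted.

```python
from dataclasses import dataclass

# Hypothetical skill names; the abstract does not list the actual library.
SKILL_LIBRARY = ("takeoff", "move_to", "grasp", "release", "land")


@dataclass
class Step:
    reasoning: str  # natural-language trace, kept for interpretability
    skill: str      # one entry of SKILL_LIBRARY


def build_prompt(instruction, context, constraints):
    """Assemble instruction, task context, and safety constraints into one prompt."""
    lines = [
        "INSTRUCTION: " + instruction,
        "CONTEXT: " + context,
        "SAFETY CONSTRAINTS:",
    ]
    lines += ["- " + c for c in constraints]
    lines.append("ALLOWED SKILLS: " + ", ".join(SKILL_LIBRARY))
    # Assumed output format, chosen here so the reply is easy to parse.
    lines.append("Answer as 'REASONING: <text>' then 'SKILL: <name>'.")
    return "\n".join(lines)


def select_skill(model_output):
    """Parse the model reply; return a Step only if the skill is in the library."""
    reasoning, skill = "", None
    for line in model_output.splitlines():
        key, _, value = line.partition(":")
        key = key.strip().upper()
        if key == "REASONING":
            reasoning = value.strip()
        elif key == "SKILL":
            skill = value.strip().lower()
    if skill not in SKILL_LIBRARY:
        return None  # missing or hallucinated command: reject, do not execute
    return Step(reasoning, skill)
```

In this sketch, a reply such as `SKILL: barrel_roll` yields `None`, so a caller could re-prompt or hold position instead of passing an unknown command to the flight controller.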


Forward citations

Cited by 4 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Safety in Embodied AI: A Survey of Risks, Attacks, and Defenses
cs.CR 2026-03 unverdicted novelty 6.0

The survey organizes over 400 papers on embodied AI safety into a multi-level taxonomy and flags overlooked issues such as fragile multimodal fusion and unstable planning under jailbreaks.

Vision-and-Language Navigation for UAVs: Progress, Challenges, and a Research Roadmap
cs.RO 2026-04 unverdicted novelty 4.0

A survey of UAV vision-and-language navigation that establishes a methodological taxonomy, reviews resources and challenges, and proposes a forward-looking research roadmap.

Vision-Language Navigation for Aerial Robots: Towards the Era of Large Language Models
cs.RO 2026-04 unverdicted novelty 4.0

This survey organizes aerial vision-language navigation methods into five architectural categories, critically reviews evaluation infrastructure, and synthesizes seven open problems for LLM/VLM integration.

A Universal Large Language Model -- Drone Command and Control Interface
cs.RO 2026-01 unverdicted novelty 4.0

A universal LLM-to-drone interface is implemented via the Model Context Protocol (MCP) and MAVLink, demonstrated with real UAV flight control and simulated flights using live map data.