See&Say: Vision Language Guided Safe Zone Detection for Autonomous Package Delivery Drones

Mahyar Ghazanfari; Peng Wei

arxiv: 2604.13292 · v1 · submitted 2026-04-14 · 💻 cs.CV

Pith reviewed 2026-05-10 15:46 UTC · model grok-4.3

classification 💻 cs.CV

keywords drone delivery · safe zone detection · vision-language model · depth fusion · hazard detection · safety map · urban environments · autonomous systems

The pith

Vision-language guidance fuses depth and detection to create reliable safety maps for drone package drops.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes See&Say, a framework that combines geometric cues from monocular depth with semantic information from open-vocabulary detection, all refined iteratively by a vision-language model. This setup addresses the difficulty of picking safe drop zones for autonomous drones in cluttered urban areas where primary landing spots may be blocked by people or objects. If the integration works as described, drones gain the ability to spot hazards dynamically and suggest backup zones during the final approach phase. A reader would care because current geometry-only or segmentation-only methods often fail to reason about context in changing environments with moving activity.

Core claim

See&Say fuses monocular depth gradients with open-vocabulary detection masks to produce safety maps, while the vision-language model dynamically adjusts object category prompts and refines hazard detection across time. When the primary drop area is occupied or unsafe, the system identifies alternative candidate zones. On a curated dataset of urban delivery scenarios with moving objects and human activity, the approach records the highest accuracy and IoU for safety map prediction and stronger results in alternative zone evaluation across thresholds compared with baselines.

What carries the argument

Fusion of monocular depth gradients with open-vocabulary detection masks, guided by a vision-language model for iterative prompt adjustment and hazard refinement.
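A minimal sketch of the fusion step named above, written in plain Python so it runs without extra packages. The paper text quoted on this page does not state the exact fusion rule, so the logical OR between a thresholded depth gradient and the detector mask, the function names, and the `grad_threshold` value are all assumptions made for illustration. The VLM prompt-refinement loop is not modeled here.

```python
from typing import List

Grid = List[List[float]]   # per-pixel monocular depth (hypothetical units)
Mask = List[List[bool]]    # per-pixel boolean mask


def depth_gradient_magnitude(depth: Grid) -> Grid:
    """Finite-difference gradient magnitude; borders use one-sided differences."""
    height, width = len(depth), len(depth[0])
    grad = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            x0, x1 = max(x - 1, 0), min(x + 1, width - 1)
            y0, y1 = max(y - 1, 0), min(y + 1, height - 1)
            gx = (depth[y][x1] - depth[y][x0]) / max(x1 - x0, 1)
            gy = (depth[y1][x] - depth[y0][x]) / max(y1 - y0, 1)
            grad[y][x] = (gx * gx + gy * gy) ** 0.5
    return grad


def fuse_unsafe_mask(depth: Grid, hazard_mask: Mask,
                     grad_threshold: float = 0.5) -> Mask:
    """Assumed rule: a pixel is unsafe if the depth is steep there
    OR the open-vocabulary detector flagged it as a hazard."""
    grad = depth_gradient_magnitude(depth)
    return [
        [g > grad_threshold or bool(m) for g, m in zip(grad_row, mask_row)]
        for grad_row, mask_row in zip(grad, hazard_mask)
    ]
```

In this sketch a flat, unflagged region stays safe, while either a depth discontinuity or a detected object is enough to mark a pixel unsafe. A real system would likely weight or smooth the two cues rather than OR them.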

If this is right

Drones can produce more accurate safety maps for package drop decisions in cluttered settings.
Alternative drop zones become available when the primary pad is occupied or unsafe.
Performance holds across multiple evaluation thresholds for zone selection.
The final delivery phase gains robustness under time-varying conditions.
Integrated semantic and geometric reasoning outperforms isolated geometry or segmentation approaches.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same fusion pattern could apply to ground robots needing to choose safe stopping spots in crowds.
Open-vocabulary detection reduces dependence on hand-curated lists of hazards.
Real-time versions might feed directly into onboard flight controllers for live replanning.
Dataset collection focused on moving urban elements could serve as a benchmark for related perception tasks.

Load-bearing premise

The vision-language model can reliably and dynamically adjust object category prompts and refine hazard detection across time in dynamic urban conditions with moving objects and human activities.

What would settle it

A sequence of real urban drone footage in which moving pedestrians or changing lighting cause the generated safety map to flag an actually safe zone as hazardous or miss a clear hazard, yielding accuracy and IoU no better than depth-only or segmentation-only baselines.

Figures

Figures reproduced from arXiv: 2604.13292 by Mahyar Ghazanfari, Peng Wei.

**Figure 1.** Overview of the proposed See&Say framework. The system takes batches of five RGB frames and corresponding monocular depth maps as input. Depth gradients provide geometric cues for flatness and obstacle detection, while DINO-X produces open-vocabulary semantic hazard masks. These initial maps are fused to form a preliminary safety overlay. A VLM refines detection prompts using temporal RGB–depth context, im…

**Figure 2.** The VLM refines the object categories used by DINO-X

**Figure 3.** The second VLM agent in See&Say incorporates human preferences into package drop zone selection, operating after the initial safety map is generated by the first VLM agent.
… where w, h are the bounding-box sides; otherwise we use a default r. For each candidate c, let S_c be its disk support and B^{\text{final}}_t the final unsafe mask. We define the safe ratio \mathrm{safe}(c) = 1 - \frac{1}{|S_c|} \sum_{(x,…

**Figure 4.** Qualitative comparison of pipeline stages across three scenes. Columns: (a) RGB input, (b) monocular depth, (c) initial DINOx …

**Figure 5.** ROC curves (top row) and Precision–Recall curves (bottom row) shown across thresholds from left to right. True positive rate …

**Figure 6.** Score distributions for human preference evaluation
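The safe-ratio definition that spilled into the Figure 3 caption is cut off mid-formula. The sketch below shows one plausible reading, under the explicit assumption that the truncated sum runs over the pixels (x, y) of the disk support S_c and adds up the final unsafe mask there. The radius rule is also truncated in the source, so `radius` is simply passed in. All function names are hypothetical.

```python
from typing import List, Tuple

Mask = List[List[bool]]  # True = unsafe pixel in the final mask


def disk_support(cx: int, cy: int, radius: int,
                 width: int, height: int) -> List[Tuple[int, int]]:
    """Pixels inside the image whose centre lies within `radius` of (cx, cy)."""
    return [
        (x, y)
        for y in range(height)
        for x in range(width)
        if (x - cx) ** 2 + (y - cy) ** 2 <= radius ** 2
    ]


def safe_ratio(unsafe_mask: Mask, cx: int, cy: int, radius: int) -> float:
    """Assumed completion: 1 minus the fraction of unsafe pixels in the disk."""
    height, width = len(unsafe_mask), len(unsafe_mask[0])
    support = disk_support(cx, cy, radius, width, height)
    if not support:
        raise ValueError("candidate disk does not overlap the image")
    unsafe = sum(1 for x, y in support if unsafe_mask[y][x])
    return 1.0 - unsafe / len(support)
```

Under this reading a candidate scores 1.0 when its disk is entirely clear and 0.0 when it is entirely covered by the unsafe mask, which would make it straightforward to compare candidates against a threshold.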

Abstract

Autonomous drone delivery systems are rapidly advancing, but ensuring safe and reliable package drop-offs remains highly challenging in cluttered urban and suburban environments where accurately identifying suitable package drop zones is critical. Existing approaches typically rely on either geometry-based analysis or semantic segmentation alone, but these methods lack the integrated semantic reasoning required for robust decision-making. To address this gap, we propose See&Say, a novel framework that combines geometric safety cues with semantic perception, guided by a Vision-Language Model (VLM) for iterative refinement. The system fuses monocular depth gradients with open-vocabulary detection masks to produce safety maps, while the VLM dynamically adjusts object category prompts and refines hazard detection across time, enabling reliable reasoning under dynamic conditions during the final delivery phase. When the primary drop-pad is occupied or unsafe, the proposed See&Say also identifies alternative candidate zones for package delivery. We curated a dataset of urban delivery scenarios with moving objects and human activities to evaluate the approach. Experimental results show that See&Say outperforms all baselines, achieving the highest accuracy and IoU for safety map prediction as well as superior performance in alternative drop zone evaluation across multiple thresholds. These findings highlight the promise of VLM-guided segmentation-depth fusion for advancing safe and practical drone-based package delivery.
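For readers unfamiliar with the two safety-map metrics named in the abstract, the sketch below computes pixel accuracy and IoU between a predicted and a ground-truth binary mask. Which class the paper treats as positive (safe or unsafe) is not stated on this page; here `True` is assumed to be the positive class, and the function name is hypothetical.

```python
from typing import List, Tuple

Mask = List[List[bool]]


def accuracy_and_iou(pred: Mask, truth: Mask) -> Tuple[float, float]:
    """Pixel accuracy and intersection-over-union for the True class."""
    tp = fp = fn = tn = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            if p and t:
                tp += 1
            elif p and not t:
                fp += 1
            elif not p and t:
                fn += 1
            else:
                tn += 1
    total = tp + fp + fn + tn
    if total == 0:
        raise ValueError("empty masks")
    accuracy = (tp + tn) / total
    union = tp + fp + fn
    # Convention chosen here: two empty masks agree perfectly.
    iou = tp / union if union else 1.0
    return accuracy, iou
```

Accuracy rewards agreement on every pixel, including the often dominant background, whereas IoU ignores true negatives, so the two numbers can diverge sharply on scenes with small hazards.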

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The manuscript proposes See&Say, a framework for identifying safe package drop zones for autonomous delivery drones in cluttered urban environments. It fuses monocular depth gradients with open-vocabulary segmentation masks and uses a Vision-Language Model (VLM) for iterative prompt adjustment and hazard refinement across time to handle dynamic conditions such as moving objects and human activity. When the primary zone is unsafe, the system identifies alternative candidate zones. The authors curate a custom dataset of urban scenarios and report that See&Say outperforms all baselines on accuracy and IoU for safety map prediction as well as on alternative-zone metrics across multiple thresholds.

Significance. If the empirical claims can be substantiated with rigorous validation, the work offers a practical integration of geometric cues, open-vocabulary detection, and VLM reasoning for safety-critical drone decisions. The emphasis on dynamic urban conditions and alternative-zone fallback addresses a concrete deployment gap in autonomous delivery systems.

major comments (3)

[§4] §4 (Experimental Results): The abstract asserts that See&Say achieves the highest accuracy and IoU for safety map prediction plus superior alternative-zone performance, yet supplies no baseline definitions, dataset size, number of sequences, error bars, or statistical tests. This absence prevents assessment of the central outperformance claim.
[§3 and §4] §3 (Method) and §4 (Experiments): No ablation is reported that isolates the VLM iterative refinement and dynamic prompt adjustment from the underlying depth-gradient + open-vocabulary mask fusion. Without this isolation, especially on sequences containing motion, it remains unclear whether the reported gains are attributable to the VLM component emphasized in the abstract.
[§4] §4 (Experiments): The manuscript provides no quantitative analysis of VLM prompt stability, false-negative reduction, or failure cases on moving objects and human activities, which are the exact conditions cited as motivation for the VLM-guided approach.
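One hedged way to make the "prompt stability" raised in the last comment measurable is the mean Jaccard similarity between the prompt lists of consecutive frame batches. This metric is an illustrative suggestion, not something the paper or the referee report defines, and the function names are hypothetical.

```python
from typing import List, Sequence


def jaccard(a: Sequence[str], b: Sequence[str]) -> float:
    """Set overlap between two prompt lists; two empty lists count as identical."""
    set_a, set_b = set(a), set(b)
    if not set_a and not set_b:
        return 1.0
    return len(set_a & set_b) / len(set_a | set_b)


def prompt_stability(prompt_lists: List[Sequence[str]]) -> float:
    """Mean Jaccard similarity over consecutive prompt lists (1.0 = never changes)."""
    if len(prompt_lists) < 2:
        return 1.0
    scores = [jaccard(p, q) for p, q in zip(prompt_lists, prompt_lists[1:])]
    return sum(scores) / len(scores)
```

A value near 1.0 would indicate that the VLM rarely rewrites its category list, while a low value would flag oscillating prompts. Low stability is not necessarily bad in a changing scene, so the number is best read alongside the false-negative analysis the comment also asks for.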

minor comments (1)

[Abstract] Abstract: Consider adding one sentence on the specific VLM backbone and the number of baselines compared to give readers immediate context for the claimed superiority.

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed comments. We address each major point below and will incorporate the suggested improvements in the revised manuscript to strengthen the empirical validation.

Referee: [§4] §4 (Experimental Results): The abstract asserts that See&Say achieves the highest accuracy and IoU for safety map prediction plus superior alternative-zone performance, yet supplies no baseline definitions, dataset size, number of sequences, error bars, or statistical tests. This absence prevents assessment of the central outperformance claim.

Authors: We agree that the current presentation lacks sufficient detail for rigorous evaluation. In the revised manuscript we will expand §4 to explicitly define all baselines, report the exact size of the curated urban dataset (including number of scenarios, sequences, and frames), provide error bars from multiple runs or cross-validation, and include statistical significance tests (e.g., paired t-tests or Wilcoxon tests) comparing See&Say against baselines. The abstract will be updated if necessary to reference these additions.
revision: yes

Referee: [§3 and §4] §3 (Method) and §4 (Experiments): No ablation is reported that isolates the VLM iterative refinement and dynamic prompt adjustment from the underlying depth-gradient + open-vocabulary mask fusion. Without this isolation, especially on sequences containing motion, it remains unclear whether the reported gains are attributable to the VLM component emphasized in the abstract.

Authors: We acknowledge the value of isolating the VLM contribution. We will add a dedicated ablation study in the revised §4 that compares the full See&Say pipeline (with VLM iterative refinement and dynamic prompt adjustment) against the base depth-gradient + open-vocabulary mask fusion without the VLM. Results will be reported separately on static and motion-containing sequences to quantify the incremental benefit of the VLM component.
revision: yes

Referee: [§4] §4 (Experiments): The manuscript provides no quantitative analysis of VLM prompt stability, false-negative reduction, or failure cases on moving objects and human activities, which are the exact conditions cited as motivation for the VLM-guided approach.

Authors: We agree that quantitative characterization of the VLM's role under dynamic conditions is needed. In the revision we will add metrics for prompt stability across frames, quantitative false-negative reduction on moving objects and humans, and a breakdown of failure cases with examples drawn from the dataset. These will be presented in §4 alongside the existing results.
revision: yes
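The paired t-test mentioned in the first response can be sketched with the Python standard library alone. The example computes only the t statistic and degrees of freedom for per-sequence scores of two methods. Obtaining a p-value would need a t-distribution table or a statistics package. The function name is hypothetical, and any scores passed in should be real per-sequence measurements, none of which are given on this page.

```python
import math
from statistics import mean, stdev
from typing import Sequence, Tuple


def paired_t_statistic(scores_a: Sequence[float],
                       scores_b: Sequence[float]) -> Tuple[float, int]:
    """Return (t, degrees_of_freedom) for paired samples a and b."""
    if len(scores_a) != len(scores_b):
        raise ValueError("paired samples must have equal length")
    if len(scores_a) < 2:
        raise ValueError("need at least two pairs")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    spread = stdev(diffs)  # sample standard deviation of the differences
    if spread == 0:
        raise ValueError("differences have zero variance; t is undefined")
    t_value = mean(diffs) / (spread / math.sqrt(len(diffs)))
    return t_value, len(diffs) - 1
```

Pairing matters here because both methods would be scored on the same sequences: the test looks at per-sequence differences, so scene difficulty that affects both methods equally cancels out.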

Circularity Check

0 steps flagged

No significant circularity

The manuscript describes an applied vision-language system for drone drop-zone detection that fuses monocular depth with open-vocabulary masks and uses a VLM for prompt refinement. No equations, derivations, fitted parameters renamed as predictions, or self-citation chains appear in the provided text. All performance claims are empirical (accuracy, IoU, alternative-zone detection) evaluated on a curated dataset; none reduce by construction to the inputs or to prior self-authored results. The work is self-contained as a descriptive engineering contribution.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The approach rests on domain assumptions about VLM capabilities rather than new free parameters or invented entities; no quantitative fitting or ad-hoc constants are described.

axioms (1)

domain assumption: Vision-language models can effectively reason about scene safety and dynamically adjust prompts for hazard detection in real time.
Invoked when describing VLM guidance for iterative refinement during the final delivery phase.

pith-pipeline@v0.9.0 · 5523 in / 1209 out tokens · 37767 ms · 2026-05-10T15:46:01.794703+00:00 · methodology

Reference graph

Works this paper leans on

42 extracted references · 42 canonical work pages · 2 internal anchors

[1] Sakineh Abdollahzadeh, Pier-Luc Proulx, Mohand Said Allili, and Jean-François Lapointe. Safe landing zones detection for UAVs using deep regression. In 2022 19th Conference on Robots and Vision (CRV), pages 213–218. IEEE,
[2] Simon Bultmann, Jan Quenzel, and Sven Behnke. Real-time multi-modal semantic fusion on unmanned aerial vehicles. In 2021 European Conference on Mobile Robots (ECMR), pages 1–8. IEEE, 2021.
[3] Yaru Cao, Zhijian He, Lujia Wang, Wenguan Wang, Yixuan Yuan, Dingwen Zhang, Jinglin Zhang, Pengfei Zhu, Luc Van Gool, Junwei Han, et al. Visdrone-det2021: The vision meets drone object detection challenge results. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 2847–2854, 2021.
[4] Lyujie Chen, Xiaming Yuan, Yao Xiao, Yiding Zhang, and Jihong Zhu. Robust autonomous landing of UAV in non-cooperative environments based on dynamic time camera-lidar fusion. arXiv:2011.13761, 2020.
[5] Tianheng Cheng, Lin Song, Yixiao Ge, Wenyu Liu, Xinggang Wang, and Ying Shan. Yolo-world: Real-time open-vocabulary object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 16901–16911, 2024.
[6] Julio de la Torre-Vanegas, Miguel Soriano-Garcia, Israel Becerra, and Diego Mercado-Ravell. Vision-based risk aware emergency landing for UAVs in complex urban environments. arXiv:2505.20423, 2025.
[7] Emanuele dos Santos Cardoso, Vinícius Pacheco Bacheti, and Mário Sarcinelli-Filho. Package delivery based on the leader-follower control paradigm for multirobot systems. In International Conference on Unmanned Aircraft Systems (ICUAS), pages 775–781. IEEE, 2023.
[8] Michael Fonder and Marc Van Droogenbroeck. Mid-air: A multi-modal dataset for extremely low altitude drone flights. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, pages 553–562, 2019.
[9] Institute of Computer Graphics and Vision (ICG), Graz University of Technology (TU Graz). Semantic drone dataset (dronedataset). [Online]. Available: http://dronedataset.icg.tugraz.at/, 2019.
[10] Glenn Jocher, Ayush Chaurasia, and Jing Qiu. Ultralytics YOLOv8, 2023.
[11] Hyunjun Kim, Dahye Lee, Sungjune Park, and Yong Man Ro. Weather-aware drone-view object detection via environmental context understanding. In 2024 IEEE International Conference on Image Processing (ICIP), pages 549–555, 2024.
[12] Joe Kinahan and Alan F Smeaton. Image segmentation to identify safe landing zones for unmanned aerial vehicles. Irish Conference on Artificial Intelligence and Cognitive Science (AICS), pages 235–247, 2021.
[13] J. Richard Landis and Gary G. Koch. The measurement of observer agreement for categorical data. Biometrics, 33(1):159–174, 1977.
[14] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft COCO: Common objects in context. In European Conference on Computer Vision (ECCV), pages 740–755. Springer, 2014.
[15] Fei Liu, Jiayao Shan, Binyu Xiong, and Zheng Fang. A real-time and multi-sensor-based landing area recognition system for UAVs. Drones, 6(5):118, 2022.
[16] Alina Marcu, Dragos Costea, Vlad Licaret, Mihai Pîrvu, Emil Slusanschi, and Marius Leordeanu. SafeUAV: Learning to estimate depth and safe landing areas for UAVs from synthetic data. In Proceedings of the European Conference on Computer Vision (ECCV) Workshops, pages 43–58, 2018.
[17] Tilemachos Mitroudas, Vasiliki Balaska, Athanasios Psomoulis, and Antonios Gasteratos. Light-weight approach for safe landing in populated areas. In 2024 IEEE International Conference on Robotics and Automation (ICRA), pages 10027–10032, 2024.
[18] Alexander Moortgat-Pick, Marie Schwahn, Anna Adamczyk, Daniel A Duecker, and Sami Haddadin. Autonomous UAV mission cycling: A mobile hub approach for precise landings and continuous operations in challenging environments. In 2024 IEEE International Conference on Robotics and Automation (ICRA), pages 8450–8456, 2024.
[19] René Ranftl, Alexey Bochkovskiy, and Vladlen Koltun. Vision transformers for dense prediction. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 12179–12188, 2021.
[20] Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu, Ronghang Hu, Chaitanya Ryali, Tengyu Ma, Haitham Khedr, Roman Rädle, Chloe Rolland, Laura Gustafson, Eric Mintun, Junting Pan, Kalyan Vasudev Alwala, Nicolas Carion, Chao-Yuan Wu, Ross Girshick, Piotr Dollár, and Christoph Feichtenhofer. SAM 2: Segment anything in images and videos. arXiv:2408.00714, 2024.
[21] Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 779–788, 2016.
[22] Tianhe Ren, Yihao Chen, Qing Jiang, Zhaoyang Zeng, Yuda Xiong, Wenlong Liu, Zhengyu Ma, Junyi Shen, Yuan Gao, Xiaoke Jiang, Xingyu Chen, Zhuheng Song, Yuhong Zhang, Hongjie Huang, Han Gao, Shilong Liu, Hao Zhang, Feng Li, Kent Yu, and Lei Zhang. Dino-x: A unified vision model for open-world object detection and understanding. arXiv:2411.14347, 2024.
[23] Hidetomo Sakaino. Dynamic texts from UAV perspective natural images. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 2070–2081, 2023.
[24] Himani Sinhmar, Marcus Greiff, and Stefano Di Cairano. Practical and safe navigation function based motion planning of UAVs. In 2024 IEEE International Conference on Robotics and Automation (ICRA), pages 12186–12192,
[25] Jakub Sláma, Jáchym Herynek, and Jan Faigl. Risk-aware emergency landing planning for gliding aircraft model in urban environments. In 2023 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 4820–4826, 2023.
[26] Hongyu Song, Jincheng Yu, Jiantao Qiu, Zhixiao Sun, Kuijun Lang, Qing Luo, Yuan Shen, and Yu Wang. Multi-UAV disaster environment coverage planning with limited-endurance. In 2022 International Conference on Robotics and Automation (ICRA), pages 10760–10766. IEEE, 2022.
[27] Ali Suvizi, Suresh Subramaniam, Tian Lan, and Guru Venkataramani. Exploring in-memory accelerators and FPGAs for latency-sensitive DNN inference on edge servers. In 2024 IEEE Cloud Summit, pages 1–6, 2024.
[28] Amin Tabrizian, Mahyar Ghazanfari, and Peng Wei. Chain-of-thought flight planner: End-to-end llm routing under wind hazards. In AIAA AVIATION FORUM AND ASCEND, page 3711, 2025.
[29] Zhuoyue Tan, Boyong He, Yuxiang Ji, and Liaoni Wu. Vislanding: Monocular 3D perception for UAV safe landing via depth-normal synergy. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2025. To appear.
[30] Dante Tezza and Marvin Andujar. The state-of-the-art of human–drone interaction: A survey. IEEE Access, 7:167438–167454, 2019.
[31] Yonglin Tian, Fei Lin, Yiduo Li, Tengchao Zhang, Qiyao Zhang, Xuan Fu, Jun Huang, Xingyuan Dai, Yutong Wang, Chunwei Tian, et al. UAVs meet LLMs: Overviews and perspectives towards agentic low-altitude mobility. Information Fusion, 122:103158, 2025.
[32] Victoria Eugenia Vazquez-Meza and Jose Martinez-Carranza. Landing zone detection for MAVs using depth images and vision transformers. In Proceedings of the 15th Annual International Micro Air Vehicle Conference and Competition (IMAV 2024), pages 162–169, 2024.
[33] Can X Vu, Mahyar Ghazanfari, Kevin Dong, Abenezer Taye, Amin Tabrizian, and Peng Wei. Transformer or CNN? benchmarking real-time detection transformer and YOLOv8 for small UAS autonomous landing. In AIAA AVIATION FORUM AND ASCEND 2025, page 3521, 2025.
[34] Wing. Wing website. https://wing.com/, 2025. [Online; accessed Sep. 10, 2025].
[35] Lihe Yang, Bingyi Kang, Zilong Huang, Zhen Zhao, Xiaogang Xu, Jiashi Feng, and Hengshuang Zhao. Depth anything v2. Advances in Neural Information Processing Systems (NeurIPS), 37:21875–21911, 2024.
[36] Yian Zhao, Wenyu Lv, Shangliang Xu, Jinman Wei, Guanzhong Wang, Qingqing Dang, Yi Liu, and Jie Chen. DETRs beat YOLOs on real-time object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 16965–16974, 2024.
[37] Zipline. Zipline website. https://www.zipline.com/, 2025. [Online; accessed Sep. 10, 2025].
See&Say: Supplementary Material. Vision–Language Guided Safe Zone Detection for Autonomous Package Delivery Drones. This document provides supplementary material for the main paper, including full implementation details, all hyperparameters, complete VLM prompts,...
[38] Determine if the landing pad is safe for the current frame (true/false). Decide based on the final frame and the previous 5 frames: if there are objects on the landing pad, or there will be objects on the landing pad, declare unsafe, otherwise declare safe
[39] Provide reasoning using temporal cues and depth information
[40] Predict future safety (will conditions remain safe/unsafe?)
[41] Provide a single updated prompt list: include ALL unsafe objects/surfaces; remove safe ones (e.g. landing pad if confirmed safe, bushes, . . . ). The list must reflect the most recent scene. Unsafe objects include any moving or static objects that are not flat, or are moving and not safe for a package drop. If the drop zone with H sign is unsafe, also a...
[42] ranked": [ { Determine whether the primary landing pad with an ‘H’ marking is safe for a drop. Set landing_pad_safe = false only if you can see any object(s) inside the landing pad area. Otherwise, set landing_pad_safe = true. If you cannot locate the landing pad, set it to null and explain. 2. reasoning: 1–3 short sentences describing what you see on the pad. 3. future_predicti...

[1] [1]

Safe landing zones de- tection for UA Vs using deep regression

Sakineh Abdollahzadeh, Pier-Luc Proulx, Mohand Said Allili, and Jean-François Lapointe. Safe landing zones de- tection for UA Vs using deep regression. In2022 19th Con- ference on Robots and Vision (CRV), pages 213–218. IEEE,

work page

[2] [2]

Real-time multi-modal semantic fusion on unmanned aerial vehicles

Simon Bultmann, Jan Quenzel, and Sven Behnke. Real-time multi-modal semantic fusion on unmanned aerial vehicles. In2021 European Conference on Mobile Robots (ECMR), pages 1–8. IEEE, 2021. 2

work page 2021

[3] [3]

Visdrone-det2021: The vision meets drone object detection challenge results

Yaru Cao, Zhijian He, Lujia Wang, Wenguan Wang, Yix- uan Yuan, Dingwen Zhang, Jinglin Zhang, Pengfei Zhu, Luc Van Gool, Junwei Han, et al. Visdrone-det2021: The vision meets drone object detection challenge results. InProceed- ings of the IEEE/CVF International Conference on Com- puter Vision (ICCV), pages 2847–2854, 2021. 5

work page 2021

[4] [4]

Robust autonomous landing of UA V in non- cooperative environments based on dynamic time camera- lidar fusion.arXiv:2011.13761, 2020

Lyujie Chen, Xiaming Yuan, Yao Xiao, Yiding Zhang, and Jihong Zhu. Robust autonomous landing of UA V in non- cooperative environments based on dynamic time camera- lidar fusion.arXiv:2011.13761, 2020. 1, 2

work page arXiv 2011

[5] [5]

Yolo-world: Real-time open-vocabulary object detection

Tianheng Cheng, Lin Song, Yixiao Ge, Wenyu Liu, Xing- gang Wang, and Ying Shan. Yolo-world: Real-time open-vocabulary object detection. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition (CVPR), pages 16901–16911, 2024. 8

work page 2024

[6] [6]

Vision-Based Risk Aware Emergency Landing for UAVs in Complex Urban Environments

Julio de la Torre-Vanegas, Miguel Soriano-Garcia, Israel Be- cerra, and Diego Mercado-Ravell. Vision-based risk aware emergency landing for UA Vs in complex urban environ- ments.arXiv:2505.20423, 2025. 1, 2

work page internal anchor Pith review Pith/arXiv arXiv 2025

[7] [7]

Package delivery based on the leader-follower control paradigm for multirobot systems

Emanuele dos Santos Cardoso, Vinícius Pacheco Bacheti, and Mário Sarcinelli-Filho. Package delivery based on the leader-follower control paradigm for multirobot systems. InInternational Conference on Unmanned Aircraft Systems (ICUAS), pages 775–781. IEEE, 2023. 1

work page 2023

[8] [8]

Mid-air: A multi-modal dataset for extremely low altitude drone flights

Michael Fonder and Marc Van Droogenbroeck. Mid-air: A multi-modal dataset for extremely low altitude drone flights. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, pages 553–562, 2019. 4

work page 2019

[9] [9]

Semantic drone dataset (dronedataset)

Institute of Computer Graphics and Vision (ICG), Graz Uni- versity of Technology (TU Graz). Semantic drone dataset (dronedataset). [Online]. Available: http://dronedataset.icg. tugraz.at/, 2019. 4

work page 2019

[10] [10]

Ultralytics YOLOv8, 2023

Glenn Jocher, Ayush Chaurasia, and Jing Qiu. Ultralytics YOLOv8, 2023. 5

work page 2023

[11] [11]

Weather-aware drone-view object detection via envi- ronmental context understanding

Hyunjun Kim, Dahye Lee, Sungjune Park, and Yong Man Ro. Weather-aware drone-view object detection via envi- ronmental context understanding. In2024 IEEE Interna- tional Conference on Image Processing (ICIP), pages 549– 555, 2024. 2

work page 2024

[12] [12]

Image segmentation to identify safe landing zones for unmanned aerial vehicles

Joe Kinahan and Alan F Smeaton. Image segmentation to identify safe landing zones for unmanned aerial vehicles. Irish Conference on Artificial Intelligence and Cognitive Sci- ence (AICS), pages 235–247, 2021. 1, 2

work page 2021

[13] [13]

Richard Landis and Gary G

J. Richard Landis and Gary G. Koch. The measurement of observer agreement for categorical data.Biometrics, 33(1): 159–174, 1977. 15

work page 1977

[14] [14]

Microsoft COCO: Common objects in context

Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft COCO: Common objects in context. In European Conference on Computer Vision (ECCV), pages 740–755. Springer, 2014. 5

work page 2014

[15] [15]

A real- time and multi-sensor-based landing area recognition system for UA Vs.Drones, 6(5):118, 2022

Fei Liu, Jiayao Shan, Binyu Xiong, and Zheng Fang. A real- time and multi-sensor-based landing area recognition system for UA Vs.Drones, 6(5):118, 2022. 1

work page 2022

[16] Alina Marcu, Dragos Costea, Vlad Licaret, Mihai Pîrvu, Emil Slusanschi, and Marius Leordeanu. SafeUAV: Learning to estimate depth and safe landing areas for UAVs from synthetic data. In Proceedings of the European Conference on Computer Vision (ECCV) Workshops, pages 43–58, 2018. 2

[17] Tilemachos Mitroudas, Vasiliki Balaska, Athanasios Psomoulis, and Antonios Gasteratos. Light-weight approach for safe landing in populated areas. In 2024 IEEE International Conference on Robotics and Automation (ICRA), pages 10027–10032, 2024. 1

[18] Alexander Moortgat-Pick, Marie Schwahn, Anna Adamczyk, Daniel A Duecker, and Sami Haddadin. Autonomous UAV mission cycling: A mobile hub approach for precise landings and continuous operations in challenging environments. In 2024 IEEE International Conference on Robotics and Automation (ICRA), pages 8450–8456, 2024. 1

[19] René Ranftl, Alexey Bochkovskiy, and Vladlen Koltun. Vision transformers for dense prediction. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 12179–12188, 2021. 2

[20] Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu, Ronghang Hu, Chaitanya Ryali, Tengyu Ma, Haitham Khedr, Roman Rädle, Chloe Rolland, Laura Gustafson, Eric Mintun, Junting Pan, Kalyan Vasudev Alwala, Nicolas Carion, Chao-Yuan Wu, Ross Girshick, Piotr Dollár, and Christoph Feichtenhofer. SAM 2: Segment anything in images and videos. arXiv:2408.00714, 2024. 5

[21] Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 779–788, 2016. 5

[22] Tianhe Ren, Yihao Chen, Qing Jiang, Zhaoyang Zeng, Yuda Xiong, Wenlong Liu, Zhengyu Ma, Junyi Shen, Yuan Gao, Xiaoke Jiang, Xingyu Chen, Zhuheng Song, Yuhong Zhang, Hongjie Huang, Han Gao, Shilong Liu, Hao Zhang, Feng Li, Kent Yu, and Lei Zhang. DINO-X: A unified vision model for open-world object detection and understanding. arXiv:2411.14347, 2024. 2

[23] Hidetomo Sakaino. Dynamic texts from UAV perspective natural images. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 2070–2081, 2023. 2

[24] Himani Sinhmar, Marcus Greiff, and Stefano Di Cairano. Practical and safe navigation function based motion planning of UAVs. In 2024 IEEE International Conference on Robotics and Automation (ICRA), pages 12186–12192.

[25] Jakub Sláma, Jáchym Herynek, and Jan Faigl. Risk-aware emergency landing planning for gliding aircraft model in urban environments. In 2023 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 4820–4826, 2023. 1

[26] Hongyu Song, Jincheng Yu, Jiantao Qiu, Zhixiao Sun, Kuijun Lang, Qing Luo, Yuan Shen, and Yu Wang. Multi-UAV disaster environment coverage planning with limited-endurance. In 2022 International Conference on Robotics and Automation (ICRA), pages 10760–10766. IEEE, 2022. 1

[27] Ali Suvizi, Suresh Subramaniam, Tian Lan, and Guru Venkataramani. Exploring in-memory accelerators and FPGAs for latency-sensitive DNN inference on edge servers. In 2024 IEEE Cloud Summit, pages 1–6, 2024. 8

[28] Amin Tabrizian, Mahyar Ghazanfari, and Peng Wei. Chain-of-thought flight planner: End-to-end LLM routing under wind hazards. In AIAA AVIATION FORUM AND ASCEND, page 3711, 2025. 2

[29] Zhuoyue Tan, Boyong He, Yuxiang Ji, and Liaoni Wu. VisLanding: Monocular 3D perception for UAV safe landing via depth-normal synergy. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2025. to appear. 1, 2

[30] Dante Tezza and Marvin Andujar. The state-of-the-art of human–drone interaction: A survey. IEEE Access, 7:167438–167454, 2019. 1

[31] Yonglin Tian, Fei Lin, Yiduo Li, Tengchao Zhang, Qiyao Zhang, Xuan Fu, Jun Huang, Xingyuan Dai, Yutong Wang, Chunwei Tian, et al. UAVs meet LLMs: Overviews and perspectives towards agentic low-altitude mobility. Information Fusion, 122:103158, 2025. 2

[32] Victoria Eugenia Vazquez-Meza and Jose Martinez-Carranza. Landing zone detection for MAVs using depth images and vision transformers. In Proceedings of the 15th Annual International Micro Air Vehicle Conference and Competition (IMAV 2024), pages 162–169, 2024. 1, 2

[33] Can X Vu, Mahyar Ghazanfari, Kevin Dong, Abenezer Taye, Amin Tabrizian, and Peng Wei. Transformer or CNN? Benchmarking real-time detection transformer and YOLOv8 for small UAS autonomous landing. In AIAA AVIATION FORUM AND ASCEND 2025, page 3521, 2025. 1

[34] Wing. Wing website. https://wing.com/, 2025. [Online; accessed Sep. 10, 2025]. 1

[35] Lihe Yang, Bingyi Kang, Zilong Huang, Zhen Zhao, Xiaogang Xu, Jiashi Feng, and Hengshuang Zhao. Depth anything v2. Advances in Neural Information Processing Systems (NeurIPS), 37:21875–21911, 2024. 1

[36] Yian Zhao, Wenyu Lv, Shangliang Xu, Jinman Wei, Guanzhong Wang, Qingqing Dang, Yi Liu, and Jie Chen. DETRs beat YOLOs on real-time object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 16965–16974, 2024. 5

[37] Zipline. Zipline website. https://www.zipline.com/, 2025. [Online; accessed Sep. 10, 2025]. 1

See&Say: Supplementary Material

Vision–Language Guided Safe Zone Detection for Autonomous Package Delivery Drones

This document provides supplementary material for the main paper, including full implementation details, all hyperparameters, complete VLM prompts,...

Determine if the landing pad is safe for the current frame (true/false). Decide based on the final frame and the previous 5 frames: if there are objects on the landing pad, or there will be objects on the landing pad, declare unsafe; otherwise declare safe.

Provide reasoning using temporal cues and depth information.

Predict future safety (will conditions remain safe/unsafe?).

Provide a single updated prompt list: include ALL unsafe objects/surfaces; remove safe ones (e.g. landing pad if confirmed safe, bushes, ...). The list must reflect the most recent scene. Unsafe objects include any moving or static objects that are not flat, or are moving and not safe for a package drop. If the drop zone with H sign is unsafe, also a...
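
The prompt instructions above ask for a decision over the final frame plus the previous 5 frames, and for a prompt list that gains unsafe categories and loses confirmed-safe ones. A minimal Python sketch of that bookkeeping follows. The names `FrameWindow` and `update_prompt_list` are hypothetical and are not taken from the paper; the safety decision itself is left to the VLM and is not modeled here.

```python
from collections import deque

WINDOW = 6  # final frame plus the previous 5 frames, as the prompt describes


class FrameWindow:
    """Rolling buffer of the most recent frames (hypothetical helper)."""

    def __init__(self, size=WINDOW):
        self._frames = deque(maxlen=size)  # oldest frames fall off automatically

    def push(self, frame):
        self._frames.append(frame)

    def context(self):
        """Return (previous_frames, final_frame), or None if no frame was pushed."""
        if not self._frames:
            return None
        frames = list(self._frames)
        return frames[:-1], frames[-1]


def update_prompt_list(current, unsafe, safe):
    """Add newly unsafe categories, drop confirmed-safe ones, keep first-seen order."""
    safe_set = set(safe)
    merged = []
    for name in list(current) + list(unsafe):
        if name not in safe_set and name not in merged:
            merged.append(name)
    return merged
```

Under these assumptions, a caller would push each incoming frame, pass `context()` to the VLM, and feed the categories it reports back through `update_prompt_list` before the next detection pass.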

1. Determine whether the primary landing pad with an ‘H’ marking is safe for a drop. Set landing_pad_safe = false only if you can see any object(s) inside the landing pad area. Otherwise, set landing_pad_safe = true. If you cannot locate the landing pad, set it to null and explain.
2. reasoning: 1–3 short sentences describing what you see on the pad.
3. future_predicti...
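
The three-valued `landing_pad_safe` field (true, false, or null when the pad cannot be located) is easy to mishandle downstream. The sketch below shows one way to validate such a reply in Python, assuming the VLM returns a JSON object. Only `landing_pad_safe` and `reasoning` are checked, because the remaining fields are truncated in the source. The function names and the conservative gate in `may_drop_on_pad` are assumptions, not the authors' implementation.

```python
import json


def parse_pad_verdict(raw):
    """Parse a VLM reply and return (landing_pad_safe, reasoning).

    landing_pad_safe is True, False, or None (pad not located).
    Raises ValueError on malformed replies.
    """
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"reply is not valid JSON: {exc}") from exc
    if not isinstance(data, dict) or "landing_pad_safe" not in data:
        raise ValueError("reply lacks 'landing_pad_safe'")
    verdict = data["landing_pad_safe"]
    # JSON true/false/null map to Python True/False/None; reject anything else.
    if verdict is not None and not isinstance(verdict, bool):
        raise ValueError("'landing_pad_safe' must be true, false, or null")
    reasoning = data.get("reasoning", "")
    if not isinstance(reasoning, str):
        raise ValueError("'reasoning' must be a string")
    return verdict, reasoning.strip()


def may_drop_on_pad(verdict):
    """Conservative gate (an assumption): only an explicit True permits a drop."""
    return verdict is True
```

Treating `None` the same as `False` at the gate keeps an unlocated pad from being read as a safe one, which seems the cautious reading of the prompt.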