See&Say: Vision Language Guided Safe Zone Detection for Autonomous Package Delivery Drones
Pith reviewed 2026-05-10 15:46 UTC · model grok-4.3
The pith
Vision-language guidance fuses depth and detection to create reliable safety maps for drone package drops.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
See&Say fuses monocular depth gradients with open-vocabulary detection masks to produce safety maps, while the vision-language model dynamically adjusts object category prompts and refines hazard detection across time. When the primary drop area is occupied or unsafe, the system identifies alternative candidate zones. On a curated dataset of urban delivery scenarios with moving objects and human activity, the approach records the highest accuracy and IoU for safety map prediction and stronger results in alternative zone evaluation across thresholds compared with baselines.
What carries the argument
Fusion of monocular depth gradients with open-vocabulary detection masks, guided by a vision-language model for iterative prompt adjustment and hazard refinement.
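The fusion step described above can be pictured with a minimal sketch. It assumes a dense depth grid from a monocular estimator and a boolean hazard mask from an open-vocabulary detector, and marks a pixel safe when the local depth gradient is small and no hazard covers it. The thresholding rule, the function names, and `grad_threshold` are illustrative assumptions, not details taken from the paper.

```python
from typing import List

Grid = List[List[float]]
Mask = List[List[bool]]


def depth_gradient_magnitude(depth: Grid) -> Grid:
    """Finite-difference gradient magnitude (one-sided at the borders)."""
    h, w = len(depth), len(depth[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            x0, x1 = max(x - 1, 0), min(x + 1, w - 1)
            y0, y1 = max(y - 1, 0), min(y + 1, h - 1)
            gx = (depth[y][x1] - depth[y][x0]) / max(x1 - x0, 1)
            gy = (depth[y1][x] - depth[y0][x]) / max(y1 - y0, 1)
            out[y][x] = (gx * gx + gy * gy) ** 0.5
    return out


def fuse_safety_map(depth: Grid, hazard_mask: Mask,
                    grad_threshold: float = 0.1) -> Mask:
    """Safe = locally flat (small depth gradient) AND not covered by a hazard.

    grad_threshold is a hypothetical tuning parameter for this sketch.
    """
    grad = depth_gradient_magnitude(depth)
    h, w = len(depth), len(depth[0])
    return [
        [grad[y][x] <= grad_threshold and not hazard_mask[y][x]
         for x in range(w)]
        for y in range(h)
    ]
```

A hard AND between the geometric and semantic cues is the simplest possible combination; a weighted or learned fusion would be an equally plausible reading of the description.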
If this is right
- Drones can produce more accurate safety maps for package drop decisions in cluttered settings.
- Alternative drop zones become available when the primary pad is occupied or unsafe.
- Performance holds across multiple evaluation thresholds for zone selection.
- The final delivery phase gains robustness under time-varying conditions.
- Integrated semantic and geometric reasoning outperforms isolated geometry or segmentation approaches.
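The alternative-zone fallback in the list above could look roughly like the sketch below: slide a square window over a boolean safety map, keep windows whose safe fraction meets a threshold, and pick the one closest to the primary pad. The window search, the `min_safe_fraction` parameter, and the nearest-to-pad ranking are assumptions made for illustration; the source does not specify how candidate zones are generated or ranked.

```python
from typing import List, Optional, Tuple

Mask = List[List[bool]]
Zone = Tuple[int, int, float]  # (top row, left column, safe fraction)


def candidate_zones(safe: Mask, size: int,
                    min_safe_fraction: float = 1.0) -> List[Zone]:
    """Return every size x size window whose safe fraction meets the threshold."""
    h, w = len(safe), len(safe[0])
    zones: List[Zone] = []
    for top in range(h - size + 1):
        for left in range(w - size + 1):
            count = sum(
                safe[y][x]
                for y in range(top, top + size)
                for x in range(left, left + size)
            )
            fraction = count / (size * size)
            if fraction >= min_safe_fraction:
                zones.append((top, left, fraction))
    return zones


def nearest_zone(zones: List[Zone], pad_center: Tuple[float, float],
                 size: int) -> Optional[Zone]:
    """Pick the candidate whose window center is closest to the primary pad."""
    if not zones:
        return None
    half = (size - 1) / 2

    def squared_distance(zone: Zone) -> float:
        cy, cx = zone[0] + half, zone[1] + half
        return (cy - pad_center[0]) ** 2 + (cx - pad_center[1]) ** 2

    return min(zones, key=squared_distance)
```

Lowering `min_safe_fraction` admits more candidates, which is one way an evaluation "across thresholds" could be parameterized.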
Where Pith is reading between the lines
- The same fusion pattern could apply to ground robots needing to choose safe stopping spots in crowds.
- Open-vocabulary detection reduces dependence on hand-curated lists of hazards.
- Real-time versions might feed directly into onboard flight controllers for live replanning.
- Dataset collection focused on moving urban elements could serve as a benchmark for related perception tasks.
Load-bearing premise
The vision-language model can reliably and dynamically adjust object category prompts and refine hazard detection across time in dynamic urban conditions with moving objects and human activities.
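To make the premise concrete, here is a minimal sketch of a prompt-refinement loop. It assumes the VLM is available as a callable that receives a short frame history plus the current hazard prompt list and returns an updated list. The `PromptRefiner` class, the callable signature, and the string `Frame` placeholder are hypothetical; the history length of 5 previous frames follows the prompt excerpt reproduced in the supplementary material.

```python
from collections import deque
from typing import Callable, Deque, List, Sequence

Frame = str  # placeholder type; a real system would pass image data
VlmFn = Callable[[Sequence[Frame], List[str]], List[str]]


class PromptRefiner:
    """Keeps a rolling frame window and lets a VLM rewrite the hazard prompts."""

    def __init__(self, vlm: VlmFn, initial_prompts: Sequence[str],
                 history: int = 5) -> None:
        self.vlm = vlm
        self.prompts: List[str] = list(initial_prompts)
        # Current frame plus `history` previous frames.
        self.frames: Deque[Frame] = deque(maxlen=history + 1)

    def step(self, frame: Frame) -> List[str]:
        """Add a frame, query the VLM, and store the deduplicated prompt list."""
        self.frames.append(frame)
        updated = self.vlm(tuple(self.frames), list(self.prompts))
        deduplicated: List[str] = []
        for prompt in updated:
            if prompt not in deduplicated:
                deduplicated.append(prompt)
        self.prompts = deduplicated
        return list(self.prompts)
```

The returned list would then be handed to the open-vocabulary detector on the next frame, which is where instability in the VLM output would propagate into the safety map.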
What would settle it
A sequence of real urban drone footage in which moving pedestrians or changing lighting cause the generated safety map to flag an actually safe zone as hazardous or miss a clear hazard, yielding accuracy and IoU no better than depth-only or segmentation-only baselines.
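The comparison above rests on pixel accuracy and IoU between a predicted safety map and a ground-truth map. A minimal sketch of both metrics follows, treating "safe" as the positive class; that choice, and the convention of returning an IoU of 1.0 when both maps are empty, are assumptions rather than details stated in the source.

```python
from typing import List, Tuple

Mask = List[List[bool]]


def accuracy_and_iou(pred: Mask, truth: Mask) -> Tuple[float, float]:
    """Pixel accuracy and intersection-over-union for the 'safe' class."""
    tp = fp = fn = tn = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            if p and t:
                tp += 1
            elif p and not t:
                fp += 1
            elif t:
                fn += 1
            else:
                tn += 1
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    union = tp + fp + fn
    # Assumed convention: two empty maps agree perfectly.
    iou = tp / union if union else 1.0
    return accuracy, iou
```

Running the same function on the depth-only and segmentation-only baselines gives the numbers the settling experiment would compare.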
Original abstract
Autonomous drone delivery systems are rapidly advancing, but ensuring safe and reliable package drop-offs remains highly challenging in cluttered urban and suburban environments where accurately identifying suitable package drop zones is critical. Existing approaches typically rely on either geometry-based analysis or semantic segmentation alone, but these methods lack the integrated semantic reasoning required for robust decision-making. To address this gap, we propose See&Say, a novel framework that combines geometric safety cues with semantic perception, guided by a Vision-Language Model (VLM) for iterative refinement. The system fuses monocular depth gradients with open-vocabulary detection masks to produce safety maps, while the VLM dynamically adjusts object category prompts and refines hazard detection across time, enabling reliable reasoning under dynamic conditions during the final delivery phase. When the primary drop-pad is occupied or unsafe, the proposed See&Say also identifies alternative candidate zones for package delivery. We curated a dataset of urban delivery scenarios with moving objects and human activities to evaluate the approach. Experimental results show that See&Say outperforms all baselines, achieving the highest accuracy and IoU for safety map prediction as well as superior performance in alternative drop zone evaluation across multiple thresholds. These findings highlight the promise of VLM-guided segmentation-depth fusion for advancing safe and practical drone-based package delivery.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes See&Say, a framework for identifying safe package drop zones for autonomous delivery drones in cluttered urban environments. It fuses monocular depth gradients with open-vocabulary segmentation masks and uses a Vision-Language Model (VLM) for iterative prompt adjustment and hazard refinement across time to handle dynamic conditions such as moving objects and human activity. When the primary zone is unsafe, the system identifies alternative candidate zones. The authors curate a custom dataset of urban scenarios and report that See&Say outperforms all baselines on accuracy and IoU for safety map prediction as well as on alternative-zone metrics across multiple thresholds.
Significance. If the empirical claims can be substantiated with rigorous validation, the work offers a practical integration of geometric cues, open-vocabulary detection, and VLM reasoning for safety-critical drone decisions. The emphasis on dynamic urban conditions and alternative-zone fallback addresses a concrete deployment gap in autonomous delivery systems.
major comments (3)
- [§4] §4 (Experimental Results): The abstract asserts that See&Say achieves the highest accuracy and IoU for safety map prediction plus superior alternative-zone performance, yet supplies no baseline definitions, dataset size, number of sequences, error bars, or statistical tests. This absence prevents assessment of the central outperformance claim.
- [§3 and §4] §3 (Method) and §4 (Experiments): No ablation is reported that isolates the VLM iterative refinement and dynamic prompt adjustment from the underlying depth-gradient + open-vocabulary mask fusion. Without this isolation, especially on sequences containing motion, it remains unclear whether the reported gains are attributable to the VLM component emphasized in the abstract.
- [§4] §4 (Experiments): The manuscript provides no quantitative analysis of VLM prompt stability, false-negative reduction, or failure cases on moving objects and human activities, which are the exact conditions cited as motivation for the VLM-guided approach.
minor comments (1)
- [Abstract] Abstract: Consider adding one sentence on the specific VLM backbone and the number of baselines compared to give readers immediate context for the claimed superiority.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed comments. We address each major point below and will incorporate the suggested improvements in the revised manuscript to strengthen the empirical validation.
-
Referee: [§4] §4 (Experimental Results): The abstract asserts that See&Say achieves the highest accuracy and IoU for safety map prediction plus superior alternative-zone performance, yet supplies no baseline definitions, dataset size, number of sequences, error bars, or statistical tests. This absence prevents assessment of the central outperformance claim.
Authors: We agree that the current presentation lacks sufficient detail for rigorous evaluation. In the revised manuscript we will expand §4 to explicitly define all baselines, report the exact size of the curated urban dataset (including number of scenarios, sequences, and frames), provide error bars from multiple runs or cross-validation, and include statistical significance tests (e.g., paired t-tests or Wilcoxon tests) comparing See&Say against baselines. The abstract will be updated if necessary to reference these additions. revision: yes
-
Referee: [§3 and §4] §3 (Method) and §4 (Experiments): No ablation is reported that isolates the VLM iterative refinement and dynamic prompt adjustment from the underlying depth-gradient + open-vocabulary mask fusion. Without this isolation, especially on sequences containing motion, it remains unclear whether the reported gains are attributable to the VLM component emphasized in the abstract.
Authors: We acknowledge the value of isolating the VLM contribution. We will add a dedicated ablation study in the revised §4 that compares the full See&Say pipeline (with VLM iterative refinement and dynamic prompt adjustment) against the base depth-gradient + open-vocabulary mask fusion without the VLM. Results will be reported separately on static and motion-containing sequences to quantify the incremental benefit of the VLM component. revision: yes
-
Referee: [§4] §4 (Experiments): The manuscript provides no quantitative analysis of VLM prompt stability, false-negative reduction, or failure cases on moving objects and human activities, which are the exact conditions cited as motivation for the VLM-guided approach.
Authors: We agree that quantitative characterization of the VLM's role under dynamic conditions is needed. In the revision we will add metrics for prompt stability across frames, quantitative false-negative reduction on moving objects and humans, and a breakdown of failure cases with examples drawn from the dataset. These will be presented in §4 alongside the existing results. revision: yes
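Two of the analyses promised in the responses above can be sketched briefly. The first function measures prompt stability as the mean Jaccard similarity between the prompt lists of consecutive frames; the second computes a paired t statistic over per-sequence scores. Both definitions are assumptions chosen for illustration, since the rebuttal does not say which stability metric or test variant would be used.

```python
import math
import statistics
from typing import Sequence, Tuple


def prompt_stability(prompt_lists: Sequence[Sequence[str]]) -> float:
    """Mean Jaccard similarity between consecutive per-frame prompt lists."""
    if len(prompt_lists) < 2:
        return 1.0
    scores = []
    for prev, curr in zip(prompt_lists, prompt_lists[1:]):
        a, b = set(prev), set(curr)
        union = a | b
        scores.append(len(a & b) / len(union) if union else 1.0)
    return statistics.fmean(scores)


def paired_t_statistic(ours: Sequence[float],
                       baseline: Sequence[float]) -> Tuple[float, int]:
    """Return (t statistic, degrees of freedom) for paired per-sequence scores."""
    if len(ours) != len(baseline) or len(ours) < 2:
        raise ValueError("need two equally long samples with at least 2 pairs")
    diffs = [x - y for x, y in zip(ours, baseline)]
    spread = statistics.stdev(diffs)
    if spread == 0.0:
        raise ValueError("all paired differences are identical")
    t = statistics.fmean(diffs) / (spread / math.sqrt(len(diffs)))
    return t, len(diffs) - 1
```

Turning the statistic into a p-value requires the t distribution, for which a library routine such as SciPy's `ttest_rel` would be the usual choice.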
Circularity Check
No significant circularity
The manuscript describes an applied vision-language system for drone drop-zone detection that fuses monocular depth with open-vocabulary masks and uses a VLM for prompt refinement. No equations, derivations, fitted parameters renamed as predictions, or self-citation chains appear in the provided text. All performance claims are empirical (accuracy, IoU, alternative-zone detection) evaluated on a curated dataset; none reduce by construction to the inputs or to prior self-authored results. The work is self-contained as a descriptive engineering contribution.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Vision-language models can effectively reason about scene safety and dynamically adjust prompts for hazard detection in real time.
See&Say: Supplementary Material
Vision–Language Guided Safe Zone Detection for Autonomous Package Delivery Drones
This document provides supplementary material for the main paper, including full implementation details, all hyperparameters, complete VLM prompts,...
1. Determine if the landing pad is safe for the current frame (true/false). Decide based on the final frame and the previous 5 frames: if there are objects on the landing pad, or there will be objects on the landing pad, declare unsafe, otherwise declare safe.
2. Provide reasoning using temporal cues and depth information.
3. Predict future safety (will conditions remain safe/unsafe?)
4. Provide a single updated prompt list: include ALL unsafe objects/surfaces; remove safe ones (e.g. landing pad if confirmed safe, bushes, . . . ). The list must reflect the most recent scene. Unsafe objects include any moving or static objects that are not flat, or are moving and not safe for a package drop. If the drop zone with H sign is unsafe, also a...
1. Determine whether the primary landing pad with an 'H' marking is safe for a drop. Set landing_pad_safe = false only if you can see any object(s) inside the landing pad area. Otherwise, set landing_pad_safe = true. If you cannot locate the landing pad, set it to null and explain.
2. reasoning: 1–3 short sentences describing what you see on the pad.
3. future_predicti...
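The second prompt above asks for named fields, so a consumer has to validate them before acting. The sketch below assumes the VLM replies with a JSON object; that response format is an assumption, as are the function names. Only the two fields whose names are fully visible in the excerpt (`landing_pad_safe` and `reasoning`) are read; the truncated third field is left out.

```python
import json
from typing import Optional, Tuple


def parse_pad_verdict(raw: str) -> Tuple[Optional[bool], str]:
    """Extract (landing_pad_safe, reasoning) from a JSON reply.

    landing_pad_safe may be true, false, or null (pad not located).
    """
    data = json.loads(raw)
    safe = data.get("landing_pad_safe")
    if safe is not None and not isinstance(safe, bool):
        raise ValueError("landing_pad_safe must be true, false, or null")
    reasoning = str(data.get("reasoning", ""))
    return safe, reasoning


def should_drop_on_pad(safe: Optional[bool]) -> bool:
    """Conservative rule for this sketch: drop only on an explicit 'true'."""
    return safe is True
```

Treating `null` the same as `false` is a cautious design choice for the sketch: if the pad cannot be located, the system would fall back to the alternative-zone search rather than drop.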