Sky-Ear: An Unmanned Aerial Vehicle-Enabled Victim Sound Detection and Localization System
Pith reviewed 2026-05-10 14:21 UTC · model grok-4.3
The pith
Sky-Ear mounts a circular microphone array on a UAV and uses two-stage Sentinel-Responder processing with a masking autoencoder to detect and localize victim sounds energy-efficiently for search-and-rescue.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The Sky-Ear system achieves energy-efficient acoustic sensing and sound detection for SAR by mounting a circular-shaped microphone array on a UAV and applying two-stage Sentinel and Responder audio processing. The Sentinel stage uses a Masking autoencoder-based method to analyze frequency-time acoustic features for initial detection. The Responder stage performs continuous localization by optimizing detected directions from multiple observations. Extensive simulation experiments validate the resulting victim detection accuracy and localization error.
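The energy argument rests on the gating between the two stages, which can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: `frame_energy` is a simple stand-in for the Sentinel score (it is not the paper's masking autoencoder), `peak_amplitude` is a placeholder for the Responder's heavier analysis, and the threshold value is arbitrary.

```python
from typing import Callable, List, Tuple

Frame = List[float]


def frame_energy(frame: Frame) -> float:
    """Stand-in Sentinel score: mean squared amplitude (NOT the paper's MAE)."""
    return sum(x * x for x in frame) / len(frame)


def peak_amplitude(frame: Frame) -> float:
    """Placeholder for the Responder's more expensive analysis."""
    return max(abs(x) for x in frame)


def run_pipeline(
    frames: List[Frame],
    sentinel: Callable[[Frame], float],
    responder: Callable[[Frame], float],
    threshold: float,
) -> List[Tuple[int, float]]:
    """Score every frame with the cheap sentinel; run the responder only on flagged frames."""
    results = []
    for index, frame in enumerate(frames):
        if sentinel(frame) >= threshold:
            results.append((index, responder(frame)))
    return results


# Hypothetical audio frames: quiet, loud, quiet.
frames = [
    [0.0, 0.01, -0.01, 0.0],
    [0.5, -0.6, 0.7, -0.4],
    [0.02, 0.0, -0.02, 0.01],
]
flagged = run_pipeline(frames, frame_energy, peak_amplitude, threshold=0.05)
print(flagged)
```

Only the second frame crosses the threshold, so the placeholder Responder runs once instead of three times; that is the kind of saving the two-stage design is meant to deliver.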
What carries the argument
Two-stage Sentinel-Responder audio processing pipeline on a circular microphone array, where the Sentinel stage applies a masking autoencoder to frequency-time features and the Responder stage optimizes direction estimates across observations.
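To make the role of the circular array concrete, the sketch below models far-field arrival delays on a uniform circular array and recovers the source azimuth by grid search. It is a textbook delay model, not the paper's method. The radius, microphone count, speed of sound, and all function names are assumptions, and noise, elevation, and UAV motion are ignored.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s, assumed value for air at about 20 degrees C


def mic_angles(num_mics):
    """Angular positions (rad) of microphones evenly spaced on a circle."""
    return [2.0 * math.pi * m / num_mics for m in range(num_mics)]


def far_field_delays(azimuth, radius, num_mics):
    """Arrival time at each microphone relative to the array centre (s).

    A microphone facing the source hears the plane wave earlier,
    hence the minus sign.
    """
    scale = radius / SPEED_OF_SOUND
    return [-scale * math.cos(azimuth - phi) for phi in mic_angles(num_mics)]


def estimate_azimuth(measured, radius, num_mics, steps=3600):
    """Grid search for the azimuth whose modelled delays best match the measured ones."""
    best_azimuth, best_error = 0.0, float("inf")
    for k in range(steps):
        candidate = 2.0 * math.pi * k / steps
        model = far_field_delays(candidate, radius, num_mics)
        error = sum((a - b) ** 2 for a, b in zip(measured, model))
        if error < best_error:
            best_azimuth, best_error = candidate, error
    return best_azimuth


# Hypothetical setup: 8 microphones on a 10 cm radius, source at 75 degrees.
true_azimuth = math.radians(75.0)
measured = far_field_delays(true_azimuth, radius=0.1, num_mics=8)
estimate = estimate_azimuth(measured, radius=0.1, num_mics=8)
print(round(math.degrees(estimate), 1))
```

Because the microphones surround the centre, the delay pattern differs for every azimuth, which is why a circular geometry can resolve direction over the full 360 degrees.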
If this is right
- The masking autoencoder in the Sentinel stage reduces continuous power draw while still catching victim sounds.
- Optimizing directions from multiple observations lowers localization error compared to single-pass methods.
- The circular array geometry supports reliable direction finding even when the UAV is moving.
- Simulation-validated accuracy supports deployment in energy-constrained SAR missions.
- Two-stage separation keeps the high-precision Responder stage inactive until a sound is flagged.
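The multi-observation point can be illustrated with a standard linear least-squares intersection of bearing lines. This technique is swapped in for illustration and is not claimed to be the paper's optimizer. The 2D geometry, the UAV positions, and the function name `locate_from_bearings` are all assumptions, and the bearings are noise-free.

```python
import math


def locate_from_bearings(observations):
    """Least-squares point closest to all bearing lines.

    observations: list of ((x, y), bearing_rad) pairs, where the bearing is
    the global-frame direction from the UAV position toward the source.
    Solves sum_i (I - d_i d_i^T) s = sum_i (I - d_i d_i^T) p_i for s.
    """
    a11 = a12 = a22 = b1 = b2 = 0.0
    for (px, py), theta in observations:
        dx, dy = math.cos(theta), math.sin(theta)
        # Projector onto the direction perpendicular to the bearing line.
        m11, m12, m22 = 1.0 - dx * dx, -dx * dy, 1.0 - dy * dy
        a11 += m11
        a12 += m12
        a22 += m22
        b1 += m11 * px + m12 * py
        b2 += m12 * px + m22 * py
    det = a11 * a22 - a12 * a12
    if abs(det) < 1e-12:
        raise ValueError("bearings are parallel; position is not observable")
    # Cramer's rule for the symmetric 2x2 system.
    return ((a22 * b1 - a12 * b2) / det, (a11 * b2 - a12 * b1) / det)


# Hypothetical source and three UAV waypoints (metres).
source = (30.0, 40.0)
waypoints = [(0.0, 0.0), (50.0, 0.0), (0.0, 60.0)]
observations = [
    (p, math.atan2(source[1] - p[1], source[0] - p[0])) for p in waypoints
]
estimate = locate_from_bearings(observations)
print(estimate)
```

A single bearing constrains the source only to a line, so at least two non-parallel observations are needed. With noisy bearings, additional observations average out the error, which is the intuition behind the claim that multiple observations beat single-pass estimates.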
Where Pith is reading between the lines
- Field tests in real wind and propeller noise would show how much the circular array's performance drops from the simulated ideal.
- Pairing Sky-Ear with existing visual or thermal cameras on the same UAV could cut false positives by cross-checking audio alerts.
- Adjusting the autoencoder training set to include more propeller noise samples might improve robustness on different drone models.
Load-bearing premise
That simulation experiments alone can confirm the system's victim detection accuracy and localization performance without modeling real UAV flight dynamics, wind noise, or onboard hardware limits.
What would settle it
Run the full Sky-Ear hardware on a physical UAV during controlled outdoor flights that replicate SAR conditions and measure whether actual detection accuracy and localization error match the reported simulation numbers.
Original abstract
Unmanned Aerial Vehicles (UAVs) are increasingly deployed in search-and-rescue (SAR) missions, yet continuous and reliable victim detection and localization remain challenging due to on-board hardware constraints. This paper designs a UAV-Enabled Victim Sound Detection and Localization System (called "Sky-Ear" for brevity) to achieve energy-efficient acoustic sensing and sound detection for SAR. Based on a circular-shaped microphone array, two-stage (Sentinel and Responder) audio processing is developed for energy-consuming and highly reliable sound detection. A Masking autoencoder (MAE)-based sound detection method is designed in the Sentinel stage to analyze frequency-time acoustic features. For improved precision, a continuous localization method is designed by optimizing detected directions from multiple observations. Extensive simulation experiments are conducted to validate the system's performance in terms of victim detection accuracy and localization error.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes Sky-Ear, a UAV-based victim sound detection and localization system for search-and-rescue. It employs a circular microphone array with a two-stage Sentinel/Responder pipeline: the Sentinel stage uses a Masked Autoencoder (MAE) on frequency-time features for energy-efficient detection, while the Responder stage performs detailed analysis; localization optimizes directions across multiple observations. The central claim is that this architecture achieves reliable detection and low localization error, as demonstrated by extensive simulation experiments.
Significance. If the simulation results are shown to hold under realistic conditions, the two-stage pipeline and MAE-based detection offer a concrete, practical contribution to energy-constrained acoustic sensing on UAVs for SAR. The multi-observation localization approach is a sensible way to improve precision without continuous high-power processing.
major comments (1)
- [Simulation Experiments] Simulation validation section: The manuscript states that extensive simulations validate high victim detection accuracy and low localization error, yet provides no explicit description of how the acoustic model incorporates the dominant SAR interferers: strong time-varying rotor noise at the array, Doppler/phase shifts from UAV translation and rotation, or wind turbulence. Without these, the reported performance metrics cannot be taken as evidence that the Sentinel/Responder pipeline and MAE detector will deliver the claimed reliability under operational conditions. This directly undermines the central claim that the system is both energy-efficient and highly reliable.
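The size of the Doppler effect named in this comment can be gauged with the textbook formula for a moving receiver and a stationary source. The sketch below is illustrative only: it assumes a stationary victim, no wind, a speed of sound of 343 m/s, and a hypothetical function name and flight speed that do not come from the paper.

```python
SPEED_OF_SOUND = 343.0  # m/s, assumed value for air at about 20 degrees C


def doppler_observed_frequency(source_hz, radial_speed_mps):
    """Frequency heard by a moving receiver from a stationary source.

    radial_speed_mps > 0 means the receiver (the UAV) approaches the source;
    radial_speed_mps < 0 means it recedes.
    """
    return source_hz * (SPEED_OF_SOUND + radial_speed_mps) / SPEED_OF_SOUND


# Hypothetical case: a 1 kHz component of a shout, UAV closing at 15 m/s.
approaching = doppler_observed_frequency(1000.0, 15.0)
receding = doppler_observed_frequency(1000.0, -15.0)
print(approaching, receding)
```

At this assumed speed the shift is a few percent in either direction, which moves energy across frequency bins of a frequency-time representation. That is why the simulation's treatment of UAV motion matters for a detector trained on such features.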
minor comments (1)
- [Abstract] Abstract: The phrase 'for energy-consuming and highly reliable sound detection' contradicts the earlier claim of 'energy-efficient acoustic sensing'; this appears to be a wording error that should be corrected for consistency.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and positive assessment of the potential contribution of the two-stage pipeline and multi-observation localization. We address the major comment on the simulation validation below.
- Referee: [Simulation Experiments] Simulation validation section: The manuscript states that extensive simulations validate high victim detection accuracy and low localization error, yet provides no explicit description of how the acoustic model incorporates the dominant SAR interferers: strong time-varying rotor noise at the array, Doppler/phase shifts from UAV translation and rotation, or wind turbulence. Without these, the reported performance metrics cannot be taken as evidence that the Sentinel/Responder pipeline and MAE detector will deliver the claimed reliability under operational conditions. This directly undermines the central claim that the system is both energy-efficient and highly reliable.
Authors: We agree that the current description of the acoustic model in the simulation experiments section is insufficiently detailed. In the revised manuscript we will expand this section with an explicit description of the signal generation process, including the modeling of time-varying rotor noise at the microphone array, Doppler and phase shifts induced by UAV translation and rotation, and the effects of wind turbulence. These additions will clarify the simulation assumptions and allow readers to better assess the relevance of the reported detection accuracy and localization error to operational SAR conditions.
Revision: yes
Circularity Check
No circularity in derivation chain; system design uses standard components validated by simulation.
The paper presents a UAV acoustic sensing system using a circular microphone array, a two-stage Sentinel/Responder pipeline, MAE-based detection on frequency-time features, and multi-observation direction optimization for localization. No equations, predictions, or first-principles derivations are described that reduce to fitted parameters or self-referential definitions. The approach applies established signal-processing and ML techniques without claiming novel mathematical results that loop back to inputs. Simulations are invoked for validation, but this is empirical testing rather than a derivation that is tautological by construction. No self-citation chains or uniqueness theorems are load-bearing in the provided text. The central claims rest on engineering choices and experimental outcomes, not on any step that is equivalent to its own inputs.