Sky-Ear: An Unmanned Aerial Vehicle-Enabled Victim Sound Detection and Localization System

Kevin Hung; Mingyang Wang; Yalin Liu; Yaru Fu; Yi Hong

arxiv: 2604.12455 · v1 · submitted 2026-04-14 · 📡 eess.AS

Pith reviewed 2026-05-10 14:21 UTC · model grok-4.3

classification 📡 eess.AS

keywords UAV · victim sound detection · sound localization · search and rescue · microphone array · masking autoencoder · energy efficiency · two-stage processing

The pith

Sky-Ear mounts a circular microphone array on a UAV and uses two-stage Sentinel-Responder processing with a masking autoencoder to detect and localize victim sounds energy-efficiently for search-and-rescue.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper designs Sky-Ear to solve the problem of continuous and reliable victim detection from UAVs despite hardware limits on power and sensors. It places a circular microphone array on the drone and splits audio handling into a Sentinel stage that applies a masking autoencoder to spot frequency-time patterns quickly and a Responder stage that refines location by combining direction estimates across several observations. The approach aims for both lower energy use and higher reliability than constant full-power listening. A reader would care because current UAV SAR efforts struggle with battery drain during long acoustic searches, and this method promises to keep listening active without exhausting the platform.

Core claim

The Sky-Ear system achieves energy-efficient acoustic sensing and sound detection for SAR by mounting a circular-shaped microphone array on a UAV and applying two-stage Sentinel and Responder audio processing. The Sentinel stage uses a Masking autoencoder-based method to analyze frequency-time acoustic features for initial detection. The Responder stage performs continuous localization by optimizing detected directions from multiple observations. Extensive simulation experiments validate the resulting victim detection accuracy and localization error.

What carries the argument

Two-stage Sentinel-Responder audio processing pipeline on a circular microphone array, where the Sentinel stage applies a masking autoencoder to frequency-time features and the Responder stage optimizes direction estimates across observations.
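The gating idea behind the two stages can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: it assumes the Sentinel scores each frame by reconstruction error (a common way to use an autoencoder for anomaly detection, which the summary does not confirm), and `reconstruct`, `responder`, `threshold`, and the flattened `Frame` layout are all hypothetical stand-ins.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

Frame = Sequence[float]  # hypothetical: one flattened frequency-time feature frame


def reconstruction_error(frame: Frame, reconstruct: Callable[[Frame], Frame]) -> float:
    """Mean squared error between a frame and its reconstruction."""
    rebuilt = reconstruct(frame)
    return sum((a - b) ** 2 for a, b in zip(frame, rebuilt)) / len(frame)


@dataclass
class TwoStagePipeline:
    reconstruct: Callable[[Frame], Frame]  # stand-in for a trained autoencoder
    responder: Callable[[Frame], float]    # stand-in for the costly second stage
    threshold: float                       # hypothetical anomaly threshold

    def process(self, frames: List[Frame]) -> List[Tuple[int, float]]:
        """Run the cheap scorer on every frame; wake the responder only on flags."""
        results = []
        for index, frame in enumerate(frames):
            score = reconstruction_error(frame, self.reconstruct)
            if score > self.threshold:
                results.append((index, self.responder(frame)))
        return results


# Toy usage: the "autoencoder" just predicts the frame mean, so flat frames
# reconstruct well and a strongly varying frame is flagged.
def mean_reconstruct(frame: Frame) -> Frame:
    mean = sum(frame) / len(frame)
    return [mean] * len(frame)


responder_calls = []


def toy_responder(frame: Frame) -> float:
    responder_calls.append(len(frame))
    return 0.0  # placeholder direction estimate


pipeline = TwoStagePipeline(mean_reconstruct, toy_responder, threshold=0.5)
frames = [[1.0, 1.0, 1.0, 1.0], [0.0, 4.0, 0.0, 4.0], [2.0, 2.1, 1.9, 2.0]]
flagged = pipeline.process(frames)
```

In this toy run only the middle frame crosses the threshold, so the expensive stage runs once for three frames; that asymmetry is where any energy saving would come from.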

If this is right

The masking autoencoder in the Sentinel stage reduces continuous power draw while still catching victim sounds.
Optimizing directions from multiple observations lowers localization error compared to single-pass methods.
The circular array geometry supports reliable direction finding even when the UAV is moving.
Simulation-validated accuracy supports deployment in energy-constrained SAR missions.
Two-stage separation keeps the high-precision Responder stage inactive until a sound is flagged.
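To make the multi-observation point concrete, here is a generic bearing-only least-squares fix in 2D. It is a sketch under stated assumptions, not the paper's optimizer: the summary does not give the objective, so this swaps in the textbook formulation where each observation is a UAV position plus an azimuth, and the source is the point minimizing the squared perpendicular distance to all bearing lines. The function name and the flat-ground 2D geometry are assumptions.

```python
import math


def localize_from_bearings(observations):
    """Least-squares intersection of bearing lines.

    observations: list of (x, y, azimuth_rad), where (x, y) is the UAV
    position and azimuth_rad is the estimated direction toward the source.
    Returns the (x, y) point closest, in the squared sense, to all lines.
    """
    a11 = a12 = a22 = b1 = b2 = 0.0
    for x, y, azimuth in observations:
        # Unit normal to the bearing line; the source s satisfies n . (s - p) = 0.
        nx, ny = -math.sin(azimuth), math.cos(azimuth)
        offset = nx * x + ny * y
        a11 += nx * nx
        a12 += nx * ny
        a22 += ny * ny
        b1 += nx * offset
        b2 += ny * offset
    det = a11 * a22 - a12 * a12
    if abs(det) < 1e-12:
        raise ValueError("bearings are nearly parallel; position is not observable")
    # Solve the 2x2 normal equations by Cramer's rule.
    return ((a22 * b1 - a12 * b2) / det, (a11 * b2 - a12 * b1) / det)


# Toy usage: a source at (30, 40) observed from three points on a straight track.
source = (30.0, 40.0)
track = [(0.0, 0.0), (20.0, 0.0), (40.0, 0.0)]
observations = [
    (x, y, math.atan2(source[1] - y, source[0] - x)) for x, y in track
]
estimate = localize_from_bearings(observations)
```

With noise-free bearings the estimate coincides with the source; with noisy bearings, each extra observation adds a row to the normal equations, which is the mechanism by which multiple observations can reduce error. Note that this formulation treats each bearing as a full line, so it does not resolve front/back ambiguity on its own.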

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Field tests in real wind and propeller noise would show how much the circular array's performance drops from the simulated ideal.
Pairing Sky-Ear with existing visual or thermal cameras on the same UAV could cut false positives by cross-checking audio alerts.
Adjusting the autoencoder training set to include more propeller noise samples might improve robustness on different drone models.

Load-bearing premise

That simulation experiments alone can confirm the system's victim detection accuracy and localization performance without modeling real UAV flight dynamics, wind noise, or onboard hardware limits.

What would settle it

Run the full Sky-Ear hardware on a physical UAV during controlled outdoor flights that replicate SAR conditions and measure whether actual detection accuracy and localization error match the reported simulation numbers.

Figures

Figures reproduced from arXiv: 2604.12455 by Kevin Hung, Mingyang Wang, Yalin Liu, Yaru Fu, Yi Hong.

**Figure 2.** Anomaly detection accuracy of MAE models versus the ...

**Figure 3.** The continuous localization results of “Sky-Ear” along a UAV’s trajectory in SAR. Two scenarios, i.e., the desert and forest, are ...

Original abstract

Unmanned Aerial Vehicles (UAVs) are increasingly deployed in search-and-rescue (SAR) missions, yet continuous and reliable victim detection and localization remain challenging due to on-board hardware constraints. This paper designs an UAV-Enabled Victim Sound Detection and Localization System (called “Sky-Ear” for brevity) to achieve energy-efficient acoustic sensing and sound detection for SAR. Based on a circular-shaped microphone array, two-stage (Sentinel and Responder) audio processing is developed for energy-consuming and highly reliable sound detection. A Masking autoencoder (MAE)-based sound detection method is designed in the Sentinel stage to analyze frequency-time acoustic features. For improved precision, a continuous localization method is designed by optimizing detected directions from multiple observations. Extensive simulation experiments are conducted to validate the system's performance in terms of victim detection accuracy and localization error.
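For readers unfamiliar with why a circular array yields direction estimates, the following sketch shows the standard far-field model: a plane wave from azimuth θ reaches each microphone at a slightly different time, and the pattern of delays identifies θ. The microphone count, radius, and grid-search estimator are assumptions chosen for illustration; the abstract does not specify the array parameters or the direction-finding method.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s in air at roughly 20 degrees C


def mic_angles(num_mics):
    """Angular positions of microphones evenly spaced on a circle."""
    return [2.0 * math.pi * m / num_mics for m in range(num_mics)]


def far_field_delays(azimuth, num_mics, radius):
    """Arrival time at each mic relative to the array centre (seconds).

    Mics on the side facing the source hear the wavefront earlier,
    hence the negative sign.
    """
    return [
        -radius * math.cos(azimuth - phi) / SPEED_OF_SOUND
        for phi in mic_angles(num_mics)
    ]


def estimate_azimuth(measured_delays, num_mics, radius, step_deg=1.0):
    """Grid search for the azimuth whose modelled delays best fit the data."""
    best_angle, best_cost = 0.0, float("inf")
    for k in range(int(round(360.0 / step_deg))):
        candidate = math.radians(k * step_deg)
        model = far_field_delays(candidate, num_mics, radius)
        cost = sum((a - b) ** 2 for a, b in zip(measured_delays, model))
        if cost < best_cost:
            best_angle, best_cost = candidate, cost
    return best_angle


# Toy usage with hypothetical parameters: 6 mics on a 10 cm radius circle.
true_azimuth = math.radians(57.0)
delays = far_field_delays(true_azimuth, num_mics=6, radius=0.1)
estimated = estimate_azimuth(delays, num_mics=6, radius=0.1)
```

The delays here are at most about 0.3 ms for a 10 cm radius, which hints at why the rotor noise and platform motion raised in the editorial analysis matter: small timing errors translate directly into direction errors.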

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Sky-Ear applies a two-stage MAE pipeline and multi-observation localization to UAV acoustic SAR but relies on simulations that likely overlook rotor noise and flight dynamics.

This paper gives a workable two-stage system for energy-efficient sound detection from UAVs in rescue missions, but the simulation validation doesn't address the key real-world noise and motion problems. They mount a circular microphone array and split the work into a low-power Sentinel stage that runs a masking autoencoder on frequency-time features to flag candidate sounds, then a Responder stage for deeper checks. Localization improves by optimizing direction estimates across multiple observations as the drone flies.

The energy-saving angle is the strongest part. Continuous high-resolution audio would drain batteries fast, so gating with a lightweight detector is a sensible engineering choice. The MAE step fits because it already works on spectrogram-style inputs elsewhere, and the multi-observation optimization is a straightforward way to gain precision from the UAV's own movement without extra hardware.

The main weakness is the evidence. Claims of high detection accuracy and low localization error rest on extensive simulations, yet nothing in the description shows they modeled the dominant interferers: loud varying rotor noise at the mics, phase and Doppler shifts from translation and rotation, or wind turbulence. Without those, the numbers stay optimistic and do not demonstrate reliability under actual SAR conditions.

This is aimed at applied researchers building drone-based rescue systems who already know acoustic array basics. It extends known techniques to the energy-constrained UAV setting rather than introducing new theory.

I would bring it to a reading group to discuss the pipeline design, but I would not cite it unless the simulations get substantially more realistic. It deserves peer review because the application matters and the approach is coherent, even if the validation section needs tightening on realism.
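The energy argument in the letter can be stated as a back-of-envelope duty-cycle model. The symbols are assumptions introduced here for illustration and are not quantities reported in the paper: $P_S$ is the always-on Sentinel power, $P_R$ the additional power drawn while the Responder is awake, $\rho \in [0, 1]$ the fraction of time the Responder is awake, and $P_{\text{full}}$ the draw of continuous full-resolution processing.

```latex
\bar{P} = P_S + \rho\, P_R ,
\qquad
\bar{P} < P_{\text{full}}
\iff
\rho < \frac{P_{\text{full}} - P_S}{P_R}
\quad (P_R > 0).
```

Under this simple model, gating pays off only while the Responder's wake fraction stays below that bound, so a Sentinel with a high false-alarm rate could erode the saving.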

Referee Report

1 major / 1 minor

Summary. The paper proposes Sky-Ear, a UAV-based victim sound detection and localization system for search-and-rescue. It employs a circular microphone array with a two-stage Sentinel/Responder pipeline: the Sentinel stage uses a Masked Autoencoder (MAE) on frequency-time features for energy-efficient detection, while the Responder stage performs detailed analysis; localization optimizes directions across multiple observations. The central claim is that this architecture achieves reliable detection and low localization error, as demonstrated by extensive simulation experiments.

Significance. If the simulation results are shown to hold under realistic conditions, the two-stage pipeline and MAE-based detection offer a concrete, practical contribution to energy-constrained acoustic sensing on UAVs for SAR. The multi-observation localization approach is a sensible way to improve precision without continuous high-power processing.

major comments (1)

[Simulation Experiments] Simulation validation section: The manuscript states that extensive simulations validate high victim detection accuracy and low localization error, yet provides no explicit description of how the acoustic model incorporates the dominant SAR interferers—strong time-varying rotor noise at the array, Doppler/phase shifts from UAV translation and rotation, or wind turbulence. Without these, the reported performance metrics cannot be taken as evidence that the Sentinel/Responder pipeline and MAE detector will deliver the claimed reliability under operational conditions. This directly undermines the central claim that the system is both energy-efficient and highly reliable.
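As an illustration of what an explicit signal-generation description might pin down, the sketch below mixes a clean tone with noise at a chosen signal-to-noise ratio and computes the Doppler-shifted frequency seen by a moving receiver. It is not the authors' simulator: white Gaussian noise is only a stand-in for rotor noise (which in practice has strong tonal harmonics), and the sample rate, tone, SNR, and closing speed are hypothetical values.

```python
import math
import random

SPEED_OF_SOUND = 343.0  # m/s in air at roughly 20 degrees C


def mean_power(samples):
    """Average squared amplitude of a sample sequence."""
    return sum(s * s for s in samples) / len(samples)


def mix_at_snr(signal, noise, snr_db):
    """Scale the noise so that signal power / noise power equals snr_db, then add."""
    if len(signal) != len(noise):
        raise ValueError("signal and noise must have the same length")
    target_noise_power = mean_power(signal) / (10.0 ** (snr_db / 10.0))
    gain = math.sqrt(target_noise_power / mean_power(noise))
    return [s + gain * n for s, n in zip(signal, noise)]


def doppler_shifted_frequency(source_hz, closing_speed):
    """Frequency heard by a receiver moving toward a stationary source.

    closing_speed is the receiver's speed along the line to the source
    in m/s (negative when moving away).
    """
    return source_hz * (SPEED_OF_SOUND + closing_speed) / SPEED_OF_SOUND


# Toy usage with hypothetical values.
random.seed(0)
sample_rate = 8000
voice = [math.sin(2.0 * math.pi * 440.0 * t / sample_rate) for t in range(sample_rate)]
rotor = [random.gauss(0.0, 1.0) for _ in range(sample_rate)]  # stand-in noise
noisy = mix_at_snr(voice, rotor, snr_db=-10.0)

residual = [m - v for m, v in zip(noisy, voice)]
achieved_snr_db = 10.0 * math.log10(mean_power(voice) / mean_power(residual))
shifted_hz = doppler_shifted_frequency(440.0, closing_speed=10.0)
```

Reporting the SNR range at the microphones and the UAV speeds used would let readers judge how close the simulated conditions are to flight.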

minor comments (1)

[Abstract] Abstract: The phrase 'for energy-consuming and highly reliable sound detection' contradicts the earlier claim of 'energy-efficient acoustic sensing'; this appears to be a wording error that should be corrected for consistency.

Simulated Authors' Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback and positive assessment of the potential contribution of the two-stage pipeline and multi-observation localization. We address the major comment on the simulation validation below.

Referee: [Simulation Experiments] Simulation validation section: The manuscript states that extensive simulations validate high victim detection accuracy and low localization error, yet provides no explicit description of how the acoustic model incorporates the dominant SAR interferers—strong time-varying rotor noise at the array, Doppler/phase shifts from UAV translation and rotation, or wind turbulence. Without these, the reported performance metrics cannot be taken as evidence that the Sentinel/Responder pipeline and MAE detector will deliver the claimed reliability under operational conditions. This directly undermines the central claim that the system is both energy-efficient and highly reliable.

Authors: We agree that the current description of the acoustic model in the simulation experiments section is insufficiently detailed. In the revised manuscript we will expand this section with an explicit description of the signal generation process, including the modeling of time-varying rotor noise at the microphone array, Doppler and phase shifts induced by UAV translation and rotation, and the effects of wind turbulence. These additions will clarify the simulation assumptions and allow readers to better assess the relevance of the reported detection accuracy and localization error to operational SAR conditions.

Revision: yes

Circularity Check

0 steps flagged

No circularity in derivation chain; system design uses standard components validated by simulation.

The paper presents a UAV acoustic sensing system using a circular microphone array, a two-stage Sentinel/Responder pipeline, MAE-based detection on frequency-time features, and multi-observation direction optimization for localization. No equations, predictions, or first-principles derivations are described that reduce to fitted parameters or self-referential definitions. The approach applies established signal-processing and ML techniques without claiming novel mathematical results that loop back to inputs. Simulations are invoked for validation, but this is empirical testing rather than a derivation that is tautological by construction. No self-citation chains or uniqueness theorems are load-bearing in the provided text. The central claims rest on engineering choices and experimental outcomes, not on any step that is equivalent to its own inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The abstract does not introduce or rely on any free parameters, axioms, or invented entities beyond naming the system 'Sky-Ear'.

