arxiv: 2604.12455 · v1 · submitted 2026-04-14 · 📡 eess.AS

Sky-Ear: An Unmanned Aerial Vehicle-Enabled Victim Sound Detection and Localization System

Pith reviewed 2026-05-10 14:21 UTC · model grok-4.3

classification 📡 eess.AS
keywords UAV · victim sound detection · sound localization · search and rescue · microphone array · masking autoencoder · energy efficiency · two-stage processing

The pith

Sky-Ear mounts a circular microphone array on a UAV and uses two-stage Sentinel-Responder processing with a masking autoencoder to detect and localize victim sounds energy-efficiently for search-and-rescue.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper designs Sky-Ear to solve the problem of continuous and reliable victim detection from UAVs despite hardware limits on power and sensors. It places a circular microphone array on the drone and splits audio handling into a Sentinel stage that applies a masking autoencoder to spot frequency-time patterns quickly and a Responder stage that refines location by combining direction estimates across several observations. The approach aims for both lower energy use and higher reliability than constant full-power listening. A reader would care because current UAV SAR efforts struggle with battery drain during long acoustic searches, and this method promises to keep listening active without exhausting the platform.

Core claim

The Sky-Ear system achieves energy-efficient acoustic sensing and sound detection for SAR by mounting a circular-shaped microphone array on a UAV and applying two-stage Sentinel and Responder audio processing. The Sentinel stage uses a Masking autoencoder-based method to analyze frequency-time acoustic features for initial detection. The Responder stage performs continuous localization by optimizing detected directions from multiple observations. Extensive simulation experiments validate the resulting victim detection accuracy and localization error.
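
The Sentinel idea of scoring a frequency-time patch by how badly its hidden cells are reconstructed can be sketched as follows. This is a minimal illustration, not the paper's model: the function name `masked_reconstruction_score`, the 50% mask ratio, the synthetic spectrograms, and above all the stand-in reconstructor (a per-frequency-bin mean of the visible cells) are assumptions made here. A trained masking autoencoder would take the place of that mean.

```python
import numpy as np

def masked_reconstruction_score(spec, mask_ratio=0.5, seed=0):
    """Mean squared error on the hidden cells of a (freq, time) spectrogram.

    Stand-in reconstructor (assumption): every hidden cell is predicted by the
    mean of the visible cells in the same frequency bin. A trained autoencoder
    would replace this step.
    """
    spec = np.asarray(spec, dtype=float)
    rng = np.random.default_rng(seed)
    hidden = rng.random(spec.shape) < mask_ratio   # cells withheld from the reconstructor
    visible = ~hidden
    visible_count = np.maximum(visible.sum(axis=1, keepdims=True), 1)
    bin_mean = (spec * visible).sum(axis=1, keepdims=True) / visible_count
    prediction = np.broadcast_to(bin_mean, spec.shape)
    return float(np.mean((spec - prediction)[hidden] ** 2))

# Synthetic data for illustration only: flat ambient noise, then the same
# noise with a short burst in a few frequency bins standing in for a shout.
rng = np.random.default_rng(1)
ambient = 1.0 + 0.05 * rng.standard_normal((32, 64))
with_call = ambient.copy()
with_call[8:12, 20:30] += 3.0

ambient_score = masked_reconstruction_score(ambient)
call_score = masked_reconstruction_score(with_call)
```

Under these assumptions the burst is poorly predicted from its surroundings, so `call_score` comes out well above `ambient_score`; thresholding that gap is one plausible way to flag a candidate sound.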

What carries the argument

Two-stage Sentinel-Responder audio processing pipeline on a circular microphone array, where the Sentinel stage applies a masking autoencoder to frequency-time features and the Responder stage optimizes direction estimates across observations.
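
A minimal sketch of the gating logic behind a two-stage design, assuming a cheap always-on score decides when the costly stage runs. The names `sentinel_score`, `run_pipeline`, and `peak_responder`, the mean-power score, and the threshold are all hypothetical; the paper's Sentinel uses a masking autoencoder and its Responder performs localization.

```python
import numpy as np

def sentinel_score(frame):
    """Cheap always-on score: mean power of one audio frame (stand-in for the MAE detector)."""
    return float(np.mean(np.square(frame)))

def run_pipeline(frames, threshold, responder):
    """Call the costly responder only on frames the sentinel flags."""
    results = []
    responder_calls = 0
    for index, frame in enumerate(frames):
        if sentinel_score(frame) > threshold:
            responder_calls += 1
            results.append((index, responder(frame)))
    return results, responder_calls

def peak_responder(frame):
    """Hypothetical expensive stage; here it only reports the index of the peak sample."""
    return int(np.argmax(np.abs(frame)))

# Ten quiet frames of synthetic noise, with a tone injected into frame 6.
rng = np.random.default_rng(0)
frames = [0.01 * rng.standard_normal(256) for _ in range(10)]
frames[6] = frames[6] + np.sin(np.linspace(0.0, 40.0 * np.pi, 256))

results, responder_calls = run_pipeline(frames, threshold=0.05, responder=peak_responder)
```

In this toy run the responder is invoked once, on frame 6, which is the sense in which a two-stage split can save energy: the expensive path stays idle while nothing is flagged.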

If this is right

  • The masking autoencoder in the Sentinel stage reduces continuous power draw while still catching victim sounds.
  • Optimizing directions from multiple observations lowers localization error compared to single-pass methods.
  • The circular array geometry supports reliable direction finding even when the UAV is moving.
  • Simulation-validated accuracy supports deployment in energy-constrained SAR missions.
  • Two-stage separation keeps the high-precision Responder stage inactive until a sound is flagged.
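
The multi-observation point above can be illustrated with a standard bearing-only least-squares fix in 2-D. This is a sketch of one common technique, not necessarily the optimization the paper uses; `localize_from_bearings`, the waypoints, and the source position are assumptions chosen for the example, and the bearings are noise-free.

```python
import numpy as np

def localize_from_bearings(positions, bearings_rad):
    """Least-squares intersection of 2-D bearing lines.

    positions:    (N, 2) UAV positions where each direction was measured.
    bearings_rad: (N,) direction to the source from each position,
                  counter-clockwise from the +x axis.
    """
    positions = np.asarray(positions, dtype=float)
    bearings = np.asarray(bearings_rad, dtype=float)
    # Unit normal of each bearing line: a point x on the line satisfies n . (x - p) = 0.
    normals = np.column_stack([-np.sin(bearings), np.cos(bearings)])
    rhs = np.sum(normals * positions, axis=1)
    estimate, *_ = np.linalg.lstsq(normals, rhs, rcond=None)
    return estimate

# Hypothetical flight: three waypoints and a source at (10, 5).
source = np.array([10.0, 5.0])
waypoints = np.array([[0.0, 0.0], [5.0, 0.0], [12.0, -3.0]])
offsets = source - waypoints
bearings = np.arctan2(offsets[:, 1], offsets[:, 0])

estimate = localize_from_bearings(waypoints, bearings)
```

With exact bearings the estimate coincides with the source. With noisy bearings each extra waypoint adds another row to the system, which is why combining observations along the trajectory tends to reduce error relative to a single direction estimate.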

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Field tests in real wind and propeller noise would show how much the circular array's performance drops from the simulated ideal.
  • Pairing Sky-Ear with existing visual or thermal cameras on the same UAV could cut false positives by cross-checking audio alerts.
  • Adjusting the autoencoder training set to include more propeller noise samples might improve robustness on different drone models.

Load-bearing premise

That simulation experiments alone can confirm the system's victim detection accuracy and localization performance without modeling real UAV flight dynamics, wind noise, or onboard hardware limits.

What would settle it

Run the full Sky-Ear hardware on a physical UAV during controlled outdoor flights that replicate SAR conditions and measure whether actual detection accuracy and localization error match the reported simulation numbers.

Figures

Figures reproduced from arXiv: 2604.12455 by Kevin Hung, Mingyang Wang, Yalin Liu, Yaru Fu, Yi Hong.

Figure 1: An Unmanned Aerial Vehicle-Enabled Victim Sound Detection and Localization System (Sky-Ear).
Figure 2: Anomaly detection accuracy of MAE models versus the …
Figure 3: The continuous localization results of “Sky-Ear” along a UAV’s trajectory in SAR. Two scenarios, i.e., the desert and forest, are …

Original abstract

Unmanned Aerial Vehicles (UAVs) are increasingly deployed in search-and-rescue (SAR) missions, yet continuous and reliable victim detection and localization remain challenging due to on-board hardware constraints. This paper designs an UAV-Enabled Victim Sound Detection and Localization System (called "Sky-Ear" for brevity) to achieve energy-efficient acoustic sensing and sound detection for SAR. Based on a circular-shaped microphone array, two-stage (Sentinel and Responder) audio processing is developed for energy-consuming and highly reliable sound detection. A Masking autoencoder (MAE)-based sound detection method is designed in the Sentinel stage to analyze frequency-time acoustic features. For improved precision, a continuous localization method is designed by optimizing detected directions from multiple observations. Extensive simulation experiments are conducted to validate the system's performance in terms of victim detection accuracy and localization error.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper proposes Sky-Ear, a UAV-based victim sound detection and localization system for search-and-rescue. It employs a circular microphone array with a two-stage Sentinel/Responder pipeline: the Sentinel stage uses a Masking autoencoder (MAE) on frequency-time features for energy-efficient detection, while the Responder stage refines localization by optimizing detected directions across multiple observations. The central claim is that this architecture achieves reliable detection and low localization error, as demonstrated by extensive simulation experiments.

Significance. If the simulation results are shown to hold under realistic conditions, the two-stage pipeline and MAE-based detection offer a concrete, practical contribution to energy-constrained acoustic sensing on UAVs for SAR. The multi-observation localization approach is a sensible way to improve precision without continuous high-power processing.

major comments (1)
  1. [Simulation Experiments] Simulation validation section: The manuscript states that extensive simulations validate high victim detection accuracy and low localization error, yet provides no explicit description of how the acoustic model incorporates the dominant SAR interferers—strong time-varying rotor noise at the array, Doppler/phase shifts from UAV translation and rotation, or wind turbulence. Without these, the reported performance metrics cannot be taken as evidence that the Sentinel/Responder pipeline and MAE detector will deliver the claimed reliability under operational conditions. This directly undermines the central claim that the system is both energy-efficient and highly reliable.
minor comments (1)
  1. [Abstract] Abstract: The phrase 'for energy-consuming and highly reliable sound detection' contradicts the earlier claim of 'energy-efficient acoustic sensing'; this appears to be a wording error that should be corrected for consistency.
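
To give a sense of scale for the Doppler effect named in the major comment, the textbook relation for a moving receiver and a stationary source can be evaluated directly. This is a back-of-the-envelope sketch: the 1 kHz tone, the 10 m/s radial speed, and the function name `observed_frequency` are illustrative assumptions, not values from the paper.

```python
SPEED_OF_SOUND = 343.0  # m/s in air at about 20 degrees C

def observed_frequency(source_hz, radial_speed_mps, c=SPEED_OF_SOUND):
    """Frequency heard by a moving receiver from a stationary source.

    radial_speed_mps > 0 means the receiver (the UAV) approaches the source.
    """
    return source_hz * (c + radial_speed_mps) / c

# Hypothetical case: a 1 kHz tone heard from a UAV moving at 10 m/s along the line of sight.
approaching = observed_frequency(1000.0, 10.0)
receding = observed_frequency(1000.0, -10.0)
```

At these assumed values the shift is roughly 29 Hz in either direction, small next to the bandwidth of a shout but not obviously negligible for phase-based direction finding, which is why the comment asks for it to be modeled explicitly.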

Simulated Authors' Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback and positive assessment of the potential contribution of the two-stage pipeline and multi-observation localization. We address the major comment on the simulation validation below.

Point-by-point responses
  1. Referee: [Simulation Experiments] Simulation validation section: The manuscript states that extensive simulations validate high victim detection accuracy and low localization error, yet provides no explicit description of how the acoustic model incorporates the dominant SAR interferers—strong time-varying rotor noise at the array, Doppler/phase shifts from UAV translation and rotation, or wind turbulence. Without these, the reported performance metrics cannot be taken as evidence that the Sentinel/Responder pipeline and MAE detector will deliver the claimed reliability under operational conditions. This directly undermines the central claim that the system is both energy-efficient and highly reliable.

    Authors: We agree that the current description of the acoustic model in the simulation experiments section is insufficiently detailed. In the revised manuscript we will expand this section with an explicit description of the signal generation process, including the modeling of time-varying rotor noise at the microphone array, Doppler and phase shifts induced by UAV translation and rotation, and the effects of wind turbulence. These additions will clarify the simulation assumptions and allow readers to better assess the relevance of the reported detection accuracy and localization error to operational SAR conditions.

    Revision: yes

Circularity Check

0 steps flagged

No circularity in derivation chain; system design uses standard components validated by simulation.

The paper presents a UAV acoustic sensing system using a circular microphone array, a two-stage Sentinel/Responder pipeline, MAE-based detection on frequency-time features, and multi-observation direction optimization for localization. No equations, predictions, or first-principles derivations are described that reduce to fitted parameters or self-referential definitions. The approach applies established signal-processing and ML techniques without claiming novel mathematical results that loop back to inputs. Simulations are invoked for validation, but this is empirical testing rather than a derivation that is tautological by construction. No self-citation chains or uniqueness theorems are load-bearing in the provided text. The central claims rest on engineering choices and experimental outcomes, not on any step that is equivalent to its own inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The abstract does not introduce or rely on any free parameters, axioms, or invented entities beyond naming the system 'Sky-Ear'.
