Multi-Channel Replay Speech Detection using Acoustic Maps
Pith reviewed 2026-05-21 12:59 UTC · model grok-4.3
The pith
Acoustic maps from multi-channel beamforming detect replay attacks by capturing directional energy differences between live speech and loudspeakers.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Acoustic maps derived from classical beamforming over discrete azimuth and elevation grids encode directional energy distributions that reflect physical differences between human speech radiation and loudspeaker-based replay. A lightweight convolutional neural network operating on this representation achieves competitive performance on the ReMASC dataset with approximately 6k trainable parameters, demonstrating that acoustic maps provide a compact and physically interpretable feature space for replay attack detection across different devices and acoustic environments.
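To give a sense of scale for the roughly 6k-parameter figure, the sketch below counts the weights of a small CNN. The layer widths, kernel sizes, and two-class output are hypothetical values chosen for illustration; the paper's actual architecture is not described on this page.

```python
def conv2d_params(c_in, c_out, kernel):
    """Weights plus biases of a 2-D convolution with a square kernel."""
    return kernel * kernel * c_in * c_out + c_out


def linear_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out


# Hypothetical stack: four 3x3 convolutions, global pooling, 2-way classifier.
layers = [
    conv2d_params(1, 8, 3),    # single-channel acoustic map in (assumed)
    conv2d_params(8, 16, 3),
    conv2d_params(16, 16, 3),
    conv2d_params(16, 16, 3),
    linear_params(16, 2),      # genuine vs. replay logits after pooling
]
total = sum(layers)
print(layers, total)
```

Under these assumed sizes the count lands in the same few-thousand range as the figure quoted above, which suggests how few convolutional layers such a budget allows.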
What carries the argument
Acoustic maps, spatial representations created by classical beamforming over discrete azimuth and elevation grids that encode directional energy distributions to highlight radiation pattern differences.
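As a rough illustration of how such a map can be built, the sketch below computes delay-and-sum output power over an azimuth and elevation grid for a single frequency bin. Delay-and-sum is an assumption here (the page only says "classical beamforming"), and the array geometry, grid spacing, and 1 kHz test frequency are hypothetical rather than taken from the paper or from ReMASC devices.

```python
import cmath
import math

SPEED_OF_SOUND = 343.0  # m/s; assumed room-temperature value


def direction(az_deg, el_deg):
    """Unit vector pointing from the array toward (azimuth, elevation)."""
    az, el = math.radians(az_deg), math.radians(el_deg)
    return (math.cos(el) * math.cos(az), math.cos(el) * math.sin(az), math.sin(el))


def steering_vector(mics, az_deg, el_deg, freq_hz):
    """Far-field phase of a plane wave from the given direction at each microphone."""
    u = direction(az_deg, el_deg)
    k = 2.0 * math.pi * freq_hz / SPEED_OF_SOUND
    return [cmath.exp(1j * k * sum(p * c for p, c in zip(mic, u))) for mic in mics]


def acoustic_map(snapshot, mics, azimuths, elevations, freq_hz):
    """Delay-and-sum output power for every (azimuth, elevation) grid cell."""
    grid = []
    for az in azimuths:
        row = []
        for el in elevations:
            a = steering_vector(mics, az, el, freq_hz)
            y = sum(w.conjugate() * x for w, x in zip(a, snapshot)) / len(mics)
            row.append(abs(y) ** 2)
        grid.append(row)
    return grid


# Hypothetical 4-microphone planar array with a 5 cm radius.
MICS = [(0.05, 0.0, 0.0), (0.0, 0.05, 0.0), (-0.05, 0.0, 0.0), (0.0, -0.05, 0.0)]
AZIMUTHS = list(range(0, 360, 30))
ELEVATIONS = [0, 30, 60]

# Simulated single-bin snapshot of a unit-amplitude source at azimuth 60, elevation 30.
snapshot = steering_vector(MICS, 60, 30, 1000.0)
amap = acoustic_map(snapshot, MICS, AZIMUTHS, ELEVATIONS, 1000.0)
peak = max((amap[i][j], AZIMUTHS[i], ELEVATIONS[j])
           for i in range(len(AZIMUTHS)) for j in range(len(ELEVATIONS)))
print(peak[1:])
```

With a noiseless simulated source the map peaks at the grid cell matching the source direction. A real recording would add reverberation and noise, and the shape of the energy spread around the peak is what the detector is meant to exploit.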
If this is right
- Replay detection becomes feasible in real-time voice assistant applications due to the low parameter count.
- The method maintains performance across different recording devices and acoustic environments.
- The feature space remains compact and interpretable, aiding analysis of why a given recording is flagged as replay.
- Physical grounding allows the detector to focus on radiation differences rather than content-specific cues.
Where Pith is reading between the lines
- Acoustic maps could be fused with temporal or spectral features to handle edge cases like very close-range replays.
- The beamforming grid resolution might be tuned per environment to further reduce false positives without increasing model size.
- Similar directional representations might help detect other audio manipulations such as voice conversion in multi-microphone setups.
Load-bearing premise
Directional energy distributions from beamforming on multi-channel audio reliably encode consistent physical differences between how human voices radiate sound and how loudspeakers do.
What would settle it
If the same lightweight CNN, run on acoustic maps from a new multi-channel replay dataset, performed worse than standard single-channel features, that would show the representation does not provide the claimed advantage.
Original abstract
Replay attacks remain a critical vulnerability for automatic speaker verification systems, particularly in real-time voice assistant applications. In this work, we propose acoustic maps as a novel spatial feature representation for replay speech detection from multi-channel recordings. Derived from classical beamforming over discrete azimuth and elevation grids, acoustic maps encode directional energy distributions that reflect physical differences between human speech radiation and loudspeaker-based replay. A lightweight convolutional neural network is designed to operate on this representation, achieving competitive performance on the ReMASC dataset with approximately 6k trainable parameters. Experimental results show that acoustic maps provide a compact and physically interpretable feature space for replay attack detection across different devices and acoustic environments.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes acoustic maps, derived from classical beamforming over discrete azimuth and elevation grids applied to multi-channel recordings, as a compact spatial feature representation for replay speech detection. A lightweight CNN (~6k parameters) is applied to these maps on the ReMASC dataset, with the authors claiming competitive performance and physical interpretability arising from directional energy distributions that differ between human speech radiation patterns and loudspeaker replay.
Significance. If the physical-interpretability claim is substantiated with direct evidence, the approach could offer an efficient, interpretable alternative to spectral or learned features for anti-spoofing in multi-channel voice-assistant scenarios. The small model size and grounding in classical beamforming are strengths that support potential deployment; however, the absence of supporting analysis currently limits the assessed impact.
major comments (2)
- [Abstract] The assertion that acoustic maps 'encode directional energy distributions that reflect physical differences between human speech radiation and loudspeaker-based replay' is presented as a core premise but is not accompanied by any map visualizations, directional energy statistics, or ablation isolating the spatial component from spectral cues. Performance numbers alone do not confirm this encoding.
- [Experimental results] The manuscript reports competitive accuracy on ReMASC yet provides no quantitative metrics, baseline comparisons, error analysis, or details on data splits, preventing full evaluation of the central claim that acoustic maps form a physically interpretable feature space across devices and environments.
minor comments (1)
- [Abstract] The abstract would be strengthened by replacing the qualitative phrase 'competitive performance' with specific accuracy or EER figures.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address each major comment point by point below and indicate the revisions planned for the next version of the paper.
- Referee: [Abstract] The assertion that acoustic maps 'encode directional energy distributions that reflect physical differences between human speech radiation and loudspeaker-based replay' is presented as a core premise but is not accompanied by any map visualizations, directional energy statistics, or ablation isolating the spatial component from spectral cues. Performance numbers alone do not confirm this encoding.
  Authors: We agree that the interpretability claim requires direct supporting evidence beyond performance numbers. In the revised manuscript we will add visualizations of acoustic maps for genuine speech and replay samples, directional energy statistics across azimuth and elevation, and an ablation that compares the full acoustic-map input against a spectrally equivalent but spatially collapsed version. These additions will substantiate the encoding of physical radiation differences.
  revision: yes
- Referee: [Experimental results] The manuscript reports competitive accuracy on ReMASC yet provides no quantitative metrics, baseline comparisons, error analysis, or details on data splits, preventing full evaluation of the central claim that acoustic maps form a physically interpretable feature space across devices and environments.
  Authors: We acknowledge that fuller experimental reporting is needed for independent evaluation. The revised version will report concrete metrics (accuracy, EER, AUC), explicit comparisons to single-channel spectral baselines and published multi-channel anti-spoofing methods, an error analysis stratified by device and environment, and the precise train/validation/test splits used on ReMASC. These details will allow readers to assess both performance and the claimed physical interpretability across conditions.
  revision: yes
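For readers unfamiliar with the equal error rate (EER) mentioned in the exchange above, a minimal sketch of how it can be computed from detector scores follows. It assumes that higher scores mean "more likely genuine"; the function name and the toy scores are illustrative and do not come from the paper.

```python
def equal_error_rate(genuine_scores, replay_scores):
    """Sweep thresholds and return the error rate where false accepts match false rejects."""
    thresholds = sorted(set(genuine_scores) | set(replay_scores))
    thresholds.append(thresholds[-1] + 1.0)  # one threshold above every score
    best_gap, eer = None, None
    for th in thresholds:
        # Replay samples scoring at or above the threshold are wrongly accepted.
        far = sum(s >= th for s in replay_scores) / len(replay_scores)
        # Genuine samples scoring below the threshold are wrongly rejected.
        frr = sum(s < th for s in genuine_scores) / len(genuine_scores)
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2.0
    return eer


# Toy scores, invented for illustration only.
genuine = [0.9, 0.8, 0.4, 0.7]
replay = [0.1, 0.5, 0.3, 0.2]
print(equal_error_rate(genuine, replay))
```

Reporting this single number alongside accuracy and AUC makes results comparable with other anti-spoofing work, which is the point of the referee's request.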
Circularity Check
No circularity: standard beamforming plus CNN on external dataset
The derivation begins with classical beamforming on discrete azimuth/elevation grids to produce acoustic maps, followed by a lightweight CNN classifier. This chain uses an established, externally defined signal-processing technique and reports empirical results on the independent ReMASC benchmark. No equations, parameters, or claims reduce to self-definition, fitted inputs renamed as predictions, or load-bearing self-citations. The physical-interpretability premise is an empirical hypothesis tested by experiment rather than a tautology.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Classical beamforming over discrete azimuth and elevation grids produces acoustic maps that encode directional energy distributions reflecting physical radiation differences.
invented entities (1)
- Acoustic maps: no independent evidence
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/AlexanderDuality.lean · alexander_duality_circle_linking · echoes
  echoes: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.
Derived from classical beamforming over discrete azimuth and elevation grids, acoustic maps encode directional energy distributions that reflect physical differences between human speech radiation and loudspeaker-based replay.
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
  unclear: relation between the paper passage and the cited Recognition theorem.
$$M_m(a, e) = \frac{1}{|B_m|\,T_s} \sum_{f \in B_m} \sum_{t=1}^{T_s} M(f, t, \alpha_a, \beta_e)$$
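The quoted formula reads as a plain average over a frequency band and over time. The sketch below implements that reading, assuming B_m is the set of frequency-bin indices in band m, T_s is the number of time frames, and M is indexed as frequency, frame, azimuth, elevation. These interpretations and the toy array are assumptions, since the page does not define the symbols.

```python
def band_time_average(M, band_bins):
    """Average M[f][t][a][e] over the bins in band_bins and over all frames."""
    n_frames = len(M[0])
    n_az = len(M[0][0])
    n_el = len(M[0][0][0])
    scale = 1.0 / (len(band_bins) * n_frames)  # 1 / (|B_m| * T_s)
    out = [[0.0] * n_el for _ in range(n_az)]
    for f in band_bins:
        for t in range(n_frames):
            for a in range(n_az):
                for e in range(n_el):
                    out[a][e] += scale * M[f][t][a][e]
    return out


# Toy input: 3 frequency bins, 2 frames, 2 azimuths, 1 elevation.
# Each entry equals f + 10 * a, so the values are constant over time.
M = [[[[float(f + 10 * a)] for a in range(2)] for _ in range(2)] for f in range(3)]
band_map = band_time_average(M, band_bins=[1, 2])
print(band_map)
```

For the toy input the band covering bins 1 and 2 averages to 1.5 at the first azimuth and 11.5 at the second, giving one compact map per band regardless of recording length.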
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.