pith. sign in

arxiv: 2602.16399 · v2 · pith:44QQPHGOnew · submitted 2026-02-18 · 📡 eess.AS · cs.LG· cs.SD

Multi-Channel Replay Speech Detection using Acoustic Maps

Pith reviewed 2026-05-21 12:59 UTC · model grok-4.3

classification 📡 eess.AS cs.LGcs.SD
keywords replay attack detectionacoustic mapsmulti-channel audiobeamformingspeaker verificationconvolutional neural networkReMASC datasetspatial features
0
0 comments X

The pith

Acoustic maps from multi-channel beamforming detect replay attacks by capturing directional energy differences between live speech and loudspeakers.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces acoustic maps as a spatial feature representation for identifying replayed speech in automatic speaker verification systems. These maps are created by applying classical beamforming across grids of azimuth and elevation angles to multi-channel recordings, which highlights how sound radiates differently from a human mouth compared to a loudspeaker. A small convolutional neural network then classifies these maps, reaching competitive results on the ReMASC dataset while using only around six thousand parameters. The approach targets real-time voice assistant security by offering a compact, physically grounded alternative to more complex models. Success would mean reliable detection that holds up across varied recording devices and room acoustics.

Core claim

Acoustic maps derived from classical beamforming over discrete azimuth and elevation grids encode directional energy distributions that reflect physical differences between human speech radiation and loudspeaker-based replay. A lightweight convolutional neural network operating on this representation achieves competitive performance on the ReMASC dataset with approximately 6k trainable parameters, demonstrating that acoustic maps provide a compact and physically interpretable feature space for replay attack detection across different devices and acoustic environments.

What carries the argument

Acoustic maps, spatial representations created by classical beamforming over discrete azimuth and elevation grids that encode directional energy distributions to highlight radiation pattern differences.

If this is right

  • Replay detection becomes feasible in real-time voice assistant applications due to the low parameter count.
  • The method maintains performance across different recording devices and acoustic environments.
  • The feature space remains compact and interpretable, aiding analysis of why a given recording is flagged as replay.
  • Physical grounding allows the detector to focus on radiation differences rather than content-specific cues.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Acoustic maps could be fused with temporal or spectral features to handle edge cases like very close-range replays.
  • The beamforming grid resolution might be tuned per environment to further reduce false positives without increasing model size.
  • Similar directional representations might help detect other audio manipulations such as voice conversion in multi-microphone setups.

Load-bearing premise

Directional energy distributions from beamforming on multi-channel audio reliably encode consistent physical differences between how human voices radiate sound and how loudspeakers do.

What would settle it

Running the same lightweight CNN on acoustic maps from a new multi-channel replay dataset where performance drops below that of standard single-channel features would show the representation does not provide the claimed advantage.

Figures

Figures reproduced from arXiv: 2602.16399 by Michael Neri, Tuomas Virtanen.

Figure 1
Figure 1. Figure 1: Spatial distribution of acoustic maps from delay-and-sum beamformer across azimuth and elevation angles for two utterances from the ReMASC [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Microphone-wise performance in both generalization scenarios. [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗
read the original abstract

Replay attacks remain a critical vulnerability for automatic speaker verification systems, particularly in real-time voice assistant applications. In this work, we propose acoustic maps as a novel spatial feature representation for replay speech detection from multi-channel recordings. Derived from classical beamforming over discrete azimuth and elevation grids, acoustic maps encode directional energy distributions that reflect physical differences between human speech radiation and loudspeaker-based replay. A lightweight convolutional neural network is designed to operate on this representation, achieving competitive performance on the ReMASC dataset with approximately 6k trainable parameters. Experimental results show that acoustic maps provide a compact and physically interpretable feature space for replay attack detection across different devices and acoustic environments.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes acoustic maps, derived from classical beamforming over discrete azimuth and elevation grids applied to multi-channel recordings, as a compact spatial feature representation for replay speech detection. A lightweight CNN (~6k parameters) is applied to these maps on the ReMASC dataset, with the authors claiming competitive performance and physical interpretability arising from directional energy distributions that differ between human speech radiation patterns and loudspeaker replay.

Significance. If the physical-interpretability claim is substantiated with direct evidence, the approach could offer an efficient, interpretable alternative to spectral or learned features for anti-spoofing in multi-channel voice-assistant scenarios. The small model size and grounding in classical beamforming are strengths that support potential deployment; however, the absence of supporting analysis currently limits the assessed impact.

major comments (2)
  1. [Abstract] Abstract: The assertion that acoustic maps 'encode directional energy distributions that reflect physical differences between human speech radiation and loudspeaker-based replay' is presented as a core premise but is not accompanied by any map visualizations, directional energy statistics, or ablation isolating the spatial component from spectral cues. Performance numbers alone do not confirm this encoding.
  2. [Experimental results] Experimental results section: The manuscript reports competitive accuracy on ReMASC yet provides no quantitative metrics, baseline comparisons, error analysis, or details on data splits, preventing full evaluation of the central claim that acoustic maps form a physically interpretable feature space across devices and environments.
minor comments (1)
  1. [Abstract] The abstract would be strengthened by replacing the qualitative phrase 'competitive performance' with specific accuracy or EER figures.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment point by point below and indicate the revisions planned for the next version of the paper.

read point-by-point responses
  1. Referee: [Abstract] Abstract: The assertion that acoustic maps 'encode directional energy distributions that reflect physical differences between human speech radiation and loudspeaker-based replay' is presented as a core premise but is not accompanied by any map visualizations, directional energy statistics, or ablation isolating the spatial component from spectral cues. Performance numbers alone do not confirm this encoding.

    Authors: We agree that the interpretability claim requires direct supporting evidence beyond performance numbers. In the revised manuscript we will add visualizations of acoustic maps for genuine speech and replay samples, directional energy statistics across azimuth and elevation, and an ablation that compares the full acoustic-map input against a spectrally equivalent but spatially collapsed version. These additions will substantiate the encoding of physical radiation differences. revision: yes

  2. Referee: [Experimental results] Experimental results section: The manuscript reports competitive accuracy on ReMASC yet provides no quantitative metrics, baseline comparisons, error analysis, or details on data splits, preventing full evaluation of the central claim that acoustic maps form a physically interpretable feature space across devices and environments.

    Authors: We acknowledge that fuller experimental reporting is needed for independent evaluation. The revised version will report concrete metrics (accuracy, EER, AUC), explicit comparisons to single-channel spectral baselines and published multi-channel anti-spoofing methods, an error analysis stratified by device and environment, and the precise train/validation/test splits used on ReMASC. These details will allow readers to assess both performance and the claimed physical interpretability across conditions. revision: yes

Circularity Check

0 steps flagged

No circularity: standard beamforming plus CNN on external dataset

full rationale

The derivation begins with classical beamforming on discrete azimuth/elevation grids to produce acoustic maps, followed by a lightweight CNN classifier. This chain uses an established, externally defined signal-processing technique and reports empirical results on the independent ReMASC benchmark. No equations, parameters, or claims reduce to self-definition, fitted inputs renamed as predictions, or load-bearing self-citations. The physical-interpretability premise is an empirical hypothesis tested by experiment rather than a tautology.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

The central claim rests on the assumption that beamforming-derived directional energy maps capture physically meaningful differences between real and replayed speech; the CNN is a standard classifier with no additional invented components detailed.

axioms (1)
  • domain assumption Classical beamforming over discrete azimuth and elevation grids produces acoustic maps that encode directional energy distributions reflecting physical radiation differences.
    This premise is invoked to justify why the maps are useful for distinguishing human speech from loudspeaker replay.
invented entities (1)
  • Acoustic maps no independent evidence
    purpose: Compact spatial feature representation for replay speech detection.
    Introduced as a novel derived representation from existing beamforming; no independent falsifiable evidence outside the method is provided in the abstract.

pith-pipeline@v0.9.0 · 5634 in / 1246 out tokens · 79661 ms · 2026-05-21T12:59:27.145339+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

19 extracted references · 19 canonical work pages

  1. [1]

    Stop deceiving! an effective defense scheme against voice impersonation attacks on smart devices,

    W. Huang, W. Tang, H. Jiang, J. Luo, and Y . Zhang, “Stop deceiving! an effective defense scheme against voice impersonation attacks on smart devices,”IEEE Internet of Things Journal, vol. 9, no. 7, pp. 5304–5314, 2022

  2. [2]

    ASVspoof 2021: Towards Spoofed and Deepfake Speech Detection in the Wild,

    X. Liu, X. Wang, M. Sahidullah, J. Patino, H. Delgado, T. Kinnunen, M. Todisco, J. Yamagishi, N. Evans, A. Nautsch, and K. A. Lee, “ASVspoof 2021: Towards Spoofed and Deepfake Speech Detection in the Wild,”IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 31, pp. 2507–2522, 2023

  3. [3]

    RedDots replayed: A new replay spoofing attack corpus for text-dependent speaker verification research,

    T. Kinnunen, M. Sahidullah, M. Falcone, L. Costantini, R. G. Hau- tam¨aki, D. Thomsen, A. Sarkar, Z. Tan, H. Delgado, M. Todisco, N. Evans, V . Hautam ¨aki, and K. A. Lee, “RedDots replayed: A new replay spoofing attack corpus for text-dependent speaker verification research,” inIEEE International Conference on Acoustics, Speech and Signal Processing (ICA...

  4. [4]

    Detecting replay attacks using multi-channel audio: A neural network-based method,

    Y . Gong, J. Yang, and C. Poellabauer, “Detecting replay attacks using multi-channel audio: A neural network-based method,”IEEE Signal Processing Letters, vol. 27, pp. 920–924, 2020

  5. [5]

    A survey of biometric recognition methods,

    K. Delac and M. Grgic, “A survey of biometric recognition methods,” in Proceedings. Elmar-2004. 46th International Symposium on Electronics in Marine, 2004

  6. [6]

    Multi-channel replay speech detection using an adaptive learnable beamformer,

    M. Neri and T. Virtanen, “Multi-channel replay speech detection using an adaptive learnable beamformer,”IEEE Open Journal of Signal Processing, pp. 1–7, 2025

  7. [7]

    Re- MASC: Realistic Replay Attack Corpus for V oice Controlled Systems,

    Y . Gong, J. Yang, J. Huber, M. MacKnight, and C. Poellabauer, “Re- MASC: Realistic Replay Attack Corpus for V oice Controlled Systems,” Interspeech, 2019

  8. [8]

    The ASVspoof 2017 challenge: Assessing the limits of replay spoofing attack detection,

    T. Kinnunen, M. Sahidullah, H. Delgado, M. Todisco, N. Evans, J. Ya- magishi, and K. A. Lee, “The ASVspoof 2017 challenge: Assessing the limits of replay spoofing attack detection,” inInterspeech, 2017

  9. [9]

    ASVspoof 2019: Future horizons in spoofed and fake audio detection,

    M. Todisco, X. Wang, V . Vestman, M. Sahidullah, H. Delgado, A. Nautsch, J. Yamagishi, N. Evans, T. Kinnunen, and K. A. Lee, “ASVspoof 2019: Future horizons in spoofed and fake audio detection,” inInterspeech, 2019

  10. [10]

    Subband Channel Selection using TEO for Replay Spoof Detection in V oice Assistants,

    H. Kotta, A. T. Patil, R. Acharya, and H. A. Patil, “Subband Channel Selection using TEO for Replay Spoof Detection in V oice Assistants,” inAsia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), 2020

  11. [11]

    Cross-Teager Energy Cepstral Coefficients for Replay Spoof Detection on V oice Assistants,

    R. Acharya, H. Kotta, A. T. Patil, and H. A. Patil, “Cross-Teager Energy Cepstral Coefficients for Replay Spoof Detection on V oice Assistants,” inIEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2021, pp. 6364–6368

  12. [12]

    Improving the potential of enhanced teager energy cepstral coefficients (ETECC) for replay attack detection,

    A. T. Patil, R. Acharya, H. A. Patil, and R. C. Guido, “Improving the potential of enhanced teager energy cepstral coefficients (ETECC) for replay attack detection,”Computer Speech & Language, vol. 72, p. 101281, 2022

  13. [13]

    Speech recognition with microphone arrays,

    M. Omologo, M. Matassoni, and P. Svaizer, “Speech recognition with microphone arrays,” inMicrophone arrays: signal processing techniques and applications. Springer, 2001, pp. 331–353

  14. [14]

    Impact of Microphone Array Mismatches to Learning-Based Replay Speech Detection,

    M. Neri and T. Virtanen, “Impact of Microphone Array Mismatches to Learning-Based Replay Speech Detection,” in2025 33rd European Signal Processing Conference (EUSIPCO), 2025

  15. [15]

    V oicelive: A phoneme localiza- tion based liveness detection for voice authentication on smartphones,

    L. Zhang, S. Tan, J. Yang, and Y . Chen, “V oicelive: A phoneme localiza- tion based liveness detection for voice authentication on smartphones,” inProceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security, 2016, pp. 1080–1091

  16. [16]

    Acoustic Simulation Framework for Multi- channel Replay Speech Detection,

    M. Neri and T. Virtanen, “Acoustic Simulation Framework for Multi- channel Replay Speech Detection,”arXiv preprint arXiv:2509.14789, 2025

  17. [17]

    Fast and Accurate Deep Network Learning by Exponential Linear Units (ELUs),

    C. Djork-Arn ´e, T. Unterthiner, and S. Hochreiter, “Fast and Accurate Deep Network Learning by Exponential Linear Units (ELUs),” in International Conference on Learning Representations (ICLR), 2016

  18. [18]

    Low-Complexity Attention-Based Unsupervised Anomalous Sound Detection Exploiting Separable Convolutions and Angular Loss,

    M. Neri and M. Carli, “Low-Complexity Attention-Based Unsupervised Anomalous Sound Detection Exploiting Separable Convolutions and Angular Loss,”IEEE Sensors Letters, vol. 8, no. 11, pp. 1–4, 2024

  19. [19]

    mixup: Beyond empirical risk minimization,

    H. Zhang, M. Cisse, Y . N. Dauphin, and D. Lopez-Paz, “mixup: Beyond empirical risk minimization,” inICLR, 2018