pith. sign in

arxiv: 2606.21210 · v2 · pith:ARBUIZYXnew · submitted 2026-06-19 · 💻 cs.CR

Impact Analysis of Speech Representation Learning Models for Acoustic Side-Channel Attack

Pith reviewed 2026-06-29 04:41 UTC · model grok-4.3

classification 💻 cs.CR
keywords acoustic side-channel attackspeech representation learningKolmogorov-Arnold Networkskeyboard eavesdroppingVoIP codecKEYAC datasetfine-tuning
0
0 comments X

The pith

Kolmogorov-Arnold Networks for fine-tuning speech models set new state-of-the-art on keyboard acoustic side-channel attacks even under VoIP codecs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents the KEYAC dataset to measure how well speech representation learning models transfer to acoustic side-channel attacks that recover keystrokes from keyboard sounds, including when the audio has been compressed by VoIP codecs. Experiments with six models show that partial fine-tuning raises accuracy over zero-shot use, yet performance still falls when codecs are present. The authors identify the cause as weak modeling of nonlinear feature interactions inside standard fine-tuning heads and replace those heads with Kolmogorov-Arnold Networks. The KAN version raises accuracy across settings and records the highest scores yet reported on KEYAC. Readers may care because the work quantifies how readily everyday keyboard audio can leak typed text even after it travels through compressed voice channels.

Core claim

On the KEYAC dataset, replacing the final classifier with Kolmogorov-Arnold Networks during partial fine-tuning of speech representation models yields higher keystroke inference accuracy than fully connected or convolutional heads in both standard and VoIP-codec conditions, establishing a new state-of-the-art.

What carries the argument

Kolmogorov-Arnold Networks (KAN) used as the fine-tuning head to capture nonlinear feature interactions that standard networks miss when adapting speech representations to acoustic side-channel key inference.

If this is right

  • Partial fine-tuning of speech models improves keystroke inference accuracy over zero-shot transfer on KEYAC.
  • Performance remains limited across VoIP codecs when only conventional fine-tuning layers are used.
  • KAN-based fine-tuning produces the highest reported accuracy on KEYAC for both clean and codec-degraded audio.
  • The method shows that nonlinear modeling directly improves robustness of side-channel attacks to audio compression.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same KAN replacement could be tested on other audio side-channel tasks that involve nonlinear distortions such as room reverberation or microphone frequency response.
  • KEYAC may become a standard benchmark for measuring how well audio foundation models resist codec or channel effects in security settings.
  • If KAN heads generalize, they might also raise performance when speech models are adapted to detect other subtle acoustic leaks such as printer or hard-drive sounds.

Load-bearing premise

The accuracy drop under VoIP codecs arises because conventional fine-tuning layers cannot adequately model nonlinear interactions among the extracted speech features.

What would settle it

Measure whether KAN fine-tuning still outperforms standard heads on a fresh keyboard acoustic dataset recorded through a different VoIP codec or with added background noise; if the gap closes or reverses, the central claim is falsified.

Figures

Figures reproduced from arXiv: 2606.21210 by Arun Balaji Budhuru, Nitin Choudhury, Orchid Chetia Phukan, Vikrant Vikram Pratap Maurya.

Figure 1
Figure 1. Figure 1: Downstream Architectures Considered for Evaluation activations for binary and multiclass classification tasks, respec￾tively. For the CNN-based downstream, we use two convolu￾tional blocks, each consisting of a convolutional layer with ker￾nel size 3 followed by a max-pooling layer with pool size 2. The output from the final convolution block is then passed to the same classification layer as used in the F… view at source ↗
read the original abstract

Acoustic side-channel attacks (ASCA) on keyboards have gained increasing attention, yet impact of speech representation learning models in ASCA remains unexplored. Addressing this, we introduce KEYAC, a dataset designed to analyze representation generalization for ASCA under both standard and VoIP codec settings. On KEYAC, we evaluate six representation learning models under zero-shot and partial fine-tuning settings using fully connected and convolutional networks. Results show that while partial fine-tuning improves performance, models struggle to generalize across VoIP codecs. We hypothesize this limitation stems from inadequate modeling of nonlinear feature interactions in conventional fine-tuning architectures. To address this, we employ Kolmogorov-Arnold Networks (KAN) for fine-tuning. Empirical results show that KAN-based fine-tuning consistently outperforms the baselines and establishes a new state-of-the-art on KEYAC.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The manuscript introduces the KEYAC dataset to study the impact of speech representation learning models on acoustic side-channel attacks (ASCA) under standard and VoIP codec conditions. It evaluates six models in zero-shot and partial fine-tuning settings using FC and CNN heads, observes that partial fine-tuning improves performance but generalization to VoIP codecs remains challenging, hypothesizes that this is due to insufficient modeling of nonlinear feature interactions in standard architectures, and proposes Kolmogorov-Arnold Networks (KAN) for fine-tuning, reporting that KAN consistently outperforms the baselines and achieves new state-of-the-art results on KEYAC.

Significance. If the empirical results hold under proper controls, the paper contributes a new benchmark dataset for ASCA research and provides evidence that KAN-based fine-tuning can improve performance in this domain. The work highlights challenges in generalizing across VoIP codecs, which is relevant for real-world attack scenarios. The introduction of KEYAC is a clear strength as a reproducible resource for the community.

major comments (1)
  1. [Abstract] The central claim that KAN-based fine-tuning outperforms baselines specifically because it better models nonlinear feature interactions (addressing the VoIP gap) is not isolated by the experiments. No parameter-matched capacity controls, no comparisons to other nonlinear heads (e.g., deeper MLPs or attention), and no post-hoc analysis of learned activations or interaction terms are described. This leaves the explanatory link between hypothesis and result insecure.
minor comments (1)
  1. [Abstract] The abstract states results and a hypothesis but supplies no experimental details, error bars, dataset statistics, or ablation studies, which weakens verifiability of the reported outperformance.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the constructive feedback. We address the major comment below and indicate planned revisions.

read point-by-point responses
  1. Referee: [Abstract] The central claim that KAN-based fine-tuning outperforms baselines specifically because it better models nonlinear feature interactions (addressing the VoIP gap) is not isolated by the experiments. No parameter-matched capacity controls, no comparisons to other nonlinear heads (e.g., deeper MLPs or attention), and no post-hoc analysis of learned activations or interaction terms are described. This leaves the explanatory link between hypothesis and result insecure.

    Authors: We agree that the experiments do not isolate the explanatory mechanism. The manuscript reports consistent gains from KAN fine-tuning over the FC and CNN heads across the six representation models and both codec conditions, but does not include parameter-matched capacity controls, comparisons against other nonlinear heads such as deeper MLPs or attention layers, or post-hoc inspection of learned activations or interaction terms. The hypothesis is presented as motivation for trying KAN rather than as a claim proven by controlled ablation. We will revise the abstract and discussion sections to qualify the language, presenting KAN results as an empirical improvement that addresses the observed generalization difficulty without asserting that the improvement is specifically due to superior nonlinear interaction modeling. We will also add an explicit limitations paragraph noting the absence of the suggested controls and listing them as future work. This is a partial revision. revision: partial

Circularity Check

0 steps flagged

No circularity: empirical claims rest on dataset experiments, not self-referential derivations

full rationale

The paper introduces KEYAC dataset and reports empirical comparisons of representation models under zero-shot and partial fine-tuning, plus KAN-based fine-tuning. No equations, fitted parameters renamed as predictions, or derivation chains appear. The hypothesis about nonlinear interactions motivates the KAN choice but is not used to derive results by construction; outcomes are measured directly on held-out data. No self-citations are load-bearing for any central claim. The work is self-contained against external benchmarks (KEYAC performance numbers).

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract supplies no information on free parameters, axioms, or invented entities; evaluation limited to abstract content only.

pith-pipeline@v0.9.1-grok · 5679 in / 972 out tokens · 36713 ms · 2026-06-29T04:41:28.884137+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

32 extracted references · 10 canonical work pages · 3 internal anchors

  1. [1]

    Impact Analysis of Speech Representation Learning Models for Acoustic Side-Channel Attack

    Introduction Keystroke sounds produced on a physical keyboard carry dis- tinctive acoustic signatures [1, 2, 3]. These signatures can be exploited to identify individual keystrokes, potentially compro- mising user security and privacy in real-world environments. Such attacks are commonly referred to as acoustic side-channel attacks (ASCA). In practice, us...

  2. [2]

    Pretrained Models We consider four self-supervised (SSL) speech PTM back- bones for evaluation, namely, Wav2Vec2, WavLM, HuBERT, and XLS-R

    Background 2.1. Pretrained Models We consider four self-supervised (SSL) speech PTM back- bones for evaluation, namely, Wav2Vec2, WavLM, HuBERT, and XLS-R. Wav2Vec2 learns contextualized speech represen- tations through a masked latent prediction framework opti- mized with a contrastive objective [7]. WavLM extends this framework by jointly learning acous...

  3. [3]

    Methodology 3.1. Dataset Existing Acoustic Side-Channel Attack (ASCA) datasets are of- ten limited by outdated hardware, small-scale collections, or evaluations restricted to basic keyboard generalization tasks. Moreover, prior datasets lack the diversity needed to study rep- resentation stability across modern telecommunication chan- nels. To address the...

  4. [4]

    The results are summarized in Table 1

    Results and Discussion We evaluate the models using accuracy and macro-F1 (mF1) to analyze their effectiveness across three generalization scenar- ios: keyboard generalization under standard recordings, key- board generalization under V oIP conditions, and codec gener- alization. The results are summarized in Table 1. Across all settings, the PTMs exhibit...

  5. [5]

    Conclusion In this work, we introducedKEYAC, a dataset designed to study acoustic side-channel attacks under realistic conditions, includ- ing cross-device variability and V oIP codec distortions. Using this dataset, we systematically evaluated multiple speech pre- trained models to examine their generalization capability for keystroke identification acro...

  6. [6]

    There is no AI contribution in ideation, methodological contribution, results, or content development

    Generative AI Use Disclosure Generative AI has been used only for linguistic editing and im- proving readability. There is no AI contribution in ideation, methodological contribution, results, or content development

  7. [7]

    Keyboard acoustic emanations,

    D. Asonov and R. Agrawal, “Keyboard acoustic emanations,” in Proceedings of the IEEE Symposium on Security and Privacy. IEEE, 2004, pp. 3–11

  8. [8]

    Keyboard acoustic emana- tions revisited,

    L. Zhuang, F. Zhou, and D. Tygar, “Keyboard acoustic emana- tions revisited,” inProceedings of the 2009 ACM Conference on Computer and Communications Security. ACM, 2009, pp. 373– 382

  9. [9]

    Decker: Domain-invariant embedding for cross-keyboard extrac- tion and recognition,

    B. B. P. Maurya, N. Choudhury, D. Agarwal, and A. B. Buduru, “Decker: Domain-invariant embedding for cross-keyboard extrac- tion and recognition,” inProceedings of the ACM Asia Conference on Computer and Communications Security, 2026, pp. 1707– 1720

  10. [10]

    A Practical Deep Learning-Based Acoustic Side Channel Attack on Keyboards,

    J. Harrison, E. Toreini, and M. Mehrnezhad, “A Practical Deep Learning-Based Acoustic Side Channel Attack on Keyboards,” in2023 IEEE European Symposium on Security and Privacy Workshops (EuroS&PW), Jul. 2023, pp. 270–280. [Online]. Available: http://arxiv.org/abs/2308.01074

  11. [11]

    Reflexnoop: Near-line-of-sight acoustic keystroke inference via laptop screen reflections,

    T. Maiet al., “Reflexnoop: Near-line-of-sight acoustic keystroke inference via laptop screen reflections,” ACM CCS / preprint, 2024, project / preprint (cite exact source when available)

  12. [12]

    Making acoustic side-channel attacks on noisy keyboards viable with llm-assisted spectrograms’

    S. A. Ayati, J. H. Park, Y . Cai, and M. Botacin, “Making acoustic side-channel attacks on noisy keyboards viable with llm-assisted spectrograms’” typo” correction,”arXiv preprint arXiv:2504.11622, 2025

  13. [13]

    wav2vec 2.0: A framework for self-supervised learning of speech representa- tions,

    A. Baevski, Y . Zhou, A. Mohamed, and M. Auli, “wav2vec 2.0: A framework for self-supervised learning of speech representa- tions,” inAdvances in Neural Information Processing Systems (NeurIPS), 2020

  14. [14]

    Hubert: Self-supervised speech representa- tion learning by masked prediction of hidden units,

    W.-N. Hsu, B. Bolte, Y .-H. Tsai, K. Lakhotia, R. Salakhutdinov, and A. Mohamed, “Hubert: Self-supervised speech representa- tion learning by masked prediction of hidden units,” inICASSP / arXiv, 2021

  15. [15]

    Wavlm: Large-scale self-supervised pre-training for full stack speech processing,

    S. Chenet al., “Wavlm: Large-scale self-supervised pre-training for full stack speech processing,”IEEE/ACM Transactions on Au- dio, Speech and Language Processing, 2022

  16. [16]

    Robust speech recognition via large-scale weak supervision,

    A. Radford, J. W. Kim, T. Xu, G. Brockman, C. McLeavey, and I. Sutskever, “Robust speech recognition via large-scale weak supervision,” inInternational conference on machine learning. PMLR, 2023, pp. 28 492–28 518

  17. [17]

    X-vectors: Robust dnn embeddings for speaker recognition,

    D. Snyder, D. Garcia-Romero, G. Sell, D. Povey, and S. Khudan- pur, “X-vectors: Robust dnn embeddings for speaker recognition,” in2018 IEEE international conference on acoustics, speech and signal processing (ICASSP). IEEE, 2018, pp. 5329–5333

  18. [18]

    Xls-r: Self- supervised cross-lingual speech representation learning at scale,

    A. Babu, C. Wang, A. Tjandra, K. Lakhotia, Q. Xu, N. Goyal, K. Singh, P. V on Platen, Y . Saraf, J. Pinoet al., “Xls-r: Self- supervised cross-lingual speech representation learning at scale,” arXiv preprint arXiv:2111.09296, 2021

  19. [19]

    Transforming the embeddings: A lightweight technique for speech emotion recog- nition tasks,

    O. C. Phukan, A. B. Buduru, and R. Sharma, “Transforming the embeddings: A lightweight technique for speech emotion recog- nition tasks,”arXiv preprint arXiv:2305.18640, 2023

  20. [20]

    KAN: Kolmogorov-Arnold Networks

    Z. Liu, Y . Wang, S. Vaidya, F. Ruehle, J. Halverson, M. Soljaˇci´c, T. Y . Hou, and M. Tegmark, “Kan: Kolmogorov-arnold net- works,”arXiv preprint arXiv:2404.19756, 2024

  21. [21]

    Lin, Andy T

    S.-w. Yang, P.-H. Chi, Y .-S. Chuang, C.-I. J. Lai, K. Lakho- tia, Y . Y . Lin, A. T. Liu, J. Shi, X. Chang, G.-T. Linet al., “Superb: Speech processing universal performance benchmark,” arXiv preprint arXiv:2105.01051, 2021

  22. [22]

    Slue: New benchmark tasks for spoken language un- derstanding evaluation on natural speech,

    S. Shon, A. Pasad, F. Wu, P. Brusco, Y . Artzi, K. Livescu, and K. J. Han, “Slue: New benchmark tasks for spoken language un- derstanding evaluation on natural speech,” inICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2022, pp. 7927–7931

  23. [23]

    Slue phase-2: A benchmark suite of diverse spoken language understanding tasks,

    S. Shon, S. Arora, C.-J. Lin, A. Pasad, F. Wu, R. Sharma, W.- L. Wu, H.-y. Lee, K. Livescu, and S. Watanabe, “Slue phase-2: A benchmark suite of diverse spoken language understanding tasks,” inProceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2023, pp. 8906–8937

  24. [24]

    Classification of vocal inten- sity category from speech using the wav2vec2 and whisper em- beddings,

    M. Kodali, S. Kadiri, and P. Alku, “Classification of vocal inten- sity category from speech using the wav2vec2 and whisper em- beddings,” inInterspeech. International Speech Communication Association (ISCA), 2023, pp. 4134–4138

  25. [25]

    A large-scale evaluation of speech foundation models,

    S.-w. Yang, H.-J. Chang, Z. Huang, A. T. Liu, C.-I. Lai, H. Wu, J. Shi, X. Chang, H.-S. Tsai, W.-C. Huanget al., “A large-scale evaluation of speech foundation models,”IEEE/ACM Transac- tions on Audio, Speech, and Language Processing, vol. 32, pp. 2884–2899, 2024

  26. [26]

    From raw speech to fixed representations: A comprehensive evaluation of speech em- bedding techniques,

    D. Porjazovski, T. Gr ´osz, and M. Kurimo, “From raw speech to fixed representations: A comprehensive evaluation of speech em- bedding techniques,”IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 32, pp. 3546–3560, 2024

  27. [27]

    On the use of x-vectors for robust speaker recognition

    O. Novotn `y, O. Plchot, P. Matejka, L. Mosner, and O. Glem- bek, “On the use of x-vectors for robust speaker recognition.” in Odyssey, 2018, pp. 168–175

  28. [28]

    Loki: Low-dimensional kan for ef- ficient fine-tuning image models,

    X. Cai, R. Pan, and H. Yang, “Loki: Low-dimensional kan for ef- ficient fine-tuning image models,” inProceedings of the Computer Vision and Pattern Recognition Conference, 2025, pp. 14 869– 14 880

  29. [29]

    From kan to gr-kan: Advancing speech enhancement with kan- based methodology,

    H. Li, Y . Hu, C. Chen, S. M. Siniscalchi, S. Liu, and E. S. Chng, “From kan to gr-kan: Advancing speech enhancement with kan- based methodology,”arXiv preprint arXiv:2412.17778, 2024

  30. [30]

    Rethinking cross-corpus speech emo- tion recognition benchmarking: Are paralinguistic pre-trained representations sufficient?

    O. C. Phukan, M. M. Akhtar, S. R. Behera, P. Bhagath, P. B. Reddy, A. B. Buduruet al., “Rethinking cross-corpus speech emo- tion recognition benchmarking: Are paralinguistic pre-trained representations sufficient?” in2025 Asia Pacific Signal and In- formation Processing Association Annual Summit and Conference (APSIPA ASC). IEEE, 2025, pp. 1022–1027

  31. [31]

    Foca: Multimodal malware classification via hyperbolic cross- attention,

    N. Choudhury, B. B. P. Maurya, O. C. Phukan, and A. B. Buduru, “Foca: Multimodal malware classification via hyperbolic cross- attention,”arXiv preprint arXiv:2601.17638, 2026

  32. [32]

    GoCoMA: Hyperbolic Multimodal Representation Fusion for Large Language Model-Generated Code Attribution

    N. Choudhury, B. B. P. Maurya, B. V . Kuwar, and A. B. Buduru, “Gocoma: Hyperbolic multimodal representation fu- sion for large language model-generated code attribution,”arXiv preprint arXiv:2604.16377, 2026