pith. machine review for the scientific record.

arxiv: 2605.03384 · v1 · submitted 2026-05-05 · 💻 cs.CR · cs.SD

Recognition: unknown

DECKER: Domain-invariant Embedding for Cross-Keyboard Extraction and Recognition

Authors on Pith: no claims yet

Pith reviewed 2026-05-07 15:49 UTC · model grok-4.3

classification 💻 cs.CR cs.SD
keywords acoustic side-channel attacks · keystroke recognition · domain adaptation · adversarial disentanglement · cross-keyboard generalization · HEAR dataset · language model rectification

The pith

A four-stage embedding pipeline identifies keystrokes across different keyboards and users by removing device-specific acoustic features.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces the HEAR dataset containing recordings from 53 users typing on 37 laptop keyboards across external microphones, device microphones, and VoIP streams. It proposes DECKER to turn raw typing audio into domain-invariant embeddings through four processing stages that strip away keyboard-specific sound coloration. This setup improves recognition accuracy when the keyboard or the person typing changes, and an added language-model step corrects likely errors in the resulting key sequences. The work shows that acoustic side-channel attacks can still extract sensitive typing information even under realistic variation in hardware and conditions.
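
Where the language-model step slots in can be made concrete. The sketch below is a deliberately simplified stand-in, not the paper's method: it re-ranks per-keystroke acoustic guesses with a character-bigram prior rather than an LLM, and the corpus string, candidate lists, and scores are all illustrative.

```python
# Toy stand-in for sequence-level rectification: re-rank per-keystroke acoustic
# candidates with a character-bigram language prior. The paper uses an LLM for
# this step; the corpus, scores, and candidate lists here are illustrative only.
import math
from collections import Counter
from itertools import product

CORPUS = "the password is hidden in the keyboard acoustics"  # illustrative text

def bigram_logprob(text, counts, totals, alpha=1.0, vocab=27):
    """Add-one-smoothed character-bigram log probability of a string."""
    score = 0.0
    for a, b in zip(text, text[1:]):
        score += math.log((counts[(a, b)] + alpha) / (totals[a] + alpha * vocab))
    return score

def rectify(per_key_candidates, counts, totals):
    """per_key_candidates: one list of (char, acoustic_logprob) pairs per keystroke.
    Returns the full sequence with the best combined acoustic + language score."""
    best, best_score = None, -math.inf
    for combo in product(*per_key_candidates):
        text = "".join(c for c, _ in combo)
        score = sum(lp for _, lp in combo) + bigram_logprob(text, counts, totals)
        if score > best_score:
            best, best_score = text, score
    return best

counts = Counter(zip(CORPUS, CORPUS[1:]))
totals = Counter(CORPUS[:-1])
# Each keystroke arrives with its top acoustic guesses; the language prior
# resolves confusions such as "o" versus "0".
noisy = [[("p", -0.1)], [("a", -0.2), ("q", -0.3)], [("s", -0.1)], [("s", -0.1)],
         [("w", -0.2), ("e", -0.25)], [("o", -0.3), ("0", -0.2)], [("r", -0.1)], [("d", -0.1)]]
print(rectify(noisy, counts, totals))  # -> "password"
```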

Core claim

DECKER applies keyboard signature normalization to reduce device coloration, domain-adversarial disentanglement to suppress keyboard identity, supervised cross-keyboard contrastive alignment to keep key identity consistent, and acoustic style randomization to handle unseen keyboards. When evaluated on the HEAR benchmark, this produces better keystroke identification than conventional features or pre-trained audio representations, especially in cross-keyboard and cross-user cases, with further gains from LLM-based rectification of full sequences using linguistic context.
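
A minimal sketch of how stages (2) and (3) could be wired together, assuming a gradient-reversal domain classifier in the style of Ganin and Lempitsky and a standard supervised contrastive loss; the encoder is a placeholder for ECAPA-TDNN, and lambda_adv, tau, and all dimensions are guesses rather than the paper's settings.

```python
# Hedged sketch of stages (2) and (3): a gradient-reversal keyboard classifier
# plus a supervised contrastive loss over key labels. Not the paper's code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; flips the gradient sign in the backward pass."""
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None

def supcon_loss(embeddings, labels, tau=0.1):
    """Supervised contrastive loss: pull same-key embeddings together across
    keyboards, push different keys apart."""
    z = F.normalize(embeddings, dim=1)
    sim = (z @ z.t()) / tau
    sim.fill_diagonal_(-1e9)                                  # exclude self-pairs
    pos = (labels.unsqueeze(0) == labels.unsqueeze(1)).float()
    pos.fill_diagonal_(0.0)
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    return -(log_prob * pos).sum(1).div(pos.sum(1).clamp(min=1)).mean()

class KeystrokeModel(nn.Module):
    def __init__(self, feat_dim=192, n_keys=26, n_keyboards=37):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(40, 256), nn.ReLU(),
                                     nn.Linear(256, feat_dim))   # stand-in for ECAPA-TDNN
        self.key_head = nn.Linear(feat_dim, n_keys)              # keystroke identity
        self.kbd_head = nn.Linear(feat_dim, n_keyboards)         # keyboard identity (adversary)

    def forward(self, x, lambda_adv=0.3):
        z = self.encoder(x)
        return z, self.key_head(z), self.kbd_head(GradReverse.apply(z, lambda_adv))

# One hypothetical training step on a batch of (features, key label, keyboard label).
model = KeystrokeModel()
x, key_y, kbd_y = torch.randn(64, 40), torch.randint(0, 26, (64,)), torch.randint(0, 37, (64,))
z, key_logits, kbd_logits = model(x)
loss = (F.cross_entropy(key_logits, key_y)
        + F.cross_entropy(kbd_logits, kbd_y)   # gradient reversal pushes z toward keyboard-invariance
        + supcon_loss(z, key_y))
loss.backward()
```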

What carries the argument

DECKER's four-stage pipeline of normalization, adversarial disentanglement, contrastive alignment, and style randomization, which isolates keystroke identity from keyboard-specific acoustic coloration.
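
The normalization stage is the most self-contained of the four. The paper's exact operation is not quoted here; the sketch below assumes a CMVN-style per-keyboard mean and variance normalization of log-mel features as one plausible way to "reduce device coloration", with all shapes and names made up for illustration.

```python
# Hedged sketch of Keyboard Signature Normalization as per-keyboard CMVN:
# subtract each keyboard's long-term spectral mean and divide by its std,
# removing the device's average coloration while keeping per-keystroke detail.
import numpy as np

def ksn_normalize(logmels_by_keyboard):
    """logmels_by_keyboard: dict keyboard_id -> array [n_keystrokes, n_frames, n_mels]."""
    out = {}
    for kbd, feats in logmels_by_keyboard.items():
        mu = feats.mean(axis=(0, 1), keepdims=True)          # keyboard-wide spectral signature
        sigma = feats.std(axis=(0, 1), keepdims=True) + 1e-6
        out[kbd] = (feats - mu) / sigma
    return out

# Toy check: the same keystrokes recorded through two different fixed spectral
# colorations; KSN removes most of the between-keyboard offset.
rng = np.random.default_rng(0)
keystrokes = rng.normal(size=(100, 20, 40))
data = {"kbd_a": keystrokes + rng.normal(size=(1, 1, 40)),
        "kbd_b": keystrokes + rng.normal(size=(1, 1, 40))}
norm = ksn_normalize(data)
gap_before = np.abs(data["kbd_a"].mean((0, 1)) - data["kbd_b"].mean((0, 1))).mean()
gap_after = np.abs(norm["kbd_a"].mean((0, 1)) - norm["kbd_b"].mean((0, 1))).mean()
print(f"mean spectral gap before KSN: {gap_before:.3f}, after KSN: {gap_after:.3f}")
```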

If this is right

  • Keystroke identification improves over strong baselines in cross-keyboard and cross-user conditions on the HEAR dataset.
  • LLM-based sentence rectification adds measurable gains by correcting sequences with linguistic context.
  • The approach works across the three capture settings of external microphones, device microphones, and VoIP streaming.
  • Acoustic side-channel attacks remain effective even when users, keyboards, and noise conditions vary widely.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same separation of identity from device signature could be tested on other acoustic emanations such as distinguishing different mechanical switches or recognizing spoken words despite microphone variation.
  • Keyboard makers might adopt similar randomization during manufacturing to reduce unique acoustic fingerprints.
  • Extending the style randomization stage to generate synthetic data for keyboards outside the original 37 would test broader generalization.

Load-bearing premise

The adversarial disentanglement and style randomization steps can remove keyboard-specific sound features while still keeping the acoustic details that distinguish one key from another.
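
One way to put this premise under direct test, echoing the referee's first major comment below: fit linear probes on frozen embeddings and compare a keyboard-identity probe against the 1/37 (about 2.7%) chance level while checking that a key-identity probe stays high. The arrays below are placeholders for exported DECKER embeddings and labels.

```python
# Hedged verification sketch: linear probes on frozen embeddings. The embedding
# source, shapes, and label counts are assumptions; real exported features and
# labels would replace the placeholder arrays.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def probe_accuracy(embeddings, labels, seed=0):
    """Fit a linear probe on half the embeddings, report accuracy on the rest."""
    x_tr, x_te, y_tr, y_te = train_test_split(
        embeddings, labels, test_size=0.5, random_state=seed, stratify=labels)
    clf = LogisticRegression(max_iter=2000).fit(x_tr, y_tr)
    return clf.score(x_te, y_te)

# Placeholder data standing in for exported embeddings, key labels, keyboard labels.
rng = np.random.default_rng(0)
z = rng.normal(size=(2000, 192))
key_labels = rng.integers(0, 26, size=2000)
keyboard_labels = rng.integers(0, 37, size=2000)

key_acc = probe_accuracy(z, key_labels)
kbd_acc = probe_accuracy(z, keyboard_labels)
print(f"key probe: {key_acc:.3f} (want high), keyboard probe: {kbd_acc:.3f} "
      f"(want near chance 1/37 = {1/37:.3f})")
```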

What would settle it

Apply the trained DECKER model to keystroke recordings from a laptop keyboard model and microphone setup entirely absent from the 37 keyboards in HEAR, then measure whether identification accuracy remains clearly above that of the paper's conventional-feature and pre-trained-representation baselines.
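
A leave-one-keyboard-out protocol over the 37 HEAR keyboards is the nearest in-dataset proxy for that test (the stronger version uses hardware outside HEAR entirely). The sketch below shows only the protocol scaffolding; train_fn and eval_fn are hypothetical placeholders for the actual training and scoring code.

```python
# Hedged sketch of a leave-one-keyboard-out evaluation loop. Only the protocol
# structure is real; dataset access, training, and scoring are stubbed out.
import numpy as np

def leave_one_keyboard_out(dataset, keyboard_ids, train_fn, eval_fn):
    """Return per-held-out-keyboard accuracy for a model trained on the rest."""
    results = {}
    for held_out in keyboard_ids:
        train_kbds = [k for k in keyboard_ids if k != held_out]
        model = train_fn(dataset, train_kbds)
        results[held_out] = eval_fn(model, dataset, held_out)
    return results

# Toy stand-ins so the protocol runs end to end.
keyboards = [f"kbd_{i:02d}" for i in range(37)]
dummy_dataset = {k: None for k in keyboards}
rng = np.random.default_rng(0)
train_fn = lambda data, kbds: {"n_train_keyboards": len(kbds)}
eval_fn = lambda model, data, kbd: float(rng.uniform(0.3, 0.8))   # placeholder accuracy

scores = leave_one_keyboard_out(dummy_dataset, keyboards, train_fn, eval_fn)
print(f"mean held-out accuracy: {np.mean(list(scores.values())):.3f} over {len(scores)} keyboards")
```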

Figures

Figures reproduced from arXiv: 2605.03384 by Arun Balaji Buduru, Bikrant Bikram Pratap Maurya, Daksh Agarwal, Nitin Choudhury.

Figure 1: Threat scenario. Bob types on a laptop in a pub …
Figure 2: DECKER pipeline. (1) Raw keystrokes are normalized with KSN, (2) augmented using ASR, (3) encoded with ECAPA …
Figure 3: KSN suppresses keyboard-specific spectral coloration. The device-colored sample shows strong resonance bands that …
Figure 4: t-SNE visualization of ECAPA-TDNN embeddings. Left: Without KSN, embeddings are dominated by keyboard-specific clustering. Right: With KSN enabled, keyboard-dependent …
read the original abstract

Acoustic side-channel attacks (ASCA) on keyboards pose a significant security risk, as keystrokes can be inferred from typing acoustics, revealing sensitive information. Prior ASCA studies are limited by small-scale datasets with restricted diversity in users, keyboards, and environments, constraining analysis across devices, microphones, and noise conditions. We introduce HEAR, a dataset designed to study ASCA along three axes: keyboard generalization, noise adaptation, and user bias. HEAR contains recordings from 53 participants using 37 laptop keyboards, collected in three realistic settings: (1) external microphone capture, (2) device microphone capture without network noise, and (3) VoIP-based streaming capture. This enables controlled evaluation across users, keyboards, and environments. On HEAR, we establish an ASCA benchmark spanning conventional features and pre-trained representations from raw audio and spectrograms in unimodal and multimodal settings. We propose DECKER, a domain-invariant keystroke inference framework with four stages: (1) Keyboard Signature Normalization to reduce device coloration, (2) domain-adversarial disentanglement to suppress keyboard identity, (3) supervised cross-keyboard contrastive alignment to enforce key consistency, and (4) Acoustic Style Randomization to synthesize unseen keyboard responses. We further explore sentence-level inference using an LLM-based post-processing layer to refine keystroke sequences via linguistic context. Results on HEAR show DECKER improves keystroke identification over strong baselines, particularly in cross-keyboard and cross-user settings, with further gains from language-model rectification. These findings highlight that ASCA remains effective across diverse users, devices, and noisy environments, underscoring its practical security risk.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces the HEAR dataset of acoustic keystroke recordings from 53 participants across 37 laptop keyboards in three realistic capture settings (external microphone, device microphone, and VoIP streaming). It proposes DECKER, a four-stage domain-invariant embedding framework consisting of keyboard signature normalization, domain-adversarial disentanglement, supervised cross-keyboard contrastive alignment, and acoustic style randomization, augmented by an LLM-based post-processing layer for sequence rectification. The central claim is that DECKER yields improved keystroke identification over baselines on HEAR, particularly in cross-keyboard and cross-user settings.

Significance. If the reported gains are substantiated by quantitative metrics and the domain-invariance mechanism is verified, the work would establish a valuable large-scale benchmark for acoustic side-channel attacks and demonstrate that such attacks remain practical across diverse hardware and environments. The multi-stage pipeline and dataset release would strengthen the contribution to both security analysis and domain-adaptation methods in audio.

major comments (3)
  1. [§4.2] Domain-adversarial disentanglement stage: the manuscript does not report the final keyboard-classification accuracy of a discriminator applied to the learned embeddings. Without evidence that this accuracy approaches the random baseline of approximately 2.7% for 37 classes, the claim that keyboard identity has been suppressed cannot be confirmed, leaving open the possibility that cross-keyboard gains arise from dataset correlations rather than the intended invariance.
  2. [§5] Experimental results on HEAR: the central performance claims rest on improvements over baselines in cross-keyboard and cross-user settings, yet no ablation table isolating the contribution of each of the four stages, no definition of the strong baselines, and no quantitative metrics (e.g., accuracy deltas) are referenced in the evaluation. This absence prevents attribution of gains specifically to domain invariance versus the LLM rectification layer.
  3. [§3.4] Acoustic style randomization: the stage is described as synthesizing unseen keyboard responses, but no implementation details, loss formulation, or validation that the randomization preserves key identity while varying device coloration are supplied. If the randomization inadvertently removes discriminative acoustic features, the contrastive alignment objective would be undermined.
minor comments (2)
  1. [Abstract] Performance gains are asserted without any numerical values or baseline names; adding a single sentence summarizing key metrics would improve clarity.
  2. [Method] Notation: the four stages are referred to inconsistently between the abstract and method description; a single numbered list or diagram would aid readability.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the thoughtful and detailed review. We address each major comment below with clarifications and commit to revisions that directly strengthen the verification of our claims.

read point-by-point responses
  1. Referee: [§4.2] Domain-adversarial disentanglement stage: the manuscript does not report the final keyboard-classification accuracy of a discriminator applied to the learned embeddings. Without evidence that this accuracy approaches the random baseline of approximately 2.7% for 37 classes, the claim that keyboard identity has been suppressed cannot be confirmed, leaving open the possibility that cross-keyboard gains arise from dataset correlations rather than the intended invariance.

    Authors: We agree that reporting the keyboard-classification accuracy of the discriminator on the final embeddings is necessary to rigorously confirm successful disentanglement. Although the manuscript emphasizes downstream keystroke identification, we will add this metric in the revised version, showing that accuracy approaches the random baseline of 1/37 ≈ 2.7% and thereby substantiating that keyboard identity has been suppressed rather than relying on dataset correlations. revision: yes

  2. Referee: [§5] Experimental results on HEAR: the central performance claims rest on improvements over baselines in cross-keyboard and cross-user settings, yet no ablation table isolating the contribution of each of the four stages, no definition of the strong baselines, and no quantitative metrics (e.g., accuracy deltas) are referenced in the evaluation. This absence prevents attribution of gains specifically to domain invariance versus the LLM rectification layer.

    Authors: We acknowledge that the current presentation lacks sufficient granularity for attributing gains. In the revision we will insert a dedicated ablation table quantifying the incremental contribution of each of the four DECKER stages, explicitly define the strong baselines (including architectures, training protocols, and feature types), and report concrete accuracy values together with deltas for cross-keyboard and cross-user settings as well as the additional improvement from the LLM layer. revision: yes

  3. Referee: [§3.4] Acoustic style randomization: the stage is described as synthesizing unseen keyboard responses, but no implementation details, loss formulation, or validation that the randomization preserves key identity while varying device coloration are supplied. If the randomization inadvertently removes discriminative acoustic features, the contrastive alignment objective would be undermined.

    Authors: We will expand Section 3.4 with the precise implementation details and loss formulation of the acoustic style randomization. We will also include validation experiments (e.g., key-classification accuracy before versus after randomization) demonstrating that key-discriminative information is retained while device coloration is varied, ensuring the subsequent contrastive alignment stage remains effective. revision: yes
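
For the validation the authors commit to in point 3, one plausible shape of the randomization and its sanity check looks like the sketch below: perturb a keystroke's log-mel features with a random smooth spectral gain curve (a stand-in for an unseen keyboard's coloration), then confirm a key classifier's predictions survive the perturbation. The filter form and strength are illustrative, not the paper's formulation.

```python
# Hedged sketch of Acoustic Style Randomization as a random smooth spectral
# coloration applied to log-mel features, plus the before/after check the
# rebuttal proposes. The curve shape and strength are illustrative guesses.
import numpy as np

def random_coloration(logmel, strength=3.0, n_control=6, rng=None):
    """logmel: [n_frames, n_mels]. Add a random smooth gain curve across mel
    bins, interpolated from a few control points (broadcast over frames)."""
    rng = np.random.default_rng() if rng is None else rng
    n_mels = logmel.shape[1]
    control = rng.uniform(-strength, strength, size=n_control)
    curve = np.interp(np.linspace(0, n_control - 1, n_mels),
                      np.arange(n_control), control)
    return logmel + curve

# Validation idea: a key classifier's accuracy on randomized features should
# stay close to its accuracy on the originals if key identity is preserved.
rng = np.random.default_rng(0)
keystroke = rng.normal(size=(20, 40))
augmented = random_coloration(keystroke, rng=rng)
print(f"mean absolute per-bin change: {np.abs(augmented - keystroke).mean():.3f}")
```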

Circularity Check

0 steps flagged

No circularity: standard domain-adversarial pipeline evaluated on new dataset

full rationale

The paper introduces the HEAR dataset and applies a four-stage ML pipeline (normalization, adversarial disentanglement, contrastive alignment, style randomization) plus optional LLM post-processing. No equations, fitted parameters, or self-citations are presented as load-bearing derivations that reduce the reported cross-keyboard gains to quantities defined solely by the method's own inputs. The central claims rest on empirical benchmark results rather than any self-referential construction or renaming of known patterns. This is the expected non-finding for an applied ML security paper whose improvements are measured externally against baselines on held-out data.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The framework rests on the domain assumption that keyboard identity and keystroke identity are separable in the acoustic signal; no new physical entities are postulated and free parameters are the usual ML hyperparameters whose values are not reported in the abstract.

free parameters (1)
  • adversarial loss weight and contrastive temperature
    Control the strength of domain disentanglement and key alignment; their specific values are not stated in the abstract.
axioms (1)
  • domain assumption: Acoustic features contain separable components for keyboard identity and individual keystroke identity.
    Invoked by the design of domain-adversarial disentanglement and cross-keyboard contrastive alignment stages.
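
Read as an equation, a plausible (assumed, not quoted) form of the combined objective shows where the two free parameters sit: λ_adv scales the gradient-reversed keyboard term and τ is the temperature inside the supervised contrastive term.

```latex
% Assumed overall objective; the paper's exact weighting and notation may differ.
\mathcal{L}
  = \mathcal{L}_{\mathrm{key}}
  + \lambda_{\mathrm{adv}}\,\mathcal{L}_{\mathrm{kbd}}
  + \mathcal{L}_{\mathrm{SupCon}},
\qquad
\mathcal{L}_{\mathrm{SupCon}}
  = \sum_{i} \frac{-1}{|P(i)|} \sum_{p \in P(i)}
    \log \frac{\exp(z_i \cdot z_p / \tau)}{\sum_{a \neq i} \exp(z_i \cdot z_a / \tau)}
```

Here P(i) is the set of in-batch samples sharing sample i's key label and the z are normalized embeddings, following the standard supervised contrastive formulation.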

pith-pipeline@v0.9.0 · 5616 in / 1424 out tokens · 59046 ms · 2026-05-07T15:49:54.361151+00:00 · methodology

discussion (0)

