DECKER: Domain-invariant Embedding for Cross-Keyboard Extraction and Recognition
Pith reviewed 2026-05-07 15:49 UTC · model grok-4.3
The pith
A four-stage embedding pipeline identifies keystrokes across different keyboards and users by removing device-specific acoustic features.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
DECKER applies keyboard signature normalization to reduce device coloration, domain-adversarial disentanglement to suppress keyboard identity, supervised cross-keyboard contrastive alignment to keep key identity consistent, and acoustic style randomization to handle unseen keyboards. When evaluated on the HEAR benchmark, this produces better keystroke identification than conventional features or pre-trained audio representations, especially in cross-keyboard and cross-user cases, with further gains from LLM-based rectification of full sequences using linguistic context.
What carries the argument
DECKER's four-stage pipeline of normalization, adversarial disentanglement, contrastive alignment, and style randomization, which isolates keystroke identity from keyboard-specific acoustic coloration.
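The disentanglement stage is the load-bearing mechanism here. A minimal NumPy sketch of the standard gradient-reversal trick (Ganin & Lempitsky, 2015) that domain-adversarial pipelines of this kind typically build on; the class name, shapes, and values are illustrative, not taken from the paper:

```python
import numpy as np

class GradientReversal:
    """Identity in the forward pass; multiplies gradients by -lam in the
    backward pass, so the feature extractor is trained to fool the
    keyboard discriminator while the discriminator itself learns normally."""

    def __init__(self, lam=1.0):
        self.lam = lam

    def forward(self, x):
        # Embeddings pass through unchanged on the way to the discriminator.
        return x

    def backward(self, grad_from_discriminator):
        # The extractor receives the negated, scaled gradient, which pushes
        # its embeddings toward keyboard-invariance.
        return -self.lam * grad_from_discriminator

grl = GradientReversal(lam=0.5)
emb = np.array([0.2, -1.3, 0.7])    # toy keystroke embedding
grad = np.array([1.0, 1.0, -2.0])   # toy gradient from the discriminator
```

The forward pass leaves `emb` untouched, while the backward pass returns `[-0.5, -0.5, 1.0]`: the sign flip is what turns the discriminator's learning signal into an invariance pressure on the extractor.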
If this is right
- Keystroke identification improves over strong baselines in cross-keyboard and cross-user conditions on the HEAR dataset.
- LLM-based sentence rectification adds measurable gains by correcting sequences with linguistic context.
- The approach works across all three capture settings: external microphone, device microphone, and VoIP streaming.
- Acoustic side-channel attacks remain effective even when users, keyboards, and noise conditions vary widely.
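The sentence-rectification idea in the second bullet can be made concrete with a toy stand-in: instead of an LLM, snap each decoded word to the closest entry of a small vocabulary. `difflib` here is only an illustrative proxy for linguistic context, and the vocabulary is invented:

```python
import difflib

def rectify(decoded, vocabulary):
    """Toy stand-in for LLM-based rectification: replace each decoded word
    with its closest vocabulary entry by string similarity."""
    out = []
    for word in decoded.split():
        match = difflib.get_close_matches(word, vocabulary, n=1, cutoff=0.0)
        out.append(match[0] if match else word)
    return " ".join(out)

vocab = ["the", "password", "is", "secret"]
print(rectify("thw pazsword is secrwt", vocab))  # → "the password is secret"
```

An actual LLM layer would condition on the whole sequence rather than word-by-word similarity, which is why it can also fix errors that no per-word match resolves.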
Where Pith is reading between the lines
- The same separation of identity from device signature could be tested on other acoustic emanations such as distinguishing different mechanical switches or recognizing spoken words despite microphone variation.
- Keyboard makers might adopt similar randomization during manufacturing to reduce unique acoustic fingerprints.
- Extending the style randomization stage to generate synthetic data for keyboards outside the original 37 would test broader generalization.
Load-bearing premise
The adversarial disentanglement and style randomization steps can remove keyboard-specific sound features while still keeping the acoustic details that distinguish one key from another.
What would settle it
Apply the trained DECKER model to keystroke recordings from a laptop keyboard model and microphone setup completely absent from the 37 keyboards in HEAR, then measure whether identification accuracy stays clearly higher than the paper's baseline feature and pre-trained representation methods.
Original abstract
Acoustic side-channel attacks (ASCA) on keyboards pose a significant security risk, as keystrokes can be inferred from typing acoustics, revealing sensitive information. Prior ASCA studies are limited by small-scale datasets with restricted diversity in users, keyboards, and environments, constraining analysis across devices, microphones, and noise conditions. We introduce HEAR, a dataset designed to study ASCA along three axes: keyboard generalization, noise adaptation, and user bias. HEAR contains recordings from 53 participants using 37 laptop keyboards, collected in three realistic settings: (1) external microphone capture, (2) device microphone capture without network noise, and (3) VoIP-based streaming capture. This enables controlled evaluation across users, keyboards, and environments. On HEAR, we establish an ASCA benchmark spanning conventional features and pre-trained representations from raw audio and spectrograms in unimodal and multimodal settings. We propose DECKER, a domain-invariant keystroke inference framework with four stages: (1) Keyboard Signature Normalization to reduce device coloration, (2) domain-adversarial disentanglement to suppress keyboard identity, (3) supervised cross-keyboard contrastive alignment to enforce key consistency, and (4) Acoustic Style Randomization to synthesize unseen keyboard responses. We further explore sentence-level inference using an LLM-based post-processing layer to refine keystroke sequences via linguistic context. Results on HEAR show DECKER improves keystroke identification over strong baselines, particularly in cross-keyboard and cross-user settings, with further gains from language-model rectification. These findings highlight that ASCA remains effective across diverse users, devices, and noisy environments, underscoring its practical security risk.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces the HEAR dataset of acoustic keystroke recordings from 53 participants across 37 laptop keyboards in three realistic capture settings (external microphone, device microphone, and VoIP streaming). It proposes DECKER, a four-stage domain-invariant embedding framework consisting of keyboard signature normalization, domain-adversarial disentanglement, supervised cross-keyboard contrastive alignment, and acoustic style randomization, augmented by an LLM-based post-processing layer for sequence rectification. The central claim is that DECKER yields improved keystroke identification over baselines on HEAR, particularly in cross-keyboard and cross-user settings.
Significance. If the reported gains are substantiated by quantitative metrics and the domain-invariance mechanism is verified, the work would establish a valuable large-scale benchmark for acoustic side-channel attacks and demonstrate that such attacks remain practical across diverse hardware and environments. The multi-stage pipeline and dataset release would strengthen the contribution to both security analysis and domain-adaptation methods in audio.
major comments (3)
- [§4.2] (domain-adversarial disentanglement stage): the manuscript does not report the final keyboard-classification accuracy of a discriminator applied to the learned embeddings. Without evidence that this accuracy approaches the random baseline of approximately 2.7% for 37 classes, the claim that keyboard identity has been suppressed cannot be confirmed, leaving open the possibility that cross-keyboard gains arise from dataset correlations rather than the intended invariance.
- [§5] (experimental results on HEAR): the central performance claims rest on improvements over baselines in cross-keyboard and cross-user settings, yet no ablation table isolating the contribution of each of the four stages, no definition of the strong baselines, and no quantitative metrics (e.g., accuracy deltas) are referenced in the evaluation. This absence prevents attribution of gains specifically to domain invariance versus the LLM rectification layer.
- [§3.4] (acoustic style randomization): the stage is described as synthesizing unseen keyboard responses, but no implementation details, loss formulation, or validation that the randomization preserves key identity while varying device coloration are supplied. If the randomization inadvertently removes discriminative acoustic features, the contrastive alignment objective would be undermined.
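The chance-level criterion in the first comment is easy to make operational. A hedged sketch with synthetic labels and predictions (random predictions stand in for a fully fooled discriminator; none of this is the paper's data):

```python
import numpy as np

rng = np.random.default_rng(0)
n_keyboards, n_samples = 37, 5000
chance = 1.0 / n_keyboards  # ≈ 0.027 for 37 keyboards

# Hypothetical: true keyboard labels vs. a probe discriminator's predictions
# on the final embeddings of a held-out set.
true_kb = rng.integers(0, n_keyboards, size=n_samples)
pred_kb = rng.integers(0, n_keyboards, size=n_samples)

acc = float(np.mean(pred_kb == true_kb))
# Suppression is supported when acc lies within sampling noise of chance.
se = np.sqrt(chance * (1 - chance) / n_samples)
suppressed = abs(acc - chance) < 3 * se
```

Reporting `acc` alongside `chance` and a confidence band of this kind would directly substantiate the disentanglement claim.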
minor comments (2)
- [Abstract] Performance gains are asserted without any numerical values or baseline names; adding a single sentence summarizing key metrics would improve clarity.
- [Method] Notation: the four stages are referred to inconsistently between the abstract and the method description; a single numbered list or diagram would aid readability.
Simulated Author's Rebuttal
We thank the referee for the thoughtful and detailed review. We address each major comment below with clarifications and commit to revisions that directly strengthen the verification of our claims.
Point-by-point responses
Referee: [§4.2] (domain-adversarial disentanglement stage): the manuscript does not report the final keyboard-classification accuracy of a discriminator applied to the learned embeddings. Without evidence that this accuracy approaches the random baseline of approximately 2.7% for 37 classes, the claim that keyboard identity has been suppressed cannot be confirmed, leaving open the possibility that cross-keyboard gains arise from dataset correlations rather than the intended invariance.
Authors: We agree that reporting the keyboard-classification accuracy of the discriminator on the final embeddings is necessary to rigorously confirm successful disentanglement. Although the manuscript emphasizes downstream keystroke identification, we will add this metric in the revised version, showing that accuracy approaches the random baseline of 1/37 ≈ 2.7% and thereby substantiating that keyboard identity has been suppressed rather than relying on dataset correlations. revision: yes
Referee: [§5] (experimental results on HEAR): the central performance claims rest on improvements over baselines in cross-keyboard and cross-user settings, yet no ablation table isolating the contribution of each of the four stages, no definition of the strong baselines, and no quantitative metrics (e.g., accuracy deltas) are referenced in the evaluation. This absence prevents attribution of gains specifically to domain invariance versus the LLM rectification layer.
Authors: We acknowledge that the current presentation lacks sufficient granularity for attributing gains. In the revision we will insert a dedicated ablation table quantifying the incremental contribution of each of the four DECKER stages, explicitly define the strong baselines (including architectures, training protocols, and feature types), and report concrete accuracy values together with deltas for cross-keyboard and cross-user settings as well as the additional improvement from the LLM layer. revision: yes
Referee: [§3.4] (acoustic style randomization): the stage is described as synthesizing unseen keyboard responses, but no implementation details, loss formulation, or validation that the randomization preserves key identity while varying device coloration are supplied. If the randomization inadvertently removes discriminative acoustic features, the contrastive alignment objective would be undermined.
Authors: We will expand Section 3.4 with the precise implementation details and loss formulation of the acoustic style randomization. We will also include validation experiments (e.g., key-classification accuracy before versus after randomization) demonstrating that key-discriminative information is retained while device coloration is varied, ensuring the subsequent contrastive alignment stage remains effective. revision: yes
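The promised before/after validation can be prototyped cheaply. A sketch under an explicit assumption: style randomization is modeled here as a smooth random gain curve over frequency (one plausible reading of "device coloration"; the paper's actual formulation may differ). Two toy "keys" with distinct spectral peaks are randomized, and the key-discriminative peak location is checked afterward:

```python
import numpy as np

rng = np.random.default_rng(1)

def random_eq(spec, strength=0.3):
    """Multiply each frequency bin by a smooth random gain curve, a crude
    stand-in for keyboard/microphone coloration; time structure untouched."""
    n_freq = spec.shape[0]
    f = np.linspace(-1.0, 1.0, n_freq)
    c0, c1, c2 = rng.normal(0.0, strength, size=3)
    log_gain = c0 + c1 * f + c2 * f ** 2  # low polynomial order => smooth
    return spec * np.exp(log_gain)[:, None]

# Two toy "keys" as 64x8 spectrograms with spectral peaks at bins 10 and 40.
freqs = np.arange(64)
key_a = np.exp(-((freqs - 10) ** 2) / 20.0)[:, None] * np.ones((64, 8))
key_b = np.exp(-((freqs - 40) ** 2) / 20.0)[:, None] * np.ones((64, 8))

# After randomization, the peak (the discriminative cue here) should survive.
peak_a = int(random_eq(key_a).mean(axis=1).argmax())
peak_b = int(random_eq(key_b).mean(axis=1).argmax())
```

A real validation would replace the peak check with the key-classification accuracy of a fixed classifier evaluated on original versus randomized spectrograms, as the rebuttal commits to.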
Circularity Check
No circularity: standard domain-adversarial pipeline evaluated on new dataset
Full rationale
The paper introduces the HEAR dataset and applies a four-stage ML pipeline (normalization, adversarial disentanglement, contrastive alignment, style randomization) plus optional LLM post-processing. No equations, fitted parameters, or self-citations are presented as load-bearing derivations that reduce the reported cross-keyboard gains to quantities defined solely by the method's own inputs. The central claims rest on empirical benchmark results rather than any self-referential construction or renaming of known patterns. This is the expected non-finding for an applied ML security paper whose improvements are measured externally against baselines on held-out data.
Axiom & Free-Parameter Ledger
free parameters (2)
- adversarial loss weight
- contrastive temperature
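The contrastive temperature enters through the supervised contrastive objective (Khosla et al., 2020) that the alignment stage reportedly uses. A minimal NumPy sketch; the batch contents are purely illustrative:

```python
import numpy as np

def supcon_loss(emb, labels, temperature=0.07):
    """Supervised contrastive loss: samples with the same key label
    (possibly from different keyboards) are pulled together, others apart."""
    z = emb / np.linalg.norm(emb, axis=1, keepdims=True)  # L2-normalize
    sim = z @ z.T / temperature
    n = len(labels)
    off_diag = ~np.eye(n, dtype=bool)
    # Softmax denominator over all *other* samples (self excluded).
    denom = np.where(off_diag, np.exp(sim), 0.0).sum(axis=1, keepdims=True)
    log_prob = sim - np.log(denom)
    pos = (labels[:, None] == labels[None, :]) & off_diag
    return float((-(log_prob * pos).sum(axis=1) / pos.sum(axis=1)).mean())

labels = np.array([0, 0, 1, 1])
# Embeddings clustered by key label vs. embeddings ignoring the label.
aligned = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])
mixed   = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]])
```

With these toy batches, `supcon_loss(aligned, labels)` is much lower than `supcon_loss(mixed, labels)`: the loss rewards exactly the cross-keyboard key consistency the alignment stage is meant to enforce. Lower temperatures sharpen that contrast.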
axioms (1)
- domain assumption: acoustic features contain separable components for keyboard identity and individual keystroke identity