pith. machine review for the scientific record.

arxiv: 2605.03384 · v1 · submitted 2026-05-05 · 💻 cs.CR · cs.SD

Recognition: unknown

DECKER: Domain-invariant Embedding for Cross-Keyboard Extraction and Recognition

Authors on Pith: no claims yet

Pith reviewed 2026-05-07 15:49 UTC · model grok-4.3

classification 💻 cs.CR cs.SD
keywords acoustic side-channel attacks · keystroke recognition · domain adaptation · adversarial disentanglement · cross-keyboard generalization · HEAR dataset · language model rectification

The pith

A four-stage embedding pipeline identifies keystrokes across different keyboards and users by removing device-specific acoustic features.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces the HEAR dataset containing recordings from 53 users typing on 37 laptop keyboards across external microphones, device microphones, and VoIP streams. It proposes DECKER to turn raw typing audio into domain-invariant embeddings through four processing stages that strip away keyboard-specific sound coloration. This setup improves recognition accuracy when the keyboard or the person typing changes, and an added language-model step corrects likely errors in the resulting key sequences. The work shows that acoustic side-channel attacks can still extract sensitive typing information even under realistic variation in hardware and conditions.
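
Where the language-model step slots in can be made concrete. The sketch below is a deliberately simplified stand-in, not the paper's method: it re-ranks per-keystroke acoustic guesses with a character-bigram prior rather than an LLM, and the corpus string, candidate lists, and scores are all illustrative.

```python
# Toy stand-in for sequence-level rectification: re-rank per-keystroke acoustic
# candidates with a character-bigram language prior. The paper uses an LLM for
# this step; the corpus, scores, and candidate lists here are illustrative only.
import math
from collections import Counter
from itertools import product

CORPUS = "the password is hidden in the keyboard acoustics"  # illustrative text

def bigram_logprob(text, counts, totals, alpha=1.0, vocab=27):
    """Add-one-smoothed character-bigram log probability of a string."""
    score = 0.0
    for a, b in zip(text, text[1:]):
        score += math.log((counts[(a, b)] + alpha) / (totals[a] + alpha * vocab))
    return score

def rectify(per_key_candidates, counts, totals):
    """per_key_candidates: one list of (char, acoustic_logprob) pairs per keystroke.
    Returns the full sequence with the best combined acoustic + language score."""
    best, best_score = None, -math.inf
    for combo in product(*per_key_candidates):
        text = "".join(c for c, _ in combo)
        score = sum(lp for _, lp in combo) + bigram_logprob(text, counts, totals)
        if score > best_score:
            best, best_score = text, score
    return best

counts = Counter(zip(CORPUS, CORPUS[1:]))
totals = Counter(CORPUS[:-1])
# Each keystroke arrives with its top acoustic guesses; the language prior
# resolves confusions such as "o" versus "0".
noisy = [[("p", -0.1)], [("a", -0.2), ("q", -0.3)], [("s", -0.1)], [("s", -0.1)],
         [("w", -0.2), ("e", -0.25)], [("o", -0.3), ("0", -0.2)], [("r", -0.1)], [("d", -0.1)]]
print(rectify(noisy, counts, totals))  # -> "password"
```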

Core claim

DECKER applies keyboard signature normalization to reduce device coloration, domain-adversarial disentanglement to suppress keyboard identity, supervised cross-keyboard contrastive alignment to keep key identity consistent, and acoustic style randomization to handle unseen keyboards. When evaluated on the HEAR benchmark, this produces better keystroke identification than conventional features or pre-trained audio representations, especially in cross-keyboard and cross-user cases, with further gains from LLM-based rectification of full sequences using linguistic context.
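
A minimal sketch of how stages (2) and (3) could be wired together, assuming a gradient-reversal domain classifier in the style of Ganin and Lempitsky and a standard supervised contrastive loss; the encoder is a placeholder for ECAPA-TDNN, and lambda_adv, tau, and all dimensions are guesses rather than the paper's settings.

```python
# Hedged sketch of stages (2) and (3): a gradient-reversal keyboard classifier
# plus a supervised contrastive loss over key labels. Not the paper's code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; flips the gradient sign in the backward pass."""
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None

def supcon_loss(embeddings, labels, tau=0.1):
    """Supervised contrastive loss: pull same-key embeddings together across
    keyboards, push different keys apart."""
    z = F.normalize(embeddings, dim=1)
    sim = (z @ z.t()) / tau
    sim.fill_diagonal_(-1e9)                                  # exclude self-pairs
    pos = (labels.unsqueeze(0) == labels.unsqueeze(1)).float()
    pos.fill_diagonal_(0.0)
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    return -(log_prob * pos).sum(1).div(pos.sum(1).clamp(min=1)).mean()

class KeystrokeModel(nn.Module):
    def __init__(self, feat_dim=192, n_keys=26, n_keyboards=37):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(40, 256), nn.ReLU(),
                                     nn.Linear(256, feat_dim))   # stand-in for ECAPA-TDNN
        self.key_head = nn.Linear(feat_dim, n_keys)              # keystroke identity
        self.kbd_head = nn.Linear(feat_dim, n_keyboards)         # keyboard identity (adversary)

    def forward(self, x, lambda_adv=0.3):
        z = self.encoder(x)
        return z, self.key_head(z), self.kbd_head(GradReverse.apply(z, lambda_adv))

# One hypothetical training step on a batch of (features, key label, keyboard label).
model = KeystrokeModel()
x, key_y, kbd_y = torch.randn(64, 40), torch.randint(0, 26, (64,)), torch.randint(0, 37, (64,))
z, key_logits, kbd_logits = model(x)
loss = (F.cross_entropy(key_logits, key_y)
        + F.cross_entropy(kbd_logits, kbd_y)   # gradient reversal pushes z toward keyboard-invariance
        + supcon_loss(z, key_y))
loss.backward()
```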

What carries the argument

DECKER's four-stage pipeline of normalization, adversarial disentanglement, contrastive alignment, and style randomization, which isolates keystroke identity from keyboard-specific acoustic coloration.
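
The normalization stage is the most self-contained of the four. The paper's exact operation is not quoted here; the sketch below assumes a CMVN-style per-keyboard mean and variance normalization of log-mel features as one plausible way to "reduce device coloration", with all shapes and names made up for illustration.

```python
# Hedged sketch of Keyboard Signature Normalization as per-keyboard CMVN:
# subtract each keyboard's long-term spectral mean and divide by its std,
# removing the device's average coloration while keeping per-keystroke detail.
import numpy as np

def ksn_normalize(logmels_by_keyboard):
    """logmels_by_keyboard: dict keyboard_id -> array [n_keystrokes, n_frames, n_mels]."""
    out = {}
    for kbd, feats in logmels_by_keyboard.items():
        mu = feats.mean(axis=(0, 1), keepdims=True)          # keyboard-wide spectral signature
        sigma = feats.std(axis=(0, 1), keepdims=True) + 1e-6
        out[kbd] = (feats - mu) / sigma
    return out

# Toy check: the same keystrokes recorded through two different fixed spectral
# colorations; KSN removes most of the between-keyboard offset.
rng = np.random.default_rng(0)
keystrokes = rng.normal(size=(100, 20, 40))
data = {"kbd_a": keystrokes + rng.normal(size=(1, 1, 40)),
        "kbd_b": keystrokes + rng.normal(size=(1, 1, 40))}
norm = ksn_normalize(data)
gap_before = np.abs(data["kbd_a"].mean((0, 1)) - data["kbd_b"].mean((0, 1))).mean()
gap_after = np.abs(norm["kbd_a"].mean((0, 1)) - norm["kbd_b"].mean((0, 1))).mean()
print(f"mean spectral gap before KSN: {gap_before:.3f}, after KSN: {gap_after:.3f}")
```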

If this is right

  • Keystroke identification improves over strong baselines in cross-keyboard and cross-user conditions on the HEAR dataset.
  • LLM-based sentence rectification adds measurable gains by correcting sequences with linguistic context.
  • The approach works across the three capture settings of external microphones, device microphones, and VoIP streaming.
  • Acoustic side-channel attacks remain effective even when users, keyboards, and noise conditions vary widely.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same separation of identity from device signature could be tested on other acoustic emanations such as distinguishing different mechanical switches or recognizing spoken words despite microphone variation.
  • Keyboard makers might adopt similar randomization during manufacturing to reduce unique acoustic fingerprints.
  • Extending the style randomization stage to generate synthetic data for keyboards outside the original 37 would test broader generalization.

Load-bearing premise

The adversarial disentanglement and style randomization steps can remove keyboard-specific sound features while still keeping the acoustic details that distinguish one key from another.
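
One way to put this premise under direct test, echoing the referee's first major comment below: fit linear probes on frozen embeddings and compare a keyboard-identity probe against the 1/37 (about 2.7%) chance level while checking that a key-identity probe stays high. The arrays below are placeholders for exported DECKER embeddings and labels.

```python
# Hedged verification sketch: linear probes on frozen embeddings. The embedding
# source, shapes, and label counts are assumptions; real exported features and
# labels would replace the placeholder arrays.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def probe_accuracy(embeddings, labels, seed=0):
    """Fit a linear probe on half the embeddings, report accuracy on the rest."""
    x_tr, x_te, y_tr, y_te = train_test_split(
        embeddings, labels, test_size=0.5, random_state=seed, stratify=labels)
    clf = LogisticRegression(max_iter=2000).fit(x_tr, y_tr)
    return clf.score(x_te, y_te)

# Placeholder data standing in for exported embeddings, key labels, keyboard labels.
rng = np.random.default_rng(0)
z = rng.normal(size=(2000, 192))
key_labels = rng.integers(0, 26, size=2000)
keyboard_labels = rng.integers(0, 37, size=2000)

key_acc = probe_accuracy(z, key_labels)
kbd_acc = probe_accuracy(z, keyboard_labels)
print(f"key probe: {key_acc:.3f} (want high), keyboard probe: {kbd_acc:.3f} "
      f"(want near chance 1/37 = {1/37:.3f})")
```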

What would settle it

Apply the trained DECKER model to keystroke recordings from a laptop keyboard model and microphone setup entirely absent from the 37 keyboards in HEAR, then measure whether identification accuracy remains clearly above that of the paper's conventional-feature and pre-trained-representation baselines.
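
A leave-one-keyboard-out protocol over the 37 HEAR keyboards is the nearest in-dataset proxy for that test (the stronger version uses hardware outside HEAR entirely). The sketch below shows only the protocol scaffolding; train_fn and eval_fn are hypothetical placeholders for the actual training and scoring code.

```python
# Hedged sketch of a leave-one-keyboard-out evaluation loop. Only the protocol
# structure is real; dataset access, training, and scoring are stubbed out.
import numpy as np

def leave_one_keyboard_out(dataset, keyboard_ids, train_fn, eval_fn):
    """Return per-held-out-keyboard accuracy for a model trained on the rest."""
    results = {}
    for held_out in keyboard_ids:
        train_kbds = [k for k in keyboard_ids if k != held_out]
        model = train_fn(dataset, train_kbds)
        results[held_out] = eval_fn(model, dataset, held_out)
    return results

# Toy stand-ins so the protocol runs end to end.
keyboards = [f"kbd_{i:02d}" for i in range(37)]
dummy_dataset = {k: None for k in keyboards}
rng = np.random.default_rng(0)
train_fn = lambda data, kbds: {"n_train_keyboards": len(kbds)}
eval_fn = lambda model, data, kbd: float(rng.uniform(0.3, 0.8))   # placeholder accuracy

scores = leave_one_keyboard_out(dummy_dataset, keyboards, train_fn, eval_fn)
print(f"mean held-out accuracy: {np.mean(list(scores.values())):.3f} over {len(scores)} keyboards")
```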

Figures

Figures reproduced from arXiv: 2605.03384 by Arun Balaji Buduru, Bikrant Bikram Pratap Maurya, Daksh Agarwal, Nitin Choudhury.

Figure 1: Threat scenario. Bob types on a laptop in a pub …
Figure 2: DECKER pipeline. (1) Raw keystrokes are normalized with KSN, (2) augmented using ASR, (3) encoded with ECAPA …
Figure 3: KSN suppresses keyboard-specific spectral coloration. The device-colored sample shows strong resonance bands that …
Figure 4: t-SNE visualization of ECAPA-TDNN embeddings. Left: Without KSN, embeddings are dominated by keyboard-specific clustering. Right: With KSN enabled, keyboard-dependent …
read the original abstract

Acoustic side-channel attacks (ASCA) on keyboards pose a significant security risk, as keystrokes can be inferred from typing acoustics, revealing sensitive information. Prior ASCA studies are limited by small-scale datasets with restricted diversity in users, keyboards, and environments, constraining analysis across devices, microphones, and noise conditions. We introduce HEAR, a dataset designed to study ASCA along three axes: keyboard generalization, noise adaptation, and user bias. HEAR contains recordings from 53 participants using 37 laptop keyboards, collected in three realistic settings: (1) external microphone capture, (2) device microphone capture without network noise, and (3) VoIP-based streaming capture. This enables controlled evaluation across users, keyboards, and environments. On HEAR, we establish an ASCA benchmark spanning conventional features and pre-trained representations from raw audio and spectrograms in unimodal and multimodal settings. We propose DECKER, a domain-invariant keystroke inference framework with four stages: (1) Keyboard Signature Normalization to reduce device coloration, (2) domain-adversarial disentanglement to suppress keyboard identity, (3) supervised cross-keyboard contrastive alignment to enforce key consistency, and (4) Acoustic Style Randomization to synthesize unseen keyboard responses. We further explore sentence-level inference using an LLM-based post-processing layer to refine keystroke sequences via linguistic context. Results on HEAR show DECKER improves keystroke identification over strong baselines, particularly in cross-keyboard and cross-user settings, with further gains from language-model rectification. These findings highlight that ASCA remains effective across diverse users, devices, and noisy environments, underscoring its practical security risk.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces the HEAR dataset of acoustic keystroke recordings from 53 participants across 37 laptop keyboards in three realistic capture settings (external microphone, device microphone, and VoIP streaming). It proposes DECKER, a four-stage domain-invariant embedding framework consisting of keyboard signature normalization, domain-adversarial disentanglement, supervised cross-keyboard contrastive alignment, and acoustic style randomization, augmented by an LLM-based post-processing layer for sequence rectification. The central claim is that DECKER yields improved keystroke identification over baselines on HEAR, particularly in cross-keyboard and cross-user settings.

Significance. If the reported gains are substantiated by quantitative metrics and the domain-invariance mechanism is verified, the work would establish a valuable large-scale benchmark for acoustic side-channel attacks and demonstrate that such attacks remain practical across diverse hardware and environments. The multi-stage pipeline and dataset release would strengthen the contribution to both security analysis and domain-adaptation methods in audio.

major comments (3)
  1. [§4.2] Domain-adversarial disentanglement stage: the manuscript does not report the final keyboard-classification accuracy of a discriminator applied to the learned embeddings. Without evidence that this accuracy approaches the random baseline of approximately 2.7% for 37 classes, the claim that keyboard identity has been suppressed cannot be confirmed, leaving open the possibility that cross-keyboard gains arise from dataset correlations rather than the intended invariance.
  2. [§5] Experimental results on HEAR: the central performance claims rest on improvements over baselines in cross-keyboard and cross-user settings, yet no ablation table isolating the contribution of each of the four stages, no definition of the strong baselines, and no quantitative metrics (e.g., accuracy deltas) are referenced in the evaluation. This absence prevents attribution of gains specifically to domain invariance versus the LLM rectification layer.
  3. [§3.4] Acoustic style randomization: the stage is described as synthesizing unseen keyboard responses, but no implementation details, loss formulation, or validation that the randomization preserves key identity while varying device coloration are supplied. If the randomization inadvertently removes discriminative acoustic features, the contrastive alignment objective would be undermined.
minor comments (2)
  1. [Abstract] Performance gains are asserted without any numerical values or baseline names; adding a single sentence summarizing key metrics would improve clarity.
  2. [Method] Notation: the four stages are referred to inconsistently between the abstract and method description; a single numbered list or diagram would aid readability.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the thoughtful and detailed review. We address each major comment below with clarifications and commit to revisions that directly strengthen the verification of our claims.

read point-by-point responses
  1. Referee: [§4.2] Domain-adversarial disentanglement stage: the manuscript does not report the final keyboard-classification accuracy of a discriminator applied to the learned embeddings. Without evidence that this accuracy approaches the random baseline of approximately 2.7% for 37 classes, the claim that keyboard identity has been suppressed cannot be confirmed, leaving open the possibility that cross-keyboard gains arise from dataset correlations rather than the intended invariance.

    Authors: We agree that reporting the keyboard-classification accuracy of the discriminator on the final embeddings is necessary to rigorously confirm successful disentanglement. Although the manuscript emphasizes downstream keystroke identification, we will add this metric in the revised version, showing that accuracy approaches the random baseline of 1/37 ≈ 2.7% and thereby substantiating that keyboard identity has been suppressed rather than relying on dataset correlations. revision: yes

  2. Referee: [§5] Experimental results on HEAR: the central performance claims rest on improvements over baselines in cross-keyboard and cross-user settings, yet no ablation table isolating the contribution of each of the four stages, no definition of the strong baselines, and no quantitative metrics (e.g., accuracy deltas) are referenced in the evaluation. This absence prevents attribution of gains specifically to domain invariance versus the LLM rectification layer.

    Authors: We acknowledge that the current presentation lacks sufficient granularity for attributing gains. In the revision we will insert a dedicated ablation table quantifying the incremental contribution of each of the four DECKER stages, explicitly define the strong baselines (including architectures, training protocols, and feature types), and report concrete accuracy values together with deltas for cross-keyboard and cross-user settings as well as the additional improvement from the LLM layer. revision: yes

  3. Referee: [§3.4] Acoustic style randomization: the stage is described as synthesizing unseen keyboard responses, but no implementation details, loss formulation, or validation that the randomization preserves key identity while varying device coloration are supplied. If the randomization inadvertently removes discriminative acoustic features, the contrastive alignment objective would be undermined.

    Authors: We will expand Section 3.4 with the precise implementation details and loss formulation of the acoustic style randomization. We will also include validation experiments (e.g., key-classification accuracy before versus after randomization) demonstrating that key-discriminative information is retained while device coloration is varied, ensuring the subsequent contrastive alignment stage remains effective. revision: yes
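
For the validation the authors commit to in point 3, one plausible shape of the randomization and its sanity check looks like the sketch below: perturb a keystroke's log-mel features with a random smooth spectral gain curve (a stand-in for an unseen keyboard's coloration), then confirm a key classifier's predictions survive the perturbation. The filter form and strength are illustrative, not the paper's formulation.

```python
# Hedged sketch of Acoustic Style Randomization as a random smooth spectral
# coloration applied to log-mel features, plus the before/after check the
# rebuttal proposes. The curve shape and strength are illustrative guesses.
import numpy as np

def random_coloration(logmel, strength=3.0, n_control=6, rng=None):
    """logmel: [n_frames, n_mels]. Add a random smooth gain curve across mel
    bins, interpolated from a few control points (broadcast over frames)."""
    rng = np.random.default_rng() if rng is None else rng
    n_mels = logmel.shape[1]
    control = rng.uniform(-strength, strength, size=n_control)
    curve = np.interp(np.linspace(0, n_control - 1, n_mels),
                      np.arange(n_control), control)
    return logmel + curve

# Validation idea: a key classifier's accuracy on randomized features should
# stay close to its accuracy on the originals if key identity is preserved.
rng = np.random.default_rng(0)
keystroke = rng.normal(size=(20, 40))
augmented = random_coloration(keystroke, rng=rng)
print(f"mean absolute per-bin change: {np.abs(augmented - keystroke).mean():.3f}")
```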

Circularity Check

0 steps flagged

No circularity: standard domain-adversarial pipeline evaluated on new dataset

full rationale

The paper introduces the HEAR dataset and applies a four-stage ML pipeline (normalization, adversarial disentanglement, contrastive alignment, style randomization) plus optional LLM post-processing. No equations, fitted parameters, or self-citations are presented as load-bearing derivations that reduce the reported cross-keyboard gains to quantities defined solely by the method's own inputs. The central claims rest on empirical benchmark results rather than any self-referential construction or renaming of known patterns. This is the expected non-finding for an applied ML security paper whose improvements are measured externally against baselines on held-out data.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The framework rests on the domain assumption that keyboard identity and keystroke identity are separable in the acoustic signal; no new physical entities are postulated and free parameters are the usual ML hyperparameters whose values are not reported in the abstract.

free parameters (1)
  • adversarial loss weight and contrastive temperature
    Control the strength of domain disentanglement and key alignment; their specific values are not stated in the abstract.
axioms (1)
  • domain assumption: Acoustic features contain separable components for keyboard identity and individual keystroke identity.
    Invoked by the design of domain-adversarial disentanglement and cross-keyboard contrastive alignment stages.
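
Read as an equation, a plausible (assumed, not quoted) form of the combined objective shows where the two free parameters sit: λ_adv scales the gradient-reversed keyboard term and τ is the temperature inside the supervised contrastive term.

```latex
% Assumed overall objective; the paper's exact weighting and notation may differ.
\mathcal{L}
  = \mathcal{L}_{\mathrm{key}}
  + \lambda_{\mathrm{adv}}\,\mathcal{L}_{\mathrm{kbd}}
  + \mathcal{L}_{\mathrm{SupCon}},
\qquad
\mathcal{L}_{\mathrm{SupCon}}
  = \sum_{i} \frac{-1}{|P(i)|} \sum_{p \in P(i)}
    \log \frac{\exp(z_i \cdot z_p / \tau)}{\sum_{a \neq i} \exp(z_i \cdot z_a / \tau)}
```

Here P(i) is the set of in-batch samples sharing sample i's key label and the z are normalized embeddings, following the standard supervised contrastive formulation.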

pith-pipeline@v0.9.0 · 5616 in / 1424 out tokens · 59046 ms · 2026-05-07T15:49:54.361151+00:00 · methodology

discussion (0)

