emg2speech: Synthesizing speech from electromyography using self-supervised speech models

Daniel C. Comstock; Harshavardhana T. Gowda; Lee M. Miller

arXiv: 2510.23969 · v2 · submitted 2025-10-28 · 💻 cs.SD · cs.CL · eess.AS

Pith reviewed 2026-05-18 03:50 UTC · model grok-4.3

classification 💻 cs.SD · cs.CL · eess.AS

keywords electromyography · speech synthesis · self-supervised speech models · neuromuscular interface · amyotrophic lateral sclerosis · silent speech · EMG-to-speech · orofacial muscles

The pith

EMG signals from orofacial muscles map linearly into self-supervised speech representations to generate audio directly.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that self-supervised speech model representations are tightly linked to the electrical power of speech muscles, with a simple linear predictor achieving a correlation of r = 0.85. Distinct articulatory gestures produce separable clusters in EMG power space, indicating that the models implicitly capture articulatory structure. These observations support a direct mapping from recorded EMG signals into the model representation space, which then drives speech synthesis. The resulting system produces audio from silent articulation without building explicit models of the vocal tract or training a separate vocoder. The approach is shown to work on orofacial EMG from a participant with ALS.

Core claim

Self-supervised speech representations are strongly linearly related to the electrical power of muscle activity with a correlation of r = 0.85, and EMG power vectors associated with distinct articulatory gestures form structured, separable clusters. This structure permits mapping EMG signals into the S3 representation space and synthesizing speech end-to-end without explicit articulatory modeling or vocoder training, demonstrated by converting silent orofacial EMG from a participant with ALS into audio.

What carries the argument

Linear mapping of EMG signals into the self-supervised speech (S3) representation space, using the observed correlation with muscle power and the separability of gesture clusters.
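As a rough illustration of the kind of linear relationship described here, the sketch below fits a least-squares map from S3 features to per-channel EMG power and scores it with a held-out Pearson correlation. It runs on synthetic data only; the array shapes, the bias term, the train/test split, and the names `fit_linear_map`, `predict`, and `mean_pearson_r` are assumptions made for illustration and are not taken from the paper.

```python
import numpy as np

def fit_linear_map(H, P):
    """Least-squares fit of P ~ [H, 1] @ W; the last row of W is a bias."""
    H1 = np.hstack([H, np.ones((H.shape[0], 1))])
    W, *_ = np.linalg.lstsq(H1, P, rcond=None)
    return W

def predict(H, W):
    """Apply the fitted map (with bias) to new S3 feature frames."""
    H1 = np.hstack([H, np.ones((H.shape[0], 1))])
    return H1 @ W

def mean_pearson_r(P_true, P_pred):
    """Pearson r computed per output channel, then averaged."""
    a = P_true - P_true.mean(axis=0)
    b = P_pred - P_pred.mean(axis=0)
    r = (a * b).sum(axis=0) / np.sqrt((a ** 2).sum(axis=0) * (b ** 2).sum(axis=0))
    return float(r.mean())

# Synthetic stand-in data (hypothetical sizes, not the paper's recordings).
rng = np.random.default_rng(0)
n_frames, d_s3, n_channels = 2000, 64, 8
H = rng.standard_normal((n_frames, d_s3))            # S3 features per frame
W_true = rng.standard_normal((d_s3, n_channels))     # hidden linear relation
P = H @ W_true + 2.0 * rng.standard_normal((n_frames, n_channels))  # "EMG power"

# Fit on the first 1500 frames, evaluate on the remaining 500.
train, test = slice(0, 1500), slice(1500, None)
W = fit_linear_map(H[train], P[train])
r_heldout = mean_pearson_r(P[test], predict(H[test], W))
print(f"held-out mean Pearson r: {r_heldout:.2f}")
```

Scoring on held-out frames rather than the training frames matters here: with a high-dimensional feature space, an in-sample correlation can look strong even when the map does not generalize.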

If this is right

End-to-end EMG-to-speech generation becomes feasible for neuromuscular speech interfaces.
Patients with ALS or similar conditions can produce audio output from silent orofacial muscle activity.
Separate training of articulatory models and vocoders is unnecessary for this synthesis pipeline.
Self-supervised speech models implicitly encode articulatory mechanisms as reflected in EMG activity.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same linear relationship may hold for other physiological signals that correlate with internal model representations.
This mapping offers a way to probe what physical production details self-supervised models have learned without direct supervision.
Validation across a larger and more diverse set of participants would test whether the mapping generalizes beyond similar individuals.
Real-time versions of the pipeline could support practical silent-speech communication devices.

Load-bearing premise

The observed linear relationship between S3 representations and EMG power together with cluster separability is sufficient to produce intelligible speech from new silent-articulation recordings.

What would settle it

Apply the mapping to fresh silent-articulation EMG recordings from the same or similar participants, synthesize audio, and obtain output that listeners cannot understand or that yields near-chance word recognition accuracy.
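One common way to quantify "word recognition accuracy" in a test like this is word error rate (WER) between a reference sentence and a listener's or recognizer's transcript. The sketch below is a generic WER computation via word-level edit distance; it is not drawn from the paper, which (per the analysis on this page) reports no such metric, and the function name `word_error_rate` is hypothetical.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    print(word_error_rate("the cat sat", "the dog sat"))
```

A WER near or above 1.0 on fresh silent-articulation recordings would correspond to the "listeners cannot understand" outcome described above; values well below that would count against it.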

Figures

Figures reproduced from arXiv: 2510.23969 by Daniel C. Comstock, Harshavardhana T. Gowda, Lee M. Miller.

**Figure 1.** LEFT: Electrode placement on the left side of the neck. MIDDLE: Electrode placement on the right side of the neck. RIGHT: Electrode placement on the left cheek. … complete, they click the mouse again to indicate the end, causing the sentence to disappear from the screen, thus allowing them to articulate at their own pace. We adapt the language corpora from [3], who demonstrated a speech brain-computer interfa…

**Figure 2.** Different orofacial gestures are naturally separable.

**Figure 3.** Layer-wise correlation (r) between D(E) and H across different self-supervised speech models. A simple linear mapping is used to predict D(E) from H. We also examined whether a similar linear mapping exists between EMG spectrogram features (vec(B)) and H. Frequency bands of B are obtained using five log-spaced frequency bins, as described in section 4. However, the resulting correlation coefficients are su…

**Figure 4.** Layer-wise correlation (r) between B and H across different self-supervised speech models. A simple linear mapping is used to predict B from H. Section 5.3 (emg2speech synthesis): As shown earlier, the following relationship holds: $H \xrightarrow{\text{linear mapping}} D(E) \xrightarrow{\text{gesture-specific clustering}} \text{orofacial movements}$. The existence of a simple linear mapping from H to D(E) is significant: it reveals tha…

**Figure 5.** Multivariate EMG signals are converted into …

Original abstract

We present a neuromuscular speech interface that translates electromyographic (EMG) signals recorded from orofacial muscles during speech articulation directly into audio. We find that self-supervised speech (S3) representations are strongly linearly related to the electrical power of muscle activity: a simple linear mapping predicts EMG power from S3 representations with a correlation of r = 0.85. In addition, EMG power vectors associated with distinct articulatory gestures form structured, separable clusters. Together, these observations suggest that S3 models implicitly encode articulatory mechanisms, as reflected in EMG activity. Leveraging this structure, we map EMG signals into the S3 representation space and synthesize speech, enabling end-to-end EMG-to-speech generation without explicit articulatory modeling or vocoder training. We demonstrate this system with a participant with amyotrophic lateral sclerosis (ALS), converting orofacial EMG recorded while she silently articulated speech into audio.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The paper shows a linear EMG-to-S3 mapping that lets them synthesize audio from silent orofacial EMG in one ALS participant, but the evidence rests on a coarse power correlation and lacks quantitative checks on output quality.

The main takeaway is that this work identifies a strong linear relationship between self-supervised speech representations and EMG power from facial muscles, then uses it to map silent EMG recordings directly into audio synthesis for an ALS patient. It is a straightforward pipeline that avoids building custom articulatory models or training vocoders from scratch. The r = 0.85 correlation and the observation that EMG power vectors form separable clusters for different gestures are the concrete observations they build on. That part is clean and incremental: it takes existing S3 models as a black box and shows they already carry enough structure to support a simple inversion for this use case. The single-participant demonstration on silent articulation adds a practical angle for assistive devices in motor speech disorders. Those elements are worth noting because they keep the method low on free parameters and leverage pre-trained artifacts rather than starting from zero.

The soft spots sit mainly in how much the central claim is actually supported. The reported correlation is specifically with EMG power, which is a summary statistic that drops fine temporal structure and channel timing. Inverting from EMG back to full S3 representations therefore risks under-constraining the phonetic content, especially when moving from voiced training data to silent test recordings. The abstract describes a successful demonstration but supplies no word error rates, perceptual scores, spectrogram comparisons, or error bars, so it is difficult to judge whether the output is intelligible enough to be useful. With results from only one participant, questions about reproducibility and generalization remain open. The full methods would need to spell out exactly how the EMG-to-S3 regressor is trained and inverted.

This paper is for researchers working on neural interfaces or low-data EMG speech systems who want a simple starting point rather than a fully validated system. A reader focused on practical assistive tech could extract the linear-mapping idea and the cluster observation even if they plan to add more robust validation later. It deserves a serious referee because the approach is grounded in observable data, the application is relevant, and the simplicity makes it easy to test or extend. I would recommend sending it to peer review with the expectation that reviewers will ask for quantitative synthesis metrics and additional participants.

Referee Report

2 major / 2 minor

Summary. The paper introduces emg2speech, a neuromuscular interface that translates orofacial EMG signals recorded during speech articulation into audio using self-supervised speech (S3) models. It reports a strong linear correlation (r = 0.85) between S3 representations and EMG power, notes that EMG power vectors for distinct articulatory gestures form separable clusters, and demonstrates mapping EMG signals into S3 space to synthesize speech from silent articulations by an ALS participant, without explicit articulatory modeling or vocoder training.

Significance. If the synthesis produces intelligible speech, the approach could be significant for assistive speech technologies, particularly for ALS patients, by exploiting pre-trained S3 models to bypass custom vocoder training and explicit articulatory inversion. The reported linear relationship and cluster structure provide an interesting empirical link between EMG activity and S3 representations that may inform representation learning. However, the absence of quantitative synthesis metrics limits the strength of these implications.

major comments (2)

[Abstract] The central claim that the system enables 'end-to-end EMG-to-speech generation' and successfully synthesizes speech from silent EMG is load-bearing but supported only by a single-participant demonstration. No quantitative metrics (e.g., word error rate, MOS, or intelligibility scores), error bars, or baseline comparisons are reported, leaving the claim only partially evidenced.
[Methods/Results, mapping and inversion] The linear relationship is established between S3 representations and EMG power (a scalar summary statistic per channel or time window). It is unclear how this is inverted to recover full time-varying S3 representations that preserve phonetic detail; power discards temporal structure and channel timing, which risks under-constraining the synthesized output for new silent articulations.

minor comments (2)

[Abstract] Provide confidence intervals or statistical details for the reported r = 0.85 correlation.
[Results] Consider adding a figure or table showing example synthesized waveforms or spectrograms alongside ground-truth or reference audio for the ALS demonstration.
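For the first minor comment, one standard way to attach a confidence interval to a Pearson correlation is the Fisher z-transform. The sketch below is a generic illustration, not the authors' analysis: the sample size `n` is a placeholder (it is not stated on this page), and the method assumes independent, roughly bivariate-normal samples. Consecutive EMG frames are likely autocorrelated, so the effective sample size would be smaller than the raw frame count and the true interval wider.

```python
import math
from typing import Tuple

def pearson_ci(r: float, n: int, z_crit: float = 1.959964) -> Tuple[float, float]:
    """Approximate CI for a Pearson r via the Fisher z-transform.

    r      : observed correlation, strictly between -1 and 1
    n      : number of independent samples (must exceed 3)
    z_crit : standard-normal quantile; 1.959964 gives a 95% interval
    """
    if not -1.0 < r < 1.0:
        raise ValueError("r must lie strictly between -1 and 1")
    if n <= 3:
        raise ValueError("n must be greater than 3")
    z = math.atanh(r)                # Fisher transform of r
    se = 1.0 / math.sqrt(n - 3)      # approximate standard error of z
    return math.tanh(z - z_crit * se), math.tanh(z + z_crit * se)


if __name__ == "__main__":
    # n = 100 is a hypothetical sample size chosen only for illustration.
    low, high = pearson_ci(0.85, n=100)
    print(f"95% CI for r = 0.85 with n = 100: ({low:.3f}, {high:.3f})")
```

The interval is asymmetric around r because the transform stretches values near ±1, which is the expected behavior for correlations this high.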

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback. We address each major comment below and have revised the manuscript to improve clarity on the synthesis procedure and to better contextualize the scope of our claims.

Referee: [Abstract] The central claim that the system enables 'end-to-end EMG-to-speech generation' and successfully synthesizes speech from silent EMG is load-bearing but supported only by a single-participant demonstration. No quantitative metrics (e.g., word error rate, MOS, or intelligibility scores), error bars, or baseline comparisons are reported, leaving the claim only partially evidenced.

Authors: We agree that the speech synthesis result is a single-participant demonstration without quantitative metrics such as WER, MOS, or intelligibility scores. The primary contributions of the work are the reported linear correlation (r = 0.85) between S3 representations and EMG power and the observation that EMG power vectors form separable clusters. The synthesis example illustrates a potential application rather than serving as the main claim. We have revised the abstract and added a dedicated limitations paragraph in the discussion to explicitly note the preliminary nature of the synthesis result and the absence of objective metrics. We also softened the phrasing around 'end-to-end EMG-to-speech generation' to reflect the current evidence level.
Revision: yes

Referee: [Methods/Results, mapping and inversion] The linear relationship is established between S3 representations and EMG power (a scalar summary statistic per channel or time window). It is unclear how this is inverted to recover full time-varying S3 representations that preserve phonetic detail; power discards temporal structure and channel timing, which risks under-constraining the synthesized output for new silent articulations.

Authors: The referee correctly notes that the linear mapping relates S3 representations to EMG power, which is a summary statistic. To invert this for synthesis, the manuscript uses the observed cluster structure: EMG power vectors from silent articulations are matched to the nearest cluster centroid derived from the training data, and the corresponding S3 representation from that cluster is selected and passed to the pre-trained S3 decoder. This leverages the separability of articulatory gestures rather than a direct matrix inversion of the power scalar. We acknowledge that this approach may lose fine temporal detail present in the original EMG time series. We have substantially expanded the Methods section with a step-by-step description of the cluster-based inversion procedure, including pseudocode, and added a paragraph discussing the limitations of using power as an intermediate representation.
Revision: yes
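The nearest-centroid step described in the second simulated response can be sketched as follows. This follows only the rebuttal's wording, which is machine-simulated and not verified against the paper's actual method; the function name `nearest_centroid_lookup`, the array shapes, and the idea of one stored S3 prototype per cluster are all assumptions. The S3 decoder itself is omitted.

```python
import numpy as np

def nearest_centroid_lookup(power, centroids, s3_prototypes):
    """Map each EMG power frame to the S3 prototype of its nearest centroid.

    power         : (T, C) EMG power vectors, one row per time frame
    centroids     : (K, C) cluster centroids in EMG power space
    s3_prototypes : (K, D) one S3 representation per cluster (hypothetical)
    Returns (T, D) selected S3 frames and the (T,) chosen cluster indices.
    """
    # Squared Euclidean distance from every frame to every centroid: (T, K).
    diffs = power[:, None, :] - centroids[None, :, :]
    sq_dist = (diffs ** 2).sum(axis=-1)
    idx = sq_dist.argmin(axis=1)
    return s3_prototypes[idx], idx

# Toy example with two clusters in a 2-channel power space.
centroids = np.array([[0.0, 0.0], [10.0, 10.0]])
s3_prototypes = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
power = np.array([[0.5, -0.2], [9.0, 11.0], [1.0, 1.0]])

s3_frames, idx = nearest_centroid_lookup(power, centroids, s3_prototypes)
print(idx)
```

Because every frame assigned to a cluster receives the same prototype, a lookup of this kind yields a piecewise-constant S3 sequence, which is one concrete form of the loss of fine temporal detail that the response acknowledges.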

Circularity Check

0 steps flagged

No significant circularity; derivation rests on external pre-trained models and empirical data fits

full rationale

The paper's chain begins with an observed linear correlation (r = 0.85) between S3 representations and EMG power plus cluster separability, both obtained from data analysis on recorded signals. These empirical findings are then used to construct a mapping from EMG to S3 space for synthesis. S3 models are external pre-trained artifacts, and the linear mapping is a fitted regressor rather than a definitional identity. No self-citations, uniqueness theorems, or ansatzes from prior author work are invoked as load-bearing premises. The synthesis claim is therefore not equivalent to its inputs by construction and remains falsifiable against held-out intelligibility metrics.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

The central claim depends on a fitted linear mapping whose coefficients are not reported as fixed and on the domain assumption that S3 representations implicitly capture articulatory information reflected in EMG.

free parameters (1)

linear mapping matrix
Coefficients of the linear transform from S3 representations to EMG power are learned from data and not derived from first principles.

axioms (1)

domain assumption: S3 representations encode articulatory mechanisms visible in EMG power
Invoked to justify the mapping and cluster separability as evidence for implicit encoding.

pith-pipeline@v0.9.0 · 5698 in / 1195 out tokens · 51398 ms · 2026-05-18T03:50:46.898949+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: a simple linear mapping predicts EMG power from S3 representations with a correlation of r = 0.85

IndisputableMonolith/Foundation/AlexanderDuality.lean · alexander_duality_circle_linking
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: EMG power vectors associated with distinct articulatory gestures form structured, separable clusters

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

26 extracted references · 26 canonical work pages

[1] Maitreyee Wairagkar, Nicholas S Card, Tyler Singer-Clark, Xianda Hou, Carrina Iacobacci, Lee M Miller, Leigh R Hochberg, David M Brandman, and Sergey D Stavisky. An instantaneous voice-synthesis neuroprosthesis. Nature, pages 1–8, 2025.

[2] Sean L Metzger, Kaylo T Littlejohn, Alexander B Silva, David A Moses, Margaret P Seaton, Ran Wang, Maximilian E Dougherty, Jessie R Liu, Peter Wu, Michael A Berger, et al. A high-performance neuroprosthesis for speech decoding and avatar control. Nature, 620(7976):1037–1046, 2023.

[3] Francis R Willett, Erin M Kunz, Chaofei Fan, Donald T Avansino, Guy H Wilson, Eun Young Choi, Foram Kamdar, Matthew F Glasser, Leigh R Hochberg, Shaul Druckmann, et al. A high-performance speech neuroprosthesis. Nature, 620(7976):1031–1036, 2023.

[4] Minsu Kim, Joanna Hong, and Yong Man Ro. Lip to speech synthesis with visual context attentional gan. Advances in Neural Information Processing Systems, 34:2758–2770, 2021.

[5] KR Prajwal, Rudrabha Mukhopadhyay, Vinay P Namboodiri, and CV Jawahar. Learning individual speaking styles for accurate lip to speech synthesis. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 13796–13805, 2020.

[6] Kaylo T Littlejohn, Cheol Jun Cho, Jessie R Liu, Alexander B Silva, Bohan Yu, Vanessa R Anderson, Cady M Kurtz-Miott, Samantha Brosler, Anshul P Kashyap, Irina P Hallinan, et al. A streaming brain-to-voice neuroprosthesis to restore naturalistic communication. Nature neuroscience, pages 1–11, 2025.

[7] David Gaddy and Dan Klein. Digital voicing of silent speech. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 5521–5530, 2020.

[8] David Gaddy and Dan Klein. An improved model for voicing silent speech. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 2: Short Papers), pages 175–181, 2021.

[9] Kushal Lakhotia, Eugene Kharitonov, Wei-Ning Hsu, Yossi Adi, Adam Polyak, Benjamin Bolte, Tu-Anh Nguyen, Jade Copet, Alexei Baevski, Abdelrahman Mohamed, and Emmanuel Dupoux. On generative spoken language modeling from raw audio. Transactions of the Association for Computational Linguistics, 9:1336–1354, 2021.

[10] Yuxuan Wang, RJ Skerry-Ryan, Daisy Stanton, Yonghui Wu, Ron J. Weiss, Navdeep Jaitly, Zongheng Yang, Ying Xiao, Zhifeng Chen, Samy Bengio, Quoc Le, Yannis Agiomyrgiannakis, Rob Clark, and Rif A. Saurous. Tacotron: Towards end-to-end speech synthesis. 2017.

[11] Szu-Chen Jou, Tanja Schultz, Matthias Walliczek, Florian Kraft, and Alex Waibel. Towards continuous speech recognition using surface electromyography. In Ninth International Conference on Spoken Language Processing, 2006.

[12] Arnav Kapur, Utkarsh Sarawgi, Eric Wadkins, Matthew Wu, Nora Hollenstein, and Pattie Maes. Non-invasive silent speech recognition in multiple sclerosis with dysphonia. In Machine Learning for Health Workshop, pages 25–38. PMLR, 2020.

[13] Geoffrey S Meltzner, James T Heaton, Yunbin Deng, Gianluca De Luca, Serge H Roy, and Joshua C Kline. Development of semg sensors and algorithms for silent speech recognition. Journal of neural engineering, 15(4):046031, 2018.

[14] Arthur R. Toth, Michael Wand, and Tanja Schultz. Synthesizing speech from electromyography using voice transformation techniques. In Interspeech 2009, pages 652–655, 2009.

[15] Matthias Janke and Lorenz Diener. Emg-to-speech: Direct generation of speech from facial electromyographic signals. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 25(12):2375–2385, 2017.

[16] Lorenz Diener, Gerrit Felsch, Miguel Angrick, and Tanja Schultz. Session-independent array-based emg-to-speech conversion using convolutional neural networks. In Speech Communication; 13th ITG-Symposium, pages 1–5, 2018.

[17] Patrick Kaifosh, Thomas R. Reardon, and CTRL labs at Reality Labs. A generic non-invasive neuromotor interface for human-computer interaction. Nature, 645:702–711, 2025.

[18] Alexei Baevski, Yuhao Zhou, Abdelrahman Mohamed, and Michael Auli. wav2vec 2.0: A framework for self-supervised learning of speech representations. Advances in neural information processing systems, 33:12449–12460, 2020.

[19] Wei-Ning Hsu, Benjamin Bolte, Yao-Hung Hubert Tsai, Kushal Lakhotia, Ruslan Salakhutdinov, and Abdelrahman Mohamed. Hubert: Self-supervised speech representation learning by masked prediction of hidden units. IEEE/ACM transactions on audio, speech, and language processing, 29:3451–3460, 2021.

[20] Sanyuan Chen, Chengyi Wang, Zhengyang Chen, Yu Wu, Shujie Liu, Zhuo Chen, Jinyu Li, Naoyuki Kanda, Takuya Yoshioka, Xiong Xiao, et al. Wavlm: Large-scale self-supervised pre-training for full stack speech processing. IEEE Journal of Selected Topics in Signal Processing, 16(6):1505–1518, 2022.

[21] Viswanath Sivakumar, Jeffrey Seely, Alan Du, Sean R Bittner, Adam Berenzweig, Anuoluwapo Bolarinwa, Alexandre Gramfort, and Michael I Mandel. emg2qwerty: A large dataset with baselines for touch typing using surface electromyography. In The Thirty-eight Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2024.

[22] Leonard Kaufman and Peter J. Rousseeuw. Partitioning around medoids (program pam). In Wiley Series in Probability and Statistics, pages 68–125. John Wiley & Sons, Inc., Hoboken, NJ, USA, March 8 1990. Retrieved 2021-06-13.

[23] D. M. Halliday and S. F. Farmer. On the need for rectification of surface emg. Journal of Neurophysiology, 103(6):3547, June 2010.

[24] Cheol Jun Cho, Peter Wu, Abdelrahman Mohamed, and Gopala K Anumanchipalli. Evidence of vocal tract articulation in self-supervised learning of speech. In ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1–5. IEEE, 2023.

[25] Cheol Jun Cho, Abdelrahman Mohamed, Alan W Black, and Gopala K Anumanchipalli. Self-supervised models of speech infer universal articulatory kinematics. In ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 12061–12065. IEEE, 2024.

[26] Alex Graves, Santiago Fernández, Faustino Gomez, and Jürgen Schmidhuber. Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks. In Proceedings of the 23rd international conference on Machine learning, pages 369–376, 2006.

[1] [1]

An instantaneous voice-synthesis neuroprosthesis.Nature, pages 1–8, 2025

Maitreyee Wairagkar, Nicholas S Card, Tyler Singer-Clark, Xianda Hou, Carrina Iacobacci, Lee M Miller, Leigh R Hochberg, David M Brandman, and Sergey D Stavisky. An instantaneous voice-synthesis neuroprosthesis.Nature, pages 1–8, 2025

work page 2025

[2] [2]

A high-performance neuroprosthesis for speech decoding and avatar control.Nature, 620(7976):1037–1046, 2023

Sean L Metzger, Kaylo T Littlejohn, Alexander B Silva, David A Moses, Margaret P Seaton, Ran Wang, Maximilian E Dougherty, Jessie R Liu, Peter Wu, Michael A Berger, et al. A high-performance neuroprosthesis for speech decoding and avatar control.Nature, 620(7976):1037–1046, 2023

work page 2023

[3] [3]

A high-performance speech neuroprosthesis.Nature, 620(7976):1031–1036, 2023

Francis R Willett, Erin M Kunz, Chaofei Fan, Donald T Avansino, Guy H Wilson, Eun Young Choi, Foram Kamdar, Matthew F Glasser, Leigh R Hochberg, Shaul Druckmann, et al. A high-performance speech neuroprosthesis.Nature, 620(7976):1031–1036, 2023

work page 2023

[4] [4]

Lip to speech synthesis with visual context attentional gan

Minsu Kim, Joanna Hong, and Yong Man Ro. Lip to speech synthesis with visual context attentional gan. Advances in Neural Information Processing Systems, 34:2758–2770, 2021

work page 2021

[5] [5]

Learning individual speaking styles for accurate lip to speech synthesis

KR Prajwal, Rudrabha Mukhopadhyay, Vinay P Namboodiri, and CV Jawahar. Learning individual speaking styles for accurate lip to speech synthesis. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 13796–13805, 2020

work page 2020

[6] [6]

A streaming brain-to- voice neuroprosthesis to restore naturalistic communication.Nature neuroscience, pages 1–11, 2025

Kaylo T Littlejohn, Cheol Jun Cho, Jessie R Liu, Alexander B Silva, Bohan Yu, Vanessa R Anderson, Cady M Kurtz-Miott, Samantha Brosler, Anshul P Kashyap, Irina P Hallinan, et al. A streaming brain-to- voice neuroprosthesis to restore naturalistic communication.Nature neuroscience, pages 1–11, 2025

work page 2025

[7] [7]

Digital voicing of silent speech

David Gaddy and Dan Klein. Digital voicing of silent speech. InProceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 5521–5530, 2020. 9

work page 2020

[8] [8]

An improved model for voicing silent speech

David Gaddy and Dan Klein. An improved model for voicing silent speech. InProceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (V olume 2: Short Papers), pages 175–181, 2021

work page 2021

[9] [9]

On generative spoken language modeling from raw audio.Transactions of the Association for Computational Linguistics, 9:1336–1354, 2021

Kushal Lakhotia, Eugene Kharitonov, Wei-Ning Hsu, Yossi Adi, Adam Polyak, Benjamin Bolte, Tu-Anh Nguyen, Jade Copet, Alexei Baevski, Abdelrahman Mohamed, and Emmanuel Dupoux. On generative spoken language modeling from raw audio.Transactions of the Association for Computational Linguistics, 9:1336–1354, 2021

work page 2021

[10] [10]

Weiss, Navdeep Jaitly, Zongheng Yang, Ying Xiao, Zhifeng Chen, Samy Bengio, Quoc Le, Yannis Agiomyrgiannakis, Rob Clark, and Rif A

Yuxuan Wang, RJ Skerry-Ryan, Daisy Stanton, Yonghui Wu, Ron J. Weiss, Navdeep Jaitly, Zongheng Yang, Ying Xiao, Zhifeng Chen, Samy Bengio, Quoc Le, Yannis Agiomyrgiannakis, Rob Clark, and Rif A. Saurous. Tacotron: Towards end-to-end speech synthesis. 2017

work page 2017

[11] [11]

Towards continuous speech recognition using surface electromyography

Szu-Chen Jou, Tanja Schultz, Matthias Walliczek, Florian Kraft, and Alex Waibel. Towards continuous speech recognition using surface electromyography. InNinth International Conference on Spoken Language Processing, 2006

work page 2006

[12] [12]

Non- invasive silent speech recognition in multiple sclerosis with dysphonia

Arnav Kapur, Utkarsh Sarawgi, Eric Wadkins, Matthew Wu, Nora Hollenstein, and Pattie Maes. Non- invasive silent speech recognition in multiple sclerosis with dysphonia. InMachine Learning for Health Workshop, pages 25–38. PMLR, 2020

work page 2020

[13] [13]

Development of semg sensors and algorithms for silent speech recognition.Journal of neural engineering, 15(4):046031, 2018

Geoffrey S Meltzner, James T Heaton, Yunbin Deng, Gianluca De Luca, Serge H Roy, and Joshua C Kline. Development of semg sensors and algorithms for silent speech recognition.Journal of neural engineering, 15(4):046031, 2018

work page 2018

[14] [14]

Toth, Michael Wand, and Tanja Schultz

Arthur R. Toth, Michael Wand, and Tanja Schultz. Synthesizing speech from electromyography using voice transformation techniques. InInterspeech 2009, pages 652–655, 2009

[15]

Matthias Janke and Lorenz Diener. Emg-to-speech: Direct generation of speech from facial electromyographic signals. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 25(12):2375–2385, 2017

[16]

Lorenz Diener, Gerrit Felsch, Miguel Angrick, and Tanja Schultz. Session-independent array-based emg-to-speech conversion using convolutional neural networks. In Speech Communication; 13th ITG-Symposium, pages 1–5, 2018

[17]

Patrick Kaifosh, Thomas R. Reardon, and CTRL labs at Reality Labs. A generic non-invasive neuromotor interface for human-computer interaction. Nature, 645:702–711, 2025

[18]

Alexei Baevski, Yuhao Zhou, Abdelrahman Mohamed, and Michael Auli. wav2vec 2.0: A framework for self-supervised learning of speech representations. Advances in neural information processing systems, 33:12449–12460, 2020

[19]

Wei-Ning Hsu, Benjamin Bolte, Yao-Hung Hubert Tsai, Kushal Lakhotia, Ruslan Salakhutdinov, and Abdelrahman Mohamed. Hubert: Self-supervised speech representation learning by masked prediction of hidden units. IEEE/ACM transactions on audio, speech, and language processing, 29:3451–3460, 2021

[20]

Sanyuan Chen, Chengyi Wang, Zhengyang Chen, Yu Wu, Shujie Liu, Zhuo Chen, Jinyu Li, Naoyuki Kanda, Takuya Yoshioka, Xiong Xiao, et al. Wavlm: Large-scale self-supervised pre-training for full stack speech processing. IEEE Journal of Selected Topics in Signal Processing, 16(6):1505–1518, 2022

[21]

Viswanath Sivakumar, Jeffrey Seely, Alan Du, Sean R Bittner, Adam Berenzweig, Anuoluwapo Bolarinwa, Alexandre Gramfort, and Michael I Mandel. emg2qwerty: A large dataset with baselines for touch typing using surface electromyography. In The Thirty-eight Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2024

[22]

Leonard Kaufman and Peter J. Rousseeuw. Partitioning around medoids (program pam). In Wiley Series in Probability and Statistics, pages 68–125. John Wiley & Sons, Inc., Hoboken, NJ, USA, March 8 1990. Retrieved 2021-06-13

[23]

D. M. Halliday and S. F. Farmer. On the need for rectification of surface emg. Journal of Neurophysiology, 103(6):3547, June 2010

[24]

Cheol Jun Cho, Peter Wu, Abdelrahman Mohamed, and Gopala K Anumanchipalli. Evidence of vocal tract articulation in self-supervised learning of speech. In ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1–5. IEEE, 2023

[25]

Cheol Jun Cho, Abdelrahman Mohamed, Alan W Black, and Gopala K Anumanchipalli. Self-supervised models of speech infer universal articulatory kinematics. In ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 12061–12065. IEEE, 2024

[26]

Alex Graves, Santiago Fernández, Faustino Gomez, and Jürgen Schmidhuber. Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks. In Proceedings of the 23rd international conference on Machine learning, pages 369–376, 2006
