
arxiv: 2605.24313 · v1 · pith:CTCPAXY2new · submitted 2026-05-23 · 💻 cs.CL · cs.HC

End-to-End Intracortical Speech Decoding from Neural Activity

Pith reviewed 2026-06-30 13:55 UTC · model grok-4.3

classification 💻 cs.CL cs.HC
keywords intracortical speech decoding · end-to-end neural decoder · character error rate · brain-computer interface · ALS participant · Conformer model · speech neuroprosthesis · word boundary segmentation

The pith

An end-to-end Conformer decoder extracts character sequences from intracortical brain signals at a character error rate of 23.80 percent, without any external language model.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether meaningful character-level output can be obtained directly from intracortical recordings using only a neural decoder, without an external language model to assist inference. It trains the decoder on data from one ALS participant and measures performance on held-out validation data, reporting a character error rate of 23.80 percent. This setup matters because it avoids the memory, computation, and latency costs of an external language model while still producing character output that can serve as input to later linguistic stages. The results indicate that the neural signal itself carries enough information for direct decoding, with most errors traced to word-boundary mistakes rather than letter confusions.

Core claim

An end-to-end Conformer-based neural decoder trained directly on intracortical recordings from a participant with ALS achieves a character error rate of 23.80 percent on held-out validation data without any external language model. Performance variability stems mainly from inter-session signal degradation, and the dominant error type is incorrect word boundary segmentation. These outcomes establish that effective character-level decoding is possible in a fully end-to-end framework and that the decoded neural signal supplies a strong foundation for downstream linguistic processing.
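
For readers who want the headline metric made concrete, the sketch below computes a character error rate as Levenshtein edit distance divided by reference length. It is a minimal illustration, not the paper's evaluation code: the function names are hypothetical, and pooling edits over the whole corpus (rather than averaging per-utterance CERs) is an assumption the page does not confirm.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance: insertions, deletions, substitutions all cost 1."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # delete a reference character
                            curr[j - 1] + 1,      # insert a hypothesis character
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def cer(refs, hyps) -> float:
    """Corpus-level CER (assumption): total edits / total reference characters."""
    total_edits = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    total_chars = sum(len(r) for r in refs)
    return total_edits / total_chars


# One dropped letter in an 11-character reference.
print(f"{cer(['hello world'], ['helo world']):.4f}")  # 0.0909
```

Under this definition, a CER of 23.80 percent corresponds to roughly 24 character edits per 100 reference characters. Because insertions count as errors, per-utterance CER can exceed 100 percent.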

What carries the argument

The end-to-end Conformer-based neural decoder trained directly on intracortical recordings, which maps raw neural activity to character sequences without intermediate language-model correction.
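
The Methods excerpt further down this page says the encoder is trained with a CTC objective. As a hedged illustration of how a CTC-trained model's frame-level outputs can become a character string without a language model, the sketch below applies greedy (best-path) decoding: collapse consecutive repeats, then drop blanks. The page does not say which decoding rule the authors use, so greedy decoding, the `blank_id` value, and the toy vocabulary are all assumptions.

```python
def ctc_greedy_decode(frame_ids, id_to_char, blank_id=0):
    """Best-path CTC decoding over per-frame argmax label ids.

    Assumption: `frame_ids` already holds the most likely label per frame;
    `blank_id` and `id_to_char` are hypothetical, not taken from the paper.
    """
    chars = []
    prev = None
    for idx in frame_ids:
        # Emit a character only when the label changes and is not the blank.
        if idx != prev and idx != blank_id:
            chars.append(id_to_char[idx])
        prev = idx
    return "".join(chars)


vocab = {1: "h", 2: "i", 3: " "}  # toy vocabulary for illustration
print(ctc_greedy_decode([0, 1, 1, 0, 2, 2, 2, 0], vocab))  # hi
```

A blank between two identical labels keeps them distinct, which is how a CTC output can spell doubled letters.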

If this is right

  • Character sequences can be produced from neural activity alone, removing the need for an external language model at inference time.
  • The decoded output remains usable as input to any later language-processing stage.
  • Inter-session signal changes are the primary driver of performance drops, pointing to signal stability as the next limiting factor.
  • Word-boundary errors dominate over letter-level mistakes, suggesting boundary detection as a high-value target for further refinement.
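
One rough way to probe the word-boundary claim is to compare the edit distance on the full strings with the edit distance after stripping spaces from both. The sketch below does that. It is a crude proxy under stated assumptions, not the paper's error analysis: the function names are hypothetical, and attributing the difference to "boundary edits" ignores cases where a space is substituted for a letter.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance with unit costs."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,
                            curr[j - 1] + 1,
                            prev[j - 1] + (r != h)))
        prev = curr
    return prev[-1]


def boundary_vs_letter_edits(ref: str, hyp: str) -> dict:
    """Split total edits into letter edits and (approximate) boundary edits."""
    total = edit_distance(ref, hyp)
    # Remove spaces so only letter-level differences remain.
    letters = edit_distance(ref.replace(" ", ""), hyp.replace(" ", ""))
    return {"total": total, "letters": letters, "boundary": total - letters}


print(boundary_vs_letter_edits("i am here", "iam here"))
# {'total': 1, 'letters': 0, 'boundary': 1}
```

If boundary mistakes dominate, as the paper reports, the space-stripped count should be clearly smaller than the total across the validation set.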

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If signal stability across sessions can be improved through hardware or preprocessing changes, the same decoder architecture would likely show lower error rates on new data.
  • The end-to-end character stream could be fed into existing language models as an additional input rather than replaced by them, potentially combining the strengths of both.
  • The approach isolates the contribution of the raw neural signal, allowing direct comparison of decoder performance across different recording sites or participant groups without confounding language-model effects.

Load-bearing premise

Recordings from a single participant contain enough stable information that a decoder trained on some sessions will continue to work on held-out sessions despite changes in the recorded signal.

What would settle it

If the same decoder architecture, retrained and tested on additional held-out sessions from the same participant, yielded character error rates near 100 percent, that would show the reported performance does not generalize beyond the specific sessions used.

Figures

Figures reproduced from arXiv: 2605.24313 by Alberto Galdon, Gonzalo Olivares Granados, Jose A. Gonzalez-Lopez, Marc Ouellet, Owais Mujtaba Khanday.

Figure 1: Overview of the proposed Conformer-based intracortical speech decoding architecture.
… in neural firing patterns, and other biological factors, often degrading cross-session generalization. Prior work has addressed this through recalibration [44], manifold alignment [45], and lightweight adaptation layers [27, 37]. Finally, Conformer architectures [33] have recently emerged as effective encoders for neural…
Figure 2: Mean CER per recording session.
Figure 3: Distribution of CER across validation utterances.
… architecture. Training without data augmentation results in a performance drop of 15.75% in CER, confirming that augmentation plays a key role in improving robustness and generalization. Given that the full Conformer model achieves the best performance, the following analyses focus exclusively on this configuration. 4.2. Session-wise Performance Variabi…
Figure 4: Mean CER as a function of utterance length (in characters). The shaded area represents ± one standard deviation, and the dashed line indicates a linear fit.
Original abstract

Current high-performing intracortical speech neuroprostheses achieve low word error rates but typically rely on external language models during inference, increasing memory, computation, and latency. In this work, we investigate whether meaningful character-level decoding is achievable without such models. We propose an end-to-end Conformer-based neural decoder trained directly on intracortical recordings from a participant with amyotrophic lateral sclerosis (ALS). Without any external language model, the system achieves a character error rate (CER) of 23.80% on held-out validation data. Analysis shows that performance variability is driven by inter-session signal degradation, while dominant errors arise from incorrect word boundary segmentation. These results demonstrate that effective character-level decoding is possible in a fully end-to-end framework, providing a strong neural signal for downstream linguistic processing.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The manuscript presents an end-to-end Conformer-based neural decoder trained directly on intracortical recordings from a single ALS participant. It reports a character error rate of 23.80% on held-out validation data without any external language model, attributes performance variability to inter-session signal degradation, identifies word-boundary segmentation as the dominant error type, and concludes that effective character-level decoding is achievable in a fully end-to-end framework, yielding a strong neural signal for downstream linguistic processing.

Significance. If the held-out validation set is demonstrably session-disjoint, the result would be significant because it establishes that usable character-level decoding is possible without an external LM, directly addressing latency, memory, and compute concerns in intracortical speech neuroprostheses. The explicit reporting of a numeric CER on held-out data and the error analysis constitute concrete, falsifiable claims that strengthen the contribution relative to LM-dependent baselines.

major comments (1)
  1. [Abstract / Methods] Abstract and Methods: The claim that the 23.80% CER on held-out validation data reflects a 'strong neural signal' independent of session effects is load-bearing, yet the manuscript provides no explicit description of how the train/validation split respects session boundaries. Because the abstract itself states that performance variability is driven by inter-session signal degradation, it is necessary to verify that validation utterances come from temporally and session-disjoint blocks; otherwise the reported CER could be inflated by shared non-stationarities rather than stable neural information.
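
To make the referee's request concrete, the sketch below shows one way a session-disjoint split could be built and checked: group utterances by session, hold out the chronologically last sessions, and verify that no session appears on both sides. It is an illustration only. The record layout, the function names, and the assumption that session identifiers sort chronologically are hypothetical, and nothing here describes how the paper or its benchmark actually partitions the data.

```python
from collections import defaultdict


def split_by_session(records, n_val_sessions):
    """Hold out the last `n_val_sessions` sessions for validation.

    `records` is an iterable of (session_id, utterance) pairs. Assumption:
    session ids (e.g. ISO dates) sort in chronological order.
    """
    by_session = defaultdict(list)
    for session_id, utterance in records:
        by_session[session_id].append(utterance)
    ordered = sorted(by_session)
    if not 0 < n_val_sessions < len(ordered):
        raise ValueError("need at least one training and one validation session")
    train_ids = ordered[:-n_val_sessions]
    val_ids = ordered[-n_val_sessions:]
    train = [(s, u) for s in train_ids for u in by_session[s]]
    val = [(s, u) for s in val_ids for u in by_session[s]]
    return train, val


def shared_sessions(train, val):
    """Return session ids present in both splits (empty set if disjoint)."""
    return {s for s, _ in train} & {s for s, _ in val}
```

An empty result from `shared_sessions` is the property the referee asks the authors to document; a non-empty result would mean validation utterances share sessions with training data.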
minor comments (2)
  1. The manuscript should report model hyperparameters, training procedure, data-split statistics (number of sessions, utterances per split), and any statistical significance testing around the 23.80% CER to allow independent assessment of the result.
  2. A figure or table presenting per-session CER values would directly support the inter-session degradation analysis and make the variability claim more transparent.
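
A per-session summary of the kind requested in the second minor comment can be produced from utterance-level results with a simple grouping step. The sketch below assumes each result is a hypothetical (session_id, cer_value) pair; the names and data layout are illustrative and not drawn from the paper.

```python
from collections import defaultdict
from statistics import mean


def per_session_cer(utterance_results):
    """Map each session id to (mean CER, utterance count).

    `utterance_results` is an iterable of (session_id, cer_value) pairs;
    this layout is an assumption made for illustration.
    """
    grouped = defaultdict(list)
    for session_id, cer_value in utterance_results:
        grouped[session_id].append(cer_value)
    # Sort by session id so the table reads in a stable order.
    return {s: (mean(v), len(v)) for s, v in sorted(grouped.items())}


results = [("s1", 0.25), ("s1", 0.75), ("s2", 0.5)]
for session, (avg, n) in per_session_cer(results).items():
    print(session, f"mean CER={avg:.2f}", f"n={n}")
```

Reporting the utterance count alongside each mean helps a reader judge whether a high-error session reflects signal degradation or simply a small sample.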

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on the data splitting procedure. We address the single major comment below.

Point-by-point responses
  1. Referee: [Abstract / Methods] Abstract and Methods: The claim that the 23.80% CER on held-out validation data reflects a 'strong neural signal' independent of session effects is load-bearing, yet the manuscript provides no explicit description of how the train/validation split respects session boundaries. Because the abstract itself states that performance variability is driven by inter-session signal degradation, it is necessary to verify that validation utterances come from temporally and session-disjoint blocks; otherwise the reported CER could be inflated by shared non-stationarities rather than stable neural information.

    Authors: We agree that an explicit description of the session-disjoint nature of the split is required to support the interpretation of the reported CER. The current manuscript does not provide this level of detail in the Methods section. In the revision we will add a clear statement that the train/validation partition was performed at the session level, with all validation utterances drawn from temporally later sessions that share no overlap with the training sessions. This procedure was chosen so that the inter-session signal degradation highlighted in the abstract is part of what is evaluated, ensuring the CER reflects cross-session generalization rather than non-stationarities shared within a session. revision: yes

Circularity Check

0 steps flagged

No circularity in empirical reporting of held-out performance

The paper presents an empirical result: a Conformer model trained on intracortical recordings achieves 23.80% CER on held-out validation data without an external language model. This is a direct measurement on data not used in training, with no mathematical derivation chain, no parameters fitted to a subset then renamed as predictions, and no load-bearing self-citations or uniqueness theorems invoked. The central claim rests on observable performance metrics rather than any reduction to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that intracortical signals from one participant suffice for character-level decoding; no free parameters or invented entities are mentioned in the abstract.

axioms (1)
  • Domain assumption: Intracortical neural recordings from a single ALS participant contain sufficient information for meaningful character-level speech decoding without external linguistic models.
    This premise is required for the end-to-end training claim to hold and is invoked throughout the abstract.

pith-pipeline@v0.9.1-grok · 5678 in / 1239 out tokens · 42591 ms · 2026-06-30T13:55:08.873483+00:00 · methodology


Reference graph

Works this paper leans on

53 extracted references · 3 canonical work pages · 2 internal anchors

  1. Introduction. Neural speech prostheses [1, 2] represent one of the most ambitious frontiers in modern neuroscience and biomedical engineering, offering the prospect of restoring lost communication to individuals with severe neurological conditions [3, 4, 5, 6]. Among the populations who stand to benefit most are those affected by amyotrophic lateral ...

  2. Related Work. The decoding of speech and language from neural signals has progressed rapidly across multiple recording modalities [39, 40, 41]. Early work with ECoG demonstrated that neural activity in speech-related cortical regions contains sufficient information to reconstruct acoustic features and classify phonemes [19, 41]. Sequence-to-sequence appr...

  3. Methods. We evaluate an end-to-end intracortical speech decoder on the public Brain-to-Text ’25 benchmark. The proposed pipeline, depicted in Fig. 1, first applies a session-specific alignment layer to the neural features, followed by temporal patch embedding and a Conformer encoder that predicts character sequences with a CTC objective. During training,...

  4. Results. In this section, we report the performance of the proposed model on the Brain-to-Text ’25 benchmark, focusing on the validation set (1,426 sentences), and analyze the main factors influencing its behavior. 4.1. Overall Performance and Model Comparison. We first compare in Table 3 the proposed Conformer-based model against the baseline provided ...

  5. Conclusion. In this work, we presented an end-to-end Conformer-based decoder for intracortical speech neuroprostheses that directly maps neural activity to character sequences. By combining dataset augmentation, a session-specific alignment layer, temporal patch embedding, and a Conformer encoder trained with a CTC objective and entropy regularization, t...

  6. Acknowledgement. This work was supported by grants PID2022-141378OB-C22 and AIA2025-163317-C32 funded by MICIU/AEI/10.13039/501100011033 and ERDF/EU

  7. E. F. Chang, “Brain-computer interfaces for restoring communication,” New England Journal of Medicine, vol. 391, no. 7, pp. 654–657, 2024

  8. A. B. Silva, K. T. Littlejohn, J. R. Liu, D. A. Moses, and E. F. Chang, “The speech neuroprosthesis,” Nature Reviews Neuroscience, vol. 25, no. 7, pp. 473–492, 2024

  9. L. R. Hochberg, M. D. Serruya, G. M. Friehs, J. A. Mukand, M. Saleh, A. H. Caplan, A. Branner, D. Chen, R. D. Penn, and J. P. Donoghue, “Neuronal ensemble control of prosthetic devices by a human with tetraplegia,” Nature, vol. 442, no. 7099, pp. 164–171, 2006

  10. K. V. Shenoy, M. Sahani, and M. M. Churchland, “Cortical control of arm movements: A dynamical systems perspective,” Annual Review of Neuroscience, vol. 36, pp. 337–359, 2013

  11. R. A. Andersen, J. W. Burdick, S. Musallam, B. Pesaran, and J. G. Cham, “Cognitive neural prosthetics,” Trends in Cognitive Sciences, vol. 8, no. 11, pp. 486–493, 2004

  12. J. P. Donoghue, “Connecting cortex to machines: Recent advances in brain interfaces,” Nature Neuroscience, vol. 5, pp. 1085–1088, 2002

  13. N. Birbaumer, N. Ghanayim, T. Hinterberger, I. Iversen, B. Kotchoubey, A. Kübler, J. Perelmouter, E. Taub, and H. Flor, “A spelling device for the paralysed,” Nature, vol. 398, no. 6725, pp. 297–298, 1999

  14. U. Chaudhary, N. Birbaumer, and A. Ramos-Murguialday, “Brain-computer interfaces for communication and rehabilitation,” Nature Reviews Neurology, vol. 12, no. 9, pp. 513–525, 2016

  15. M. J. Vansteensel, E. G. M. Pels, M. G. Bleichner, M. P. Branco, T. Denison, Z. V. Freudenburg, P. Gosselaar, S. Leinders, T. H. Ottens, M. A. Van Den Boom, P. C. Van Rijen, E. J. Aarnoutse, and N. F. Ramsey, “Fully implanted brain-computer interface in a locked-in patient with ALS,” New England Journal of Medicine, vol. 375, no. 21, pp. 2060–2066, 2016

  16. J. R. Wolpaw, N. Birbaumer, D. J. McFarland, G. Pfurtscheller, and T. M. Vaughan, “Brain-computer interfaces for communication and control,” Clinical Neurophysiology, vol. 113, no. 6, pp. 767–791, 2002

  17. M. A. Lebedev and M. A. L. Nicolelis, “Brain-machine interfaces: Past, present and future,” Trends in Neurosciences, vol. 29, no. 9, pp. 536–546, 2006

  18. E. C. Leuthardt, G. Schalk, J. R. Wolpaw, J. G. Ojemann, and D. W. Moran, “A brain-computer interface using electrocorticographic signals in humans,” Journal of Neural Engineering, vol. 1, no. 2, pp. 63–71, 2004

  19. G. Wang, C. Teng, K. Li, Z. Zhang, and Y. Chai, “The open dataset of EEG motor imagery: BCI motor imagery data from healthy subjects,” Frontiers in Neuroscience, vol. 16, p. 1044299, 2022

  20. J. Tang, A. LeBel, S. Jain, and A. G. Huth, “Semantic reconstruction of continuous language from non-invasive brain recordings,” Nature Neuroscience, vol. 26, no. 5, pp. 858–866, 2023

  21. M. Nakanishi, Y. Wang, X. Chen, Y.-T. Wang, X. Gao, and T.-P. Jung, “Enhancing detection of SSVEPs for a high-speed brain speller using task-related component analysis,” IEEE Transactions on Biomedical Engineering, vol. 65, no. 1, pp. 104–112, 2018

  22. R. Abiri, S. Borhani, E. W. Sellers, Y. Jiang, and X. Zhao, “A comprehensive review of EEG-based brain-computer interface paradigms,” Journal of Neural Engineering, vol. 16, no. 1, p. 011001, 2019

  23. J. G. Makin, D. A. Moses, and E. F. Chang, “Machine translation of cortical activity to text with an encoder-decoder framework,” Nature Neuroscience, vol. 23, no. 4, pp. 575–582, 2020

  24. D. A. Moses, S. L. Metzger, J. R. Liu, G. K. Anumanchipalli, J. G. Makin, P. F. Sun, J. Chartier, M. E. Dougherty, P. M. Liu, G. M. Abrams, A. Tu-Chan, K. Ganguly, and E. F. Chang, “Neuroprosthesis for decoding speech in a paralyzed person with anarthria,” New England Journal of Medicine, vol. 385, no. 3, pp. 217–227, 2021

  25. C. Herff, D. Heger, A. De Pesters, D. Telaar, P. Brunner, G. Schalk, and T. Schultz, “Brain-to-text: Decoding spoken phrases from phone representations in the brain,” Frontiers in Neuroscience, vol. 9, p. 217, 2015

  26. M. Angrick, C. Herff, E. Mugler, M. C. Tate, M. W. Slutzky, D. J. Krusienski, and T. Schultz, “Speech synthesis from ECoG using densely connected 3D convolutional neural networks,” Journal of Neural Engineering, vol. 16, no. 3, p. 036019, 2019

  27. C. Pandarinath, P. Nuyujukian, C. H. Blabe, B. L. Sorice, J. Saab, F. R. Willett, L. R. Hochberg, K. V. Shenoy, and J. M. Henderson, “High performance communication by people with paralysis using an intracortical brain-computer interface,” eLife, vol. 6, p. e18554, 2017

  28. V. Gilja, C. Pandarinath, C. H. Blabe, P. Nuyujukian, J. D. Simeral, A. A. Sarma, B. L. Sorice, J. A. Perge, B. Jarosiewicz, L. R. Hochberg, K. V. Shenoy, and J. M. Henderson, “Clinical translation of a high-performance neural prosthesis,” Nature Medicine, vol. 21, no. 10, pp. 1142–1145, 2015

  29. L. R. Hochberg, D. Bacher, B. Jarosiewicz, N. Y. Masse, J. D. Simeral, J. Vogel, S. Haddadin, J. Liu, S. S. Cash, P. van der Smagt, and J. P. Donoghue, “Reach and grasp by people with tetraplegia using a neurally controlled robotic arm,” Nature, vol. 485, no. 7398, pp. 372–375, 2012

  30. E. M. Trautmann, S. D. Stavisky, S. Lahiri, K. C. Ames, M. T. Kaufman, D. J. O’Shea, S. Vyas, X. Sun, I. Bhowmick, S. Bhowmick, B. M. Yu, N. Even-Chen, J. M. Henderson, and K. V. Shenoy, “Accurate estimation of neural population dynamics without spike sorting,” Neuron, vol. 103, no. 2, pp. 292–308, 2019

  31. F. R. Willett, D. T. Avansino, L. R. Hochberg, J. M. Henderson, and K. V. Shenoy, “High-performance brain-to-text communication via handwriting,” Nature, vol. 593, no. 7858, pp. 249–254, 2021

  32. F. R. Willett, E. M. Kunz, C. Fan, D. T. Avansino, G. H. Wilson, E. Y. Choi, F. Kamdar, L. R. Hochberg, J. M. Henderson, P. Bhatt, P. Rezaii, and K. V. Shenoy, “A high-performance speech neuroprosthesis,” Nature, vol. 620, no. 7976, pp. 1031–1036, 2023

  33. N. S. Card, M. Wairagkar, C. Iacono, P. Bhatt, T. Singer-Clark, F. R. Willett, K. C. Ames, J. Liu, P. Rezaii, L. R. Hochberg, J. M. Henderson, K. V. Shenoy, and D. M. Brandman, “An accurate and rapidly calibrating speech neuroprosthesis,” New England Journal of Medicine, vol. 391, no. 7, pp. 609–618, 2024

  34. A. Baevski, H. Zhou, A. Mohamed, and M. Auli, “wav2vec 2.0: A framework for self-supervised learning of speech representations,” Advances in Neural Information Processing Systems, vol. 33, pp. 12449–12460, 2020

  35. A. Radford, J. W. Kim, T. Xu, G. Brockman, C. McLeavey, and I. Sutskever, “Robust speech recognition via large-scale weak supervision,” Proceedings of the International Conference on Machine Learning, pp. 28492–28518, 2023

  36. A. Graves, S. Fernández, F. Gomez, and J. Schmidhuber, “Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks,” Proceedings of the International Conference on Machine Learning, pp. 369–376, 2006

  37. A. Hannun, C. Case, J. Casper, B. Catanzaro, G. Diamos, E. Elsen, R. Prenger, S. Satheesh, S. Sengupta, A. Coates, and A. Y. Ng, “Deep speech: Scaling up end-to-end speech recognition,” arXiv preprint arXiv:1412.5567, 2014

  38. J. C. Kao, P. Nuyujukian, S. I. Ryu, M. M. Churchland, J. P. Cunningham, and K. V. Shenoy, “Single-trial dynamics of motor cortex and their applications to brain-machine interfaces,” Nature Communications, vol. 6, p. 7759, 2015

  39. A. Gulati, J. Qin, C.-C. Chiu, N. Parmar, Y. Zhang, J. Yu, W. Han, S. Wang, Z. Zhang, Y. Wu, and R. Pang, “Conformer: Convolution-augmented transformer for speech recognition,” in Proc. Interspeech, 2020, pp. 5036–5040

  40. J. D. Simeral, S.-P. Kim, M. J. Black, J. P. Donoghue, and L. R. Hochberg, “Neural control of cursor trajectory and click by a human with tetraplegia 1000 days after implant of an intracortical microelectrode array,” Journal of Neural Engineering, vol. 8, no. 2, p. 025027, 2011

  41. J. A. Gallego, M. G. Perich, L. E. Miller, and S. A. Solla, “Neural manifolds for the control of movement,” Neuron, vol. 94, no. 5, pp. 978–984, 2017

  42. C. A. Chestek, V. Gilja, P. Nuyujukian, J. D. Foster, J. M. Fan, M. T. Kaufman, M. M. Churchland, Z. Rivera-Alvidrez, J. P. Cunningham, S. I. Ryu, and K. V. Shenoy, “Single-unit stability using chronically implanted multielectrode arrays in motor cortex of macaque monkeys,” Journal of Neurophysiology, vol. 105, no. 2, pp. 567–579, 2011

  43. J. Li, V. Lavrukhin, B. Ginsburg, R. Leary, O. Kuchaiev, J. M. Cohen, H. Nguyen, and R. T. Gadde, “Jasper: An end-to-end convolutional neural acoustic model,” in Proc. Interspeech, 2019, pp. 71–75

  44. D. S. Park, W. Chan, Y. Zhang, C.-C. Chiu, B. Zoph, E. D. Cubuk, and Q. V. Le, “SpecAugment: A simple data augmentation method for automatic speech recognition,” in Proc. Interspeech, 2019, pp. 2613–2617

  45. J. S. Brumberg, A. Nieto-Castanon, P. R. Kennedy, and F. H. Guenther, “Brain-computer interfaces for speech communication,” Speech Communication, vol. 52, no. 4, pp. 367–379, 2010

  46. S. Martin, P. Brunner, C. Holdgraf, H.-J. Heinze, N. E. Crone, J. Rieger, G. Schalk, R. T. Knight, and B. N. Pasley, “Decoding spectrotemporal features of overt and covert speech from the human cortex,” Frontiers in Neuroengineering, vol. 7, p. 14, 2014

  47. G. K. Anumanchipalli, J. Chartier, and E. F. Chang, “Speech synthesis from neural decoding of spoken sentences,” Nature, vol. 568, no. 7753, pp. 493–498, 2019

  48. S. L. Metzger, K. T. Littlejohn, A. B. Silva, D. A. Moses, M. P. Seaton, R. Wang, M. E. Dougherty, J. R. Liu, P. Wu, M. A. Berger, I. Zhuravleva, A. Tu-Chan, K. Ganguly, G. K. Anumanchipalli, and E. F. Chang, “A high-performance neuroprosthesis for speech decoding and avatar control,” Nature, vol. 620, no. 7976, pp. 1037–1046, 2023

  49. W. Chan, N. Jaitly, Q. Le, and O. Vinyals, “Listen, attend and spell: A neural network for large vocabulary conversational speech recognition,” in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2016, pp. 4960–4964

  50. B. Jarosiewicz, A. A. Sarma, D. Bacher, N. Y. Masse, J. D. Simeral, B. Sorice, E. M. Oakley, C. Blabe, C. Pandarinath, V. Gilja, S. S. Cash, E. N. Eskandar, G. Friehs, J. M. Henderson, K. V. Shenoy, J. P. Donoghue, and L. R. Hochberg, “Virtual typing by people with tetraplegia using a self-calibrating intracortical brain-computer interface,” Science...

  51. A. D. Degenhart, W. E. Bishop, E. R. Oby, E. C. Tyler-Kabara, S. M. Chase, A. P. Batista, and B. M. Yu, “Stabilization of a brain-computer interface via the alignment of low-dimensional spaces of neural activity,” Nature Biomedical Engineering, vol. 4, no. 7, pp. 672–685, 2020

  52. A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, and N. Houlsby, “An image is worth 16x16 words: Transformers for image recognition at scale,” in Proc. International Conference on Learning Representations, 2021

  53. Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin, and B. Guo, “Swin transformer: Hierarchical vision transformer using shifted windows,” in Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2021, pp. 10012–10022