End-to-End Intracortical Speech Decoding from Neural Activity

Alberto Galdon; Gonzalo Olivares Granados; Jose A. Gonzalez-Lopez; Marc Ouellet; Owais Mujtaba Khanday

arxiv: 2605.24313 · v1 · pith:CTCPAXY2new · submitted 2026-05-23 · 💻 cs.CL · cs.HC

Owais Mujtaba Khanday, Jose A. Gonzalez-Lopez, Marc Ouellet, Alberto Galdon, Gonzalo Olivares Granados

Pith reviewed 2026-06-30 13:55 UTC · model grok-4.3

keywords: intracortical speech decoding · end-to-end neural decoder · character error rate · brain-computer interface · ALS participant · Conformer model · speech neuroprosthesis · word boundary segmentation

The pith

An end-to-end Conformer decoder extracts character sequences from intracortical brain signals at a 23.80 percent character error rate without any external language model.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether meaningful character-level output can be obtained directly from intracortical recordings using only a neural decoder, without an external language model to assist inference. It trains the decoder on data from one ALS participant and measures performance on held-out validation data, reporting a character error rate of 23.80 percent. This setup matters because it removes the added memory, computation, and latency costs of a language model while still producing usable character output that can serve as input to later linguistic stages. The results indicate that the neural signal itself carries enough information for direct decoding, with most errors traced to word-boundary mistakes rather than letter confusions.

Core claim

An end-to-end Conformer-based neural decoder trained directly on intracortical recordings from a participant with ALS achieves a character error rate of 23.80 percent on held-out validation data without any external language model. Performance variability stems mainly from inter-session signal degradation, and the dominant error type is incorrect word boundary segmentation. These outcomes establish that effective character-level decoding is possible in a fully end-to-end framework and that the decoded neural signal supplies a strong foundation for downstream linguistic processing.
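The headline metric can be made concrete with a minimal sketch. The code below assumes a corpus-level definition of character error rate: total Levenshtein edits divided by total reference characters. The paper may instead average per-utterance CER, and the names `edit_distance` and `corpus_cer` are illustrative, not the authors' code.

```python
def edit_distance(reference: str, hypothesis: str) -> int:
    """Levenshtein distance (insertions, deletions, substitutions) between two strings."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            substitution_cost = 0 if ref_char == hyp_char else 1
            current.append(min(
                previous[j] + 1,                      # delete a reference character
                current[j - 1] + 1,                   # insert a hypothesis character
                previous[j - 1] + substitution_cost,  # match or substitute
            ))
        previous = current
    return previous[-1]


def corpus_cer(pairs):
    """CER over (reference, hypothesis) pairs; assumes at least one non-empty reference."""
    total_edits = sum(edit_distance(ref, hyp) for ref, hyp in pairs)
    total_chars = sum(len(ref) for ref, _ in pairs)
    return total_edits / total_chars
```

Under this definition a space counts as a character, so a single missing word boundary in an 11-character reference costs one edit out of eleven.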

What carries the argument

The end-to-end Conformer-based neural decoder trained directly on intracortical recordings, which maps raw neural activity to character sequences without intermediate language-model correction.
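The Methods excerpt in the reference graph states that the encoder is trained with a CTC objective. How the authors decode at inference time is not stated on this page, so the following is only a sketch of the simplest option, greedy CTC decoding: take the best label per frame, merge consecutive repeats, then drop blanks. The alphabet layout (blank at index 0) and the function name are assumptions made for illustration.

```python
from itertools import groupby

BLANK_INDEX = 0  # assumption: index 0 of the alphabet is the CTC blank


def greedy_ctc_decode(frame_scores, alphabet):
    """Greedy CTC decoding.

    frame_scores: one list of per-symbol scores for each time frame.
    alphabet: symbols indexed consistently with the score lists.
    """
    # Best-scoring symbol index in every frame.
    best_path = [max(range(len(row)), key=row.__getitem__) for row in frame_scores]
    # Merge consecutive repeats, then remove blanks.
    collapsed = [index for index, _ in groupby(best_path)]
    return "".join(alphabet[i] for i in collapsed if i != BLANK_INDEX)
```

Because no language model rescoring happens in this scheme, a space is emitted only if the network itself predicts it in some frame, which is one plausible reason word boundaries are fragile in a setup without a language model.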

If this is right

Character sequences can be produced from neural activity alone, removing the need for an external language model at inference time.
The decoded output remains usable as input to any later language-processing stage.
Inter-session signal changes are the primary driver of performance drops, pointing to signal stability as the next limiting factor.
Word-boundary errors dominate over letter-level mistakes, suggesting boundary detection as a high-value target for further refinement.
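One way to check the claim that boundary errors outweigh letter errors is to align each reference with its hypothesis and tag every edit by whether it touches only spaces. The sketch below uses Python's `difflib.SequenceMatcher` for the alignment, which is a heuristic and not guaranteed to produce a minimal edit script; the paper's own error taxonomy is not described on this page, and `classify_edits` is a hypothetical helper.

```python
from difflib import SequenceMatcher


def classify_edits(reference: str, hypothesis: str) -> tuple[int, int]:
    """Return (boundary_edits, letter_edits) for one utterance.

    An edit counts as a boundary edit when the changed text on both
    sides consists only of spaces; anything else is a letter edit.
    """
    matcher = SequenceMatcher(None, reference, hypothesis, autojunk=False)
    boundary_edits = letter_edits = 0
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue
        changed_text = reference[i1:i2] + hypothesis[j1:j2]
        size = max(i2 - i1, j2 - j1)
        if changed_text.strip(" ") == "":
            boundary_edits += size
        else:
            letter_edits += size
    return boundary_edits, letter_edits
```

For example, `classify_edits("the cat sat", "thecat sat")` reports one boundary edit and no letter edits, while `classify_edits("the cat", "the bat")` reports the reverse. Mixed segments (a space and a letter changed together) are counted as letter edits here, which is a simplification.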

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

If signal stability across sessions can be improved through hardware or preprocessing changes, the same decoder architecture would likely show lower error rates on new data.
The end-to-end character stream could be fed into existing language models as an additional input rather than replaced by them, potentially combining the strengths of both.
The approach isolates the contribution of the raw neural signal, allowing direct comparison of decoder performance across different recording sites or participant groups without confounding language-model effects.

Load-bearing premise

Recordings from a single participant contain enough stable information that a decoder trained on some sessions will continue to work on held-out sessions despite changes in the recorded signal.

What would settle it

If the same decoder architecture, retrained and tested on additional held-out sessions from the same participant, yielded character error rates near 100 percent, that would show the reported performance does not generalize beyond the specific training sessions used.

Figures

Figures reproduced from arXiv: 2605.24313 by Alberto Galdon, Gonzalo Olivares Granados, Jose A. Gonzalez-Lopez, Marc Ouellet, Owais Mujtaba Khanday.

**Figure 1.** Overview of the proposed Conformer-based intracortical speech decoding architecture.

**Figure 2.** Mean CER per recording session.

**Figure 3.** Distribution of CER across validation utterances. (From the accompanying text: training without data augmentation results in a performance drop of 15.75% in CER, confirming that augmentation plays a key role in improving robustness and generalization. Given that the full Conformer model achieves the best performance, the following analyses focus exclusively on this configuration.)

**Figure 4.** Mean CER as a function of utterance length (in characters). The shaded area represents ± one standard deviation, and the dashed line indicates a linear fit.

Original abstract

Current high-performing intracortical speech neuroprostheses achieve low word error rates but typically rely on external language models during inference, increasing memory, computation, and latency. In this work, we investigate whether meaningful character-level decoding is achievable without such models. We propose an end-to-end Conformer-based neural decoder trained directly on intracortical recordings from a participant with amyotrophic lateral sclerosis (ALS). Without any external language model, the system achieves a character error rate (CER) of 23.80% on held-out validation data. Analysis shows that performance variability is driven by inter-session signal degradation, while dominant errors arise from incorrect word boundary segmentation. These results demonstrate that effective character-level decoding is possible in a fully end-to-end framework, providing a strong neural signal for downstream linguistic processing.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

End-to-end Conformer decoding hits 23.8% CER on intracortical signals without an LM, but it is unclear whether the held-out split is session-disjoint, which limits how far the result can be generalized.

The main point is that they trained a Conformer directly on intracortical recordings from one ALS participant and got 23.8% character error rate on held-out data with no external language model at all. That removes one source of added latency and memory, which is the concrete step forward relative to systems that keep the LM in the loop.

The paper does a straightforward job of showing the architecture choice and breaking down the errors. Word-boundary mistakes dominate, and they tie performance swings to inter-session signal changes. Those observations are useful for anyone thinking about real-world deployment.

The soft spot is the data split. The abstract flags inter-session degradation as a key driver of variability, yet gives no explicit statement that validation utterances come from completely separate sessions or blocks. If the held-out set shares temporal or session-specific signal features with training, the CER could look better than it would on truly new recordings. The stress-test concern lands here because the paper itself emphasizes those session effects. Single-participant data also caps how far the result travels.

This is for readers working on intracortical speech BCIs who want to test whether the LM can be dropped at the character stage. Someone already running similar pipelines would get value from the error patterns and the end-to-end baseline.

I would send it to peer review. The core demonstration is worth a closer look at the methods, especially the split procedure and any significance numbers, even if the current write-up leaves those details thin.

Referee Report

1 major / 2 minor

Summary. The manuscript presents an end-to-end Conformer-based neural decoder trained directly on intracortical recordings from a single ALS participant. It reports a character error rate of 23.80% on held-out validation data without any external language model, attributes performance variability to inter-session signal degradation, identifies word-boundary segmentation as the dominant error type, and concludes that effective character-level decoding is achievable in a fully end-to-end framework, yielding a strong neural signal for downstream linguistic processing.

Significance. If the held-out validation set is demonstrably session-disjoint, the result would be significant because it establishes that usable character-level decoding is possible without an external LM, directly addressing latency, memory, and compute concerns in intracortical speech neuroprostheses. The explicit reporting of a numeric CER on held-out data and the error analysis constitute concrete, falsifiable claims that strengthen the contribution relative to LM-dependent baselines.

major comments (1)

[Abstract / Methods] The claim that the 23.80% CER on held-out validation data reflects a 'strong neural signal' independent of session effects is load-bearing, yet the manuscript provides no explicit description of how the train/validation split respects session boundaries. Because the abstract itself states that performance variability is driven by inter-session signal degradation, it is necessary to verify that validation utterances come from temporally and session-disjoint blocks; otherwise the reported CER could be inflated by shared non-stationarities rather than stable neural information.
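The check the referee asks for can be sketched in a few lines. The code below assumes each utterance record carries a session identifier that sorts chronologically (ISO dates are used here as a stand-in); the field name `session` and both helper functions are hypothetical, and nothing on this page confirms how the benchmark's official split was constructed.

```python
def split_by_session(utterances, n_validation_sessions):
    """Hold out the chronologically last sessions as validation data.

    utterances: list of dicts with a sortable 'session' key (assumed field name).
    """
    sessions = sorted({u["session"] for u in utterances})
    if not 0 < n_validation_sessions < len(sessions):
        raise ValueError("need at least one training and one validation session")
    validation_sessions = set(sessions[-n_validation_sessions:])
    train = [u for u in utterances if u["session"] not in validation_sessions]
    validation = [u for u in utterances if u["session"] in validation_sessions]
    return train, validation


def shared_sessions(train, validation):
    """Sessions present in both partitions; an empty set means session-disjoint."""
    return {u["session"] for u in train} & {u["session"] for u in validation}
```

Running `shared_sessions` on an existing split is the quicker audit: any non-empty result means validation utterances share a recording session with training data, which is exactly the leakage path described in the comment above.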

minor comments (2)

The manuscript should report model hyperparameters, training procedure, data-split statistics (number of sessions, utterances per split), and any statistical significance testing around the 23.80% CER to allow independent assessment of the result.
A figure or table presenting per-session CER values would directly support the inter-session degradation analysis and make the variability claim more transparent.
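Both minor requests reduce to simple bookkeeping over per-utterance scores. The sketch below groups utterance-level CER values by session and attaches a percentile bootstrap interval to an overall mean. It assumes per-utterance CERs are already available as `(session_id, cer)` pairs; the function names, the resample count, and the choice of a percentile bootstrap are illustrative assumptions, not the authors' procedure.

```python
import random
from collections import defaultdict
from statistics import mean


def mean_cer_per_session(records):
    """records: iterable of (session_id, utterance_cer) pairs."""
    by_session = defaultdict(list)
    for session_id, cer in records:
        by_session[session_id].append(cer)
    return {session: mean(values) for session, values in sorted(by_session.items())}


def bootstrap_interval(values, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the mean of utterance-level CERs."""
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    n = len(values)
    resampled_means = sorted(
        mean(rng.choices(values, k=n)) for _ in range(n_resamples)
    )
    lower = resampled_means[int((alpha / 2) * n_resamples)]
    upper = resampled_means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper
```

Resampling utterances treats them as independent, which is optimistic when errors cluster by session; resampling whole sessions would give a more conservative interval.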

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on the data splitting procedure. We address the single major comment below.

Referee: [Abstract / Methods] The claim that the 23.80% CER on held-out validation data reflects a 'strong neural signal' independent of session effects is load-bearing, yet the manuscript provides no explicit description of how the train/validation split respects session boundaries. Because the abstract itself states that performance variability is driven by inter-session signal degradation, it is necessary to verify that validation utterances come from temporally and session-disjoint blocks; otherwise the reported CER could be inflated by shared non-stationarities rather than stable neural information.

Authors: We agree that an explicit description of the session-disjoint nature of the split is required to support the interpretation of the reported CER. The current manuscript does not provide this level of detail in the Methods section. In the revision we will add a clear statement that the train/validation partition was performed at the session level, with all validation utterances drawn from temporally later sessions that share no overlap with the training sessions. This procedure was chosen precisely to account for the inter-session signal degradation highlighted in the abstract and to ensure the CER reflects generalization across sessions rather than within-session non-stationarities.

Revision: yes

Circularity Check

0 steps flagged

No circularity in empirical reporting of held-out performance

The paper presents an empirical result: a Conformer model trained on intracortical recordings achieves 23.80% CER on held-out validation data without an external language model. This is a direct measurement on data not used in training, with no mathematical derivation chain, no parameters fitted to a subset then renamed as predictions, and no load-bearing self-citations or uniqueness theorems invoked. The central claim rests on observable performance metrics rather than any reduction to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that intracortical signals from one participant suffice for character-level decoding; no free parameters or invented entities are mentioned in the abstract.

axioms (1)

Domain assumption: Intracortical neural recordings from a single ALS participant contain sufficient information for meaningful character-level speech decoding without external linguistic models.
This premise is required for the end-to-end training claim to hold and is invoked throughout the abstract.

Reference graph

Works this paper leans on

53 extracted references · 3 canonical work pages · 2 internal anchors

[1]

Introduction. Neural speech prostheses [1, 2] represent one of the most ambitious frontiers in modern neuroscience and biomedical engineering, offering the prospect of restoring lost communication to individuals with severe neurological conditions [3, 4, 5, 6]. Among the populations who stand to benefit most are those affected by amyotrophic lateral ...
[2]

Related Work. The decoding of speech and language from neural signals has progressed rapidly across multiple recording modalities [39, 40, 41]. Early work with ECoG demonstrated that neural activity in speech-related cortical regions contains sufficient information to reconstruct acoustic features and classify phonemes [19, 41]. Sequence-to-sequence appr...
[3]

Methods. We evaluate an end-to-end intracortical speech decoder on the public Brain-to-Text ’25 benchmark. The proposed pipeline, depicted in Fig. 1, first applies a session-specific alignment layer to the neural features, followed by temporal patch embedding and a Conformer encoder that predicts character sequences with a CTC objective. During training,...
[4]

Results. In this section, we report the performance of the proposed model on the Brain-to-Text ’25 benchmark, focusing on the validation set (1,426 sentences), and analyze the main factors influencing its behavior. 4.1. Overall Performance and Model Comparison. We first compare in Table 3 the proposed Conformer-based model against the baseline provided ...
[5]

Conclusion. In this work, we presented an end-to-end Conformer-based decoder for intracortical speech neuroprostheses that directly maps neural activity to character sequences. By combining dataset augmentation, a session-specific alignment layer, temporal patch embedding, and a Conformer encoder trained with a CTC objective and entropy regularization, t...
[6]

Acknowledgement. This work was supported by grants PID2022-141378OB-C22 and AIA2025-163317-C32 funded by MICIU/AEI/10.13039/501100011033 and ERDF/EU
[7]

Brain-computer interfaces for restoring communication

E. F. Chang, “Brain-computer interfaces for restoring communication,” New England Journal of Medicine, vol. 391, no. 7, pp. 654–657, 2024

2024
[8]

The speech neuroprosthesis

A. B. Silva, K. T. Littlejohn, J. R. Liu, D. A. Moses, and E. F. Chang, “The speech neuroprosthesis,” Nature Reviews Neuroscience, vol. 25, no. 7, pp. 473–492, 2024

2024
[9]

Neuronal ensemble control of prosthetic devices by a human with tetraplegia

L. R. Hochberg, M. D. Serruya, G. M. Friehs, J. A. Mukand, M. Saleh, A. H. Caplan, A. Branner, D. Chen, R. D. Penn, and J. P. Donoghue, “Neuronal ensemble control of prosthetic devices by a human with tetraplegia,” Nature, vol. 442, no. 7099, pp. 164–171, 2006

2006
[10]

Cortical control of arm movements: A dynamical systems perspective

K. V. Shenoy, M. Sahani, and M. M. Churchland, “Cortical control of arm movements: A dynamical systems perspective,” Annual Review of Neuroscience, vol. 36, pp. 337–359, 2013

2013
[11]

Cognitive neural prosthetics

R. A. Andersen, J. W. Burdick, S. Musallam, B. Pesaran, and J. G. Cham, “Cognitive neural prosthetics,” Trends in Cognitive Sciences, vol. 8, no. 11, pp. 486–493, 2004

2004
[12]

Connecting cortex to machines: Recent advances in brain interfaces

J. P. Donoghue, “Connecting cortex to machines: Recent advances in brain interfaces,” Nature Neuroscience, vol. 5, pp. 1085–1088, 2002

2002
[13]

A spelling device for the paralysed

N. Birbaumer, N. Ghanayim, T. Hinterberger, I. Iversen, B. Kotchoubey, A. Kübler, J. Perelmouter, E. Taub, and H. Flor, “A spelling device for the paralysed,” Nature, vol. 398, no. 6725, pp. 297–298, 1999

1999
[14]

Brain-computer interfaces for communication and rehabilitation

U. Chaudhary, N. Birbaumer, and A. Ramos-Murguialday, “Brain-computer interfaces for communication and rehabilitation,” Nature Reviews Neurology, vol. 12, no. 9, pp. 513–525, 2016

2016
[15]

Fully implanted brain-computer interface in a locked-in patient with ALS

M. J. Vansteensel, E. G. M. Pels, M. G. Bleichner, M. P. Branco, T. Denison, Z. V. Freudenburg, P. Gosselaar, S. Leinders, T. H. Ottens, M. A. Van Den Boom, P. C. Van Rijen, E. J. Aarnoutse, and N. F. Ramsey, “Fully implanted brain-computer interface in a locked-in patient with ALS,” New England Journal of Medicine, vol. 375, no. 21, pp. 2060–2066, 2016

2016
[16]

Brain-computer interfaces for communication and control

J. R. Wolpaw, N. Birbaumer, D. J. McFarland, G. Pfurtscheller, and T. M. Vaughan, “Brain-computer interfaces for communication and control,” Clinical Neurophysiology, vol. 113, no. 6, pp. 767–791, 2002

2002
[17]

Brain-machine interfaces: Past, present and future

M. A. Lebedev and M. A. L. Nicolelis, “Brain-machine interfaces: Past, present and future,” Trends in Neurosciences, vol. 29, no. 9, pp. 536–546, 2006

2006
[18]

A brain-computer interface using electrocorticographic signals in humans

E. C. Leuthardt, G. Schalk, J. R. Wolpaw, J. G. Ojemann, and D. W. Moran, “A brain-computer interface using electrocorticographic signals in humans,” Journal of Neural Engineering, vol. 1, no. 2, pp. 63–71, 2004

2004
[19]

The open dataset of EEG motor imagery: BCI motor imagery data from healthy subjects

G. Wang, C. Teng, K. Li, Z. Zhang, and Y. Chai, “The open dataset of EEG motor imagery: BCI motor imagery data from healthy subjects,” Frontiers in Neuroscience, vol. 16, p. 1044299, 2022

2022
[20]

Semantic reconstruction of continuous language from non-invasive brain recordings

J. Tang, A. LeBel, S. Jain, and A. G. Huth, “Semantic reconstruction of continuous language from non-invasive brain recordings,” Nature Neuroscience, vol. 26, no. 5, pp. 858–866, 2023

2023
[21]

Enhancing detection of SSVEPs for a high-speed brain speller using task-related component analysis

M. Nakanishi, Y. Wang, X. Chen, Y.-T. Wang, X. Gao, and T.-P. Jung, “Enhancing detection of SSVEPs for a high-speed brain speller using task-related component analysis,” IEEE Transactions on Biomedical Engineering, vol. 65, no. 1, pp. 104–112, 2018

2018
[22]

A comprehensive review of EEG-based brain-computer interface paradigms

R. Abiri, S. Borhani, E. W. Sellers, Y. Jiang, and X. Zhao, “A comprehensive review of EEG-based brain-computer interface paradigms,” Journal of Neural Engineering, vol. 16, no. 1, p. 011001, 2019

2019
[23]

Machine translation of cortical activity to text with an encoder-decoder framework

J. G. Makin, D. A. Moses, and E. F. Chang, “Machine translation of cortical activity to text with an encoder-decoder framework,” Nature Neuroscience, vol. 23, no. 4, pp. 575–582, 2020

2020
[24]

Neuroprosthesis for decoding speech in a paralyzed person with anarthria

D. A. Moses, S. L. Metzger, J. R. Liu, G. K. Anumanchipalli, J. G. Makin, P. F. Sun, J. Chartier, M. E. Dougherty, P. M. Liu, G. M. Abrams, A. Tu-Chan, K. Ganguly, and E. F. Chang, “Neuroprosthesis for decoding speech in a paralyzed person with anarthria,” New England Journal of Medicine, vol. 385, no. 3, pp. 217–227, 2021

2021
[25]

Brain-to-text: Decoding spoken phrases from phone representations in the brain

C. Herff, D. Heger, A. De Pesters, D. Telaar, P. Brunner, G. Schalk, and T. Schultz, “Brain-to-text: Decoding spoken phrases from phone representations in the brain,” Frontiers in Neuroscience, vol. 9, p. 217, 2015

2015
[26]

Speech synthesis from ECoG using densely connected 3D convolutional neural networks

M. Angrick, C. Herff, E. Mugler, M. C. Tate, M. W. Slutzky, D. J. Krusienski, and T. Schultz, “Speech synthesis from ECoG using densely connected 3D convolutional neural networks,” Journal of Neural Engineering, vol. 16, no. 3, p. 036019, 2019

2019
[27]

High performance communication by people with paralysis using an intracortical brain-computer interface

C. Pandarinath, P. Nuyujukian, C. H. Blabe, B. L. Sorice, J. Saab, F. R. Willett, L. R. Hochberg, K. V. Shenoy, and J. M. Henderson, “High performance communication by people with paralysis using an intracortical brain-computer interface,” eLife, vol. 6, p. e18554, 2017

2017
[28]

Clinical translation of a high-performance neural prosthesis

V. Gilja, C. Pandarinath, C. H. Blabe, P. Nuyujukian, J. D. Simeral, A. A. Sarma, B. L. Sorice, J. A. Perge, B. Jarosiewicz, L. R. Hochberg, K. V. Shenoy, and J. M. Henderson, “Clinical translation of a high-performance neural prosthesis,” Nature Medicine, vol. 21, no. 10, pp. 1142–1145, 2015

2015
[29]

Reach and grasp by people with tetraplegia using a neurally controlled robotic arm

L. R. Hochberg, D. Bacher, B. Jarosiewicz, N. Y. Masse, J. D. Simeral, J. Vogel, S. Haddadin, J. Liu, S. S. Cash, P. van der Smagt, and J. P. Donoghue, “Reach and grasp by people with tetraplegia using a neurally controlled robotic arm,” Nature, vol. 485, no. 7398, pp. 372–375, 2012

2012
[30]

Accurate estimation of neural population dynamics without spike sorting

E. M. Trautmann, S. D. Stavisky, S. Lahiri, K. C. Ames, M. T. Kaufman, D. J. O’Shea, S. Vyas, X. Sun, I. Bhowmick, S. Bhowmick, B. M. Yu, N. Even-Chen, J. M. Henderson, and K. V. Shenoy, “Accurate estimation of neural population dynamics without spike sorting,” Neuron, vol. 103, no. 2, pp. 292–308, 2019

2019
[31]

High-performance brain-to-text communication via handwriting

F. R. Willett, D. T. Avansino, L. R. Hochberg, J. M. Henderson, and K. V. Shenoy, “High-performance brain-to-text communication via handwriting,” Nature, vol. 593, no. 7858, pp. 249–254, 2021

2021
[32]

A high-performance speech neuroprosthesis

F. R. Willett, E. M. Kunz, C. Fan, D. T. Avansino, G. H. Wilson, E. Y. Choi, F. Kamdar, L. R. Hochberg, J. M. Henderson, P. Bhatt, P. Rezaii, and K. V. Shenoy, “A high-performance speech neuroprosthesis,” Nature, vol. 620, no. 7976, pp. 1031–1036, 2023

2023
[33]

An accurate and rapidly calibrating speech neuroprosthesis

N. S. Card, M. Wairagkar, C. Iacono, P. Bhatt, T. Singer-Clark, F. R. Willett, K. C. Ames, J. Liu, P. Rezaii, L. R. Hochberg, J. M. Henderson, K. V. Shenoy, and D. M. Brandman, “An accurate and rapidly calibrating speech neuroprosthesis,” New England Journal of Medicine, vol. 391, no. 7, pp. 609–618, 2024

2024
[34]

wav2vec 2.0: A framework for self-supervised learning of speech representations

A. Baevski, H. Zhou, A. Mohamed, and M. Auli, “wav2vec 2.0: A framework for self-supervised learning of speech representations,” Advances in Neural Information Processing Systems, vol. 33, pp. 12449–12460, 2020

2020
[35]

Robust speech recognition via large-scale weak supervision

A. Radford, J. W. Kim, T. Xu, G. Brockman, C. McLeavey, and I. Sutskever, “Robust speech recognition via large-scale weak supervision,” Proceedings of the International Conference on Machine Learning, pp. 28492–28518, 2023

2023
[36]

Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks

A. Graves, S. Fernández, F. Gomez, and J. Schmidhuber, “Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks,” Proceedings of the International Conference on Machine Learning, pp. 369–376, 2006

2006
[37]

Deep Speech: Scaling up end-to-end speech recognition

A. Hannun, C. Case, J. Casper, B. Catanzaro, G. Diamos, E. Elsen, R. Prenger, S. Satheesh, S. Sengupta, A. Coates, and A. Y. Ng, “Deep speech: Scaling up end-to-end speech recognition,” arXiv preprint arXiv:1412.5567, 2014

2014
[38]

Single-trial dynamics of motor cortex and their applications to brain-machine interfaces

J. C. Kao, P. Nuyujukian, S. I. Ryu, M. M. Churchland, J. P. Cunningham, and K. V. Shenoy, “Single-trial dynamics of motor cortex and their applications to brain-machine interfaces,” Nature Communications, vol. 6, p. 7759, 2015

2015
[39]

Conformer: Convolution-augmented transformer for speech recognition

A. Gulati, J. Qin, C.-C. Chiu, N. Parmar, Y. Zhang, J. Yu, W. Han, S. Wang, Z. Zhang, Y. Wu, and R. Pang, “Conformer: Convolution-augmented transformer for speech recognition,” in Proc. Interspeech, 2020, pp. 5036–5040

2020
[40]

Neural control of cursor trajectory and click by a human with tetraplegia 1000 days after implant of an intracortical microelectrode array

J. D. Simeral, S.-P. Kim, M. J. Black, J. P. Donoghue, and L. R. Hochberg, “Neural control of cursor trajectory and click by a human with tetraplegia 1000 days after implant of an intracortical microelectrode array,” Journal of Neural Engineering, vol. 8, no. 2, p. 025027, 2011

2011
[41]

Neural manifolds for the control of movement

J. A. Gallego, M. G. Perich, L. E. Miller, and S. A. Solla, “Neural manifolds for the control of movement,” Neuron, vol. 94, no. 5, pp. 978–984, 2017

2017
[42]

Single-unit stability using chronically implanted multielectrode arrays in motor cortex of macaque monkeys

C. A. Chestek, V. Gilja, P. Nuyujukian, J. D. Foster, J. M. Fan, M. T. Kaufman, M. M. Churchland, Z. Rivera-Alvidrez, J. P. Cunningham, S. I. Ryu, and K. V. Shenoy, “Single-unit stability using chronically implanted multielectrode arrays in motor cortex of macaque monkeys,” Journal of Neurophysiology, vol. 105, no. 2, pp. 567–579, 2011

2011
[43]

Jasper: An end-to-end convolutional neural acoustic model

J. Li, V. Lavrukhin, B. Ginsburg, R. Leary, O. Kuchaiev, J. M. Cohen, H. Nguyen, and R. T. Gadde, “Jasper: An end-to-end convolutional neural acoustic model,” in Proc. Interspeech, 2019, pp. 71–75

2019
[44]

SpecAugment: A simple data augmentation method for automatic speech recognition

D. S. Park, W. Chan, Y. Zhang, C.-C. Chiu, B. Zoph, E. D. Cubuk, and Q. V. Le, “SpecAugment: A simple data augmentation method for automatic speech recognition,” in Proc. Interspeech, 2019, pp. 2613–2617

2019
[45]

Brain-computer interfaces for speech communication

J. S. Brumberg, A. Nieto-Castanon, P. R. Kennedy, and F. H. Guenther, “Brain-computer interfaces for speech communication,” Speech Communication, vol. 52, no. 4, pp. 367–379, 2010

2010
[46]

Decoding spectrotemporal features of overt and covert speech from the human cortex

S. Martin, P. Brunner, C. Holdgraf, H.-J. Heinze, N. E. Crone, J. Rieger, G. Schalk, R. T. Knight, and B. N. Pasley, “Decoding spectrotemporal features of overt and covert speech from the human cortex,” Frontiers in Neuroengineering, vol. 7, p. 14, 2014

2014
[47]

Speech synthesis from neural decoding of spoken sentences

G. K. Anumanchipalli, J. Chartier, and E. F. Chang, “Speech synthesis from neural decoding of spoken sentences,” Nature, vol. 568, no. 7753, pp. 493–498, 2019

2019
[48]

A high-performance neuroprosthesis for speech decoding and avatar control

S. L. Metzger, K. T. Littlejohn, A. B. Silva, D. A. Moses, M. P. Seaton, R. Wang, M. E. Dougherty, J. R. Liu, P. Wu, M. A. Berger, I. Zhuravleva, A. Tu-Chan, K. Ganguly, G. K. Anumanchipalli, and E. F. Chang, “A high-performance neuroprosthesis for speech decoding and avatar control,” Nature, vol. 620, no. 7976, pp. 1037–1046, 2023

2023
[49]

Listen, attend and spell: A neural network for large vocabulary conversational speech recognition

W. Chan, N. Jaitly, Q. Le, and O. Vinyals, “Listen, attend and spell: A neural network for large vocabulary conversational speech recognition,” in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2016, pp. 4960–4964

2016
[50]

Virtual typing by people with tetraplegia using a self-calibrating intracortical brain-computer interface

B. Jarosiewicz, A. A. Sarma, D. Bacher, N. Y. Masse, J. D. Simeral, B. Sorice, E. M. Oakley, C. Blabe, C. Pandarinath, V. Gilja, S. S. Cash, E. N. Eskandar, G. Friehs, J. M. Henderson, K. V. Shenoy, J. P. Donoghue, and L. R. Hochberg, “Virtual typing by people with tetraplegia using a self-calibrating intracortical brain-computer interface,” Science...

2015
[51]

Stabilization of a brain-computer interface via the alignment of low-dimensional spaces of neural activity

A. D. Degenhart, W. E. Bishop, E. R. Oby, E. C. Tyler-Kabara, S. M. Chase, A. P. Batista, and B. M. Yu, “Stabilization of a brain-computer interface via the alignment of low-dimensional spaces of neural activity,” Nature Biomedical Engineering, vol. 4, no. 7, pp. 672–685, 2020

2020
[52]

An image is worth 16x16 words: Transformers for image recognition at scale

A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, and N. Houlsby, “An image is worth 16x16 words: Transformers for image recognition at scale,” in Proc. International Conference on Learning Representations, 2021

2021
[53]

Swin transformer: Hierarchical vision transformer using shifted windows

Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin, and B. Guo, “Swin transformer: Hierarchical vision transformer using shifted windows,” in Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2021, pp. 10012–10022

2021

[1] [1]

Introduction Neural speech prostheses [1, 2] represent one of the most am- bitious frontiers in modern neuroscience and biomedical engi- neering, offering the prospect of restoring lost communication to individuals with severe neurological conditions [3, 4, 5, 6]. Among the populations who stand to benefit most are those af- fected by amyotrophic lateral ...

work page internal anchor Pith review Pith/arXiv arXiv 2026

[2] [2]

Related Work The decoding of speech and language from neural signals has progressed rapidly across multiple recording modalities [39, 40, 41]. Early work with ECoG demonstrated that neural activity in speech-related cortical regions contains sufficient informa- tion to reconstruct acoustic features and classify phonemes [19, 41]. Sequence-to-sequence appr...

[3]

Methods

We evaluate an end-to-end intracortical speech decoder on the public Brain-to-Text ’25 benchmark. The proposed pipeline, depicted in Fig. 1, first applies a session-specific alignment layer to the neural features, followed by temporal patch embedding and a Conformer encoder that predicts character sequences with a CTC objective. During training,...
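As a rough illustration of how a CTC-trained encoder's per-frame outputs can be turned into a character sequence without a language model, the sketch below applies best-path (greedy) CTC decoding: take the most likely symbol per frame, merge consecutive repeats, then drop blanks. The excerpt does not state which decoding rule the authors use, so this is an assumption; the alphabet, the blank index, and the helper names `ctc_greedy_decode` and `one_hot` are hypothetical and chosen only for the example.

```python
from itertools import groupby

# Assumption: index 0 is the CTC blank; the character set below is hypothetical.
BLANK = 0
ALPHABET = "_abcdefghijklmnopqrstuvwxyz '"  # "_" is only a placeholder for the blank


def ctc_greedy_decode(frame_scores):
    """Best-path CTC decoding over a list of per-frame score vectors."""
    # 1. Pick the highest-scoring symbol index in every frame.
    best = [max(range(len(f)), key=f.__getitem__) for f in frame_scores]
    # 2. Merge runs of identical consecutive indices into one.
    collapsed = [idx for idx, _ in groupby(best)]
    # 3. Remove blanks and map the remaining indices to characters.
    return "".join(ALPHABET[i] for i in collapsed if i != BLANK)


def one_hot(index, size=len(ALPHABET)):
    """Toy score vector that puts all mass on one symbol (for the demo only)."""
    return [1.0 if j == index else 0.0 for j in range(size)]


if __name__ == "__main__":
    # Frames "h h _ i i _" collapse to "hi".
    frames = [one_hot(i) for i in [8, 8, 0, 9, 9, 0]]
    print(ctc_greedy_decode(frames))
```

Note that a doubled letter survives only when a blank separates the two occurrences, which is why the blank symbol is needed in the first place.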


[4]

Results

In this section, we report the performance of the proposed model on the Brain-to-Text ’25 benchmark, focusing on the validation set (1,426 sentences), and analyze the main factors influencing its behavior.

4.1. Overall Performance and Model Comparison

We first compare in Table 3 the proposed Conformer-based model against the baseline provided ...
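For readers unfamiliar with the character error rate used to report performance, the following sketch computes it as Levenshtein edit distance divided by reference length. The excerpt does not say whether the authors aggregate per sentence or over the whole validation set; the corpus-level aggregation here is an assumption, and the function names `edit_distance` and `corpus_cer` are hypothetical.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(hyp) + 1))  # distances from the empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference characters so far
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion of r
                curr[j - 1] + 1,         # insertion of h
                prev[j - 1] + (r != h),  # match or substitution
            ))
        prev = curr
    return prev[-1]


def corpus_cer(refs, hyps):
    """Assumed corpus-level CER: total edits over total reference characters."""
    total_chars = sum(len(r) for r in refs)
    if total_chars == 0:
        raise ValueError("reference text is empty")
    total_edits = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    return total_edits / total_chars


if __name__ == "__main__":
    # A single missing space (a word-boundary error) costs one edit in 11 characters.
    print(corpus_cer(["hello world"], ["helloworld"]))
```

Under this definition a missed word boundary counts as one character edit, so boundary errors and letter confusions contribute to the metric on the same scale.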


[5]

Conclusion

In this work, we presented an end-to-end Conformer-based decoder for intracortical speech neuroprostheses that directly maps neural activity to character sequences. By combining dataset augmentation, a session-specific alignment layer, temporal patch embedding, and a Conformer encoder trained with a CTC objective and entropy regularization, t...

[6]

Acknowledgement

This work was supported by grants PID2022-141378OB-C22 and AIA2025-163317-C32 funded by MICIU/AEI/10.13039/501100011033 and ERDF/EU

[7]

E. F. Chang, “Brain-computer interfaces for restoring communication,” New England Journal of Medicine, vol. 391, no. 7, pp. 654–657, 2024

[8]

A. B. Silva, K. T. Littlejohn, J. R. Liu, D. A. Moses, and E. F. Chang, “The speech neuroprosthesis,” Nature Reviews Neuroscience, vol. 25, no. 7, pp. 473–492, 2024

[9]

L. R. Hochberg, M. D. Serruya, G. M. Friehs, J. A. Mukand, M. Saleh, A. H. Caplan, A. Branner, D. Chen, R. D. Penn, and J. P. Donoghue, “Neuronal ensemble control of prosthetic devices by a human with tetraplegia,” Nature, vol. 442, no. 7099, pp. 164–171, 2006

[10]

K. V. Shenoy, M. Sahani, and M. M. Churchland, “Cortical control of arm movements: A dynamical systems perspective,” Annual Review of Neuroscience, vol. 36, pp. 337–359, 2013

[11]

R. A. Andersen, J. W. Burdick, S. Musallam, B. Pesaran, and J. G. Cham, “Cognitive neural prosthetics,” Trends in Cognitive Sciences, vol. 8, no. 11, pp. 486–493, 2004

[12]

J. P. Donoghue, “Connecting cortex to machines: Recent advances in brain interfaces,” Nature Neuroscience, vol. 5, pp. 1085–1088, 2002

[13]

N. Birbaumer, N. Ghanayim, T. Hinterberger, I. Iversen, B. Kotchoubey, A. Kübler, J. Perelmouter, E. Taub, and H. Flor, “A spelling device for the paralysed,” Nature, vol. 398, no. 6725, pp. 297–298, 1999

[14]

U. Chaudhary, N. Birbaumer, and A. Ramos-Murguialday, “Brain-computer interfaces for communication and rehabilitation,” Nature Reviews Neurology, vol. 12, no. 9, pp. 513–525, 2016

[15]

M. J. Vansteensel, E. G. M. Pels, M. G. Bleichner, M. P. Branco, T. Denison, Z. V. Freudenburg, P. Gosselaar, S. Leinders, T. H. Ottens, M. A. Van Den Boom, P. C. Van Rijen, E. J. Aarnoutse, and N. F. Ramsey, “Fully implanted brain-computer interface in a locked-in patient with ALS,” New England Journal of Medicine, vol. 375, no. 21, pp. 2060–2066, 2016

[16]

J. R. Wolpaw, N. Birbaumer, D. J. McFarland, G. Pfurtscheller, and T. M. Vaughan, “Brain-computer interfaces for communication and control,” Clinical Neurophysiology, vol. 113, no. 6, pp. 767–791, 2002

[17]

M. A. Lebedev and M. A. L. Nicolelis, “Brain-machine interfaces: Past, present and future,” Trends in Neurosciences, vol. 29, no. 9, pp. 536–546, 2006

[18]

E. C. Leuthardt, G. Schalk, J. R. Wolpaw, J. G. Ojemann, and D. W. Moran, “A brain-computer interface using electrocorticographic signals in humans,” Journal of Neural Engineering, vol. 1, no. 2, pp. 63–71, 2004

[19]

G. Wang, C. Teng, K. Li, Z. Zhang, and Y. Chai, “The open dataset of EEG motor imagery: BCI motor imagery data from healthy subjects,” Frontiers in Neuroscience, vol. 16, p. 1044299, 2022

[20]

J. Tang, A. LeBel, S. Jain, and A. G. Huth, “Semantic reconstruction of continuous language from non-invasive brain recordings,” Nature Neuroscience, vol. 26, no. 5, pp. 858–866, 2023

[21]

M. Nakanishi, Y. Wang, X. Chen, Y.-T. Wang, X. Gao, and T.-P. Jung, “Enhancing detection of SSVEPs for a high-speed brain speller using task-related component analysis,” IEEE Transactions on Biomedical Engineering, vol. 65, no. 1, pp. 104–112, 2018

[22]

R. Abiri, S. Borhani, E. W. Sellers, Y. Jiang, and X. Zhao, “A comprehensive review of EEG-based brain-computer interface paradigms,” Journal of Neural Engineering, vol. 16, no. 1, p. 011001, 2019

[23]

J. G. Makin, D. A. Moses, and E. F. Chang, “Machine translation of cortical activity to text with an encoder-decoder framework,” Nature Neuroscience, vol. 23, no. 4, pp. 575–582, 2020

[24]

D. A. Moses, S. L. Metzger, J. R. Liu, G. K. Anumanchipalli, J. G. Makin, P. F. Sun, J. Chartier, M. E. Dougherty, P. M. Liu, G. M. Abrams, A. Tu-Chan, K. Ganguly, and E. F. Chang, “Neuroprosthesis for decoding speech in a paralyzed person with anarthria,” New England Journal of Medicine, vol. 385, no. 3, pp. 217–227, 2021

[25]

C. Herff, D. Heger, A. De Pesters, D. Telaar, P. Brunner, G. Schalk, and T. Schultz, “Brain-to-text: Decoding spoken phrases from phone representations in the brain,” Frontiers in Neuroscience, vol. 9, p. 217, 2015

[26]

M. Angrick, C. Herff, E. Mugler, M. C. Tate, M. W. Slutzky, D. J. Krusienski, and T. Schultz, “Speech synthesis from ECoG using densely connected 3D convolutional neural networks,” Journal of Neural Engineering, vol. 16, no. 3, p. 036019, 2019

[27]

C. Pandarinath, P. Nuyujukian, C. H. Blabe, B. L. Sorice, J. Saab, F. R. Willett, L. R. Hochberg, K. V. Shenoy, and J. M. Henderson, “High performance communication by people with paralysis using an intracortical brain-computer interface,” eLife, vol. 6, p. e18554, 2017

[28]

V. Gilja, C. Pandarinath, C. H. Blabe, P. Nuyujukian, J. D. Simeral, A. A. Sarma, B. L. Sorice, J. A. Perge, B. Jarosiewicz, L. R. Hochberg, K. V. Shenoy, and J. M. Henderson, “Clinical translation of a high-performance neural prosthesis,” Nature Medicine, vol. 21, no. 10, pp. 1142–1145, 2015

[29]

L. R. Hochberg, D. Bacher, B. Jarosiewicz, N. Y. Masse, J. D. Simeral, J. Vogel, S. Haddadin, J. Liu, S. S. Cash, P. van der Smagt, and J. P. Donoghue, “Reach and grasp by people with tetraplegia using a neurally controlled robotic arm,” Nature, vol. 485, no. 7398, pp. 372–375, 2012

[30]

E. M. Trautmann, S. D. Stavisky, S. Lahiri, K. C. Ames, M. T. Kaufman, D. J. O’Shea, S. Vyas, X. Sun, I. Bhowmick, S. Bhowmick, B. M. Yu, N. Even-Chen, J. M. Henderson, and K. V. Shenoy, “Accurate estimation of neural population dynamics without spike sorting,” Neuron, vol. 103, no. 2, pp. 292–308, 2019

[31]

F. R. Willett, D. T. Avansino, L. R. Hochberg, J. M. Henderson, and K. V. Shenoy, “High-performance brain-to-text communication via handwriting,” Nature, vol. 593, no. 7858, pp. 249–254, 2021

[32]

F. R. Willett, E. M. Kunz, C. Fan, D. T. Avansino, G. H. Wilson, E. Y. Choi, F. Kamdar, L. R. Hochberg, J. M. Henderson, P. Bhatt, P. Rezaii, and K. V. Shenoy, “A high-performance speech neuroprosthesis,” Nature, vol. 620, no. 7976, pp. 1031–1036, 2023

[33]

N. S. Card, M. Wairagkar, C. Iacono, P. Bhatt, T. Singer-Clark, F. R. Willett, K. C. Ames, J. Liu, P. Rezaii, L. R. Hochberg, J. M. Henderson, K. V. Shenoy, and D. M. Brandman, “An accurate and rapidly calibrating speech neuroprosthesis,” New England Journal of Medicine, vol. 391, no. 7, pp. 609–618, 2024

[34]

A. Baevski, H. Zhou, A. Mohamed, and M. Auli, “wav2vec 2.0: A framework for self-supervised learning of speech representations,” Advances in Neural Information Processing Systems, vol. 33, pp. 12 449–12 460, 2020

[35]

A. Radford, J. W. Kim, T. Xu, G. Brockman, C. McLeavey, and I. Sutskever, “Robust speech recognition via large-scale weak supervision,” Proceedings of the International Conference on Machine Learning, pp. 28 492–28 518, 2023

[36]

A. Graves, S. Fernández, F. Gomez, and J. Schmidhuber, “Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks,” Proceedings of the International Conference on Machine Learning, pp. 369–376, 2006

[37]

A. Hannun, C. Case, J. Casper, B. Catanzaro, G. Diamos, E. Elsen, R. Prenger, S. Satheesh, S. Sengupta, A. Coates, and A. Y. Ng, “Deep speech: Scaling up end-to-end speech recognition,” arXiv preprint arXiv:1412.5567, 2014

[38]

J. C. Kao, P. Nuyujukian, S. I. Ryu, M. M. Churchland, J. P. Cunningham, and K. V. Shenoy, “Single-trial dynamics of motor cortex and their applications to brain-machine interfaces,” Nature Communications, vol. 6, p. 7759, 2015

[39]

A. Gulati, J. Qin, C.-C. Chiu, N. Parmar, Y. Zhang, J. Yu, W. Han, S. Wang, Z. Zhang, Y. Wu, and R. Pang, “Conformer: Convolution-augmented transformer for speech recognition,” in Proc. Interspeech, 2020, pp. 5036–5040

[40]

J. D. Simeral, S.-P. Kim, M. J. Black, J. P. Donoghue, and L. R. Hochberg, “Neural control of cursor trajectory and click by a human with tetraplegia 1000 days after implant of an intracortical microelectrode array,” Journal of Neural Engineering, vol. 8, no. 2, p. 025027, 2011

[41]

J. A. Gallego, M. G. Perich, L. E. Miller, and S. A. Solla, “Neural manifolds for the control of movement,” Neuron, vol. 94, no. 5, pp. 978–984, 2017

[42]

C. A. Chestek, V. Gilja, P. Nuyujukian, J. D. Foster, J. M. Fan, M. T. Kaufman, M. M. Churchland, Z. Rivera-Alvidrez, J. P. Cunningham, S. I. Ryu, and K. V. Shenoy, “Single-unit stability using chronically implanted multielectrode arrays in motor cortex of macaque monkeys,” Journal of Neurophysiology, vol. 105, no. 2, pp. 567–579, 2011

[43]

J. Li, V. Lavrukhin, B. Ginsburg, R. Leary, O. Kuchaiev, J. M. Cohen, H. Nguyen, and R. T. Gadde, “Jasper: An end-to-end convolutional neural acoustic model,” in Proc. Interspeech, 2019, pp. 71–75

[44]

D. S. Park, W. Chan, Y. Zhang, C.-C. Chiu, B. Zoph, E. D. Cubuk, and Q. V. Le, “SpecAugment: A simple data augmentation method for automatic speech recognition,” in Proc. Interspeech, 2019, pp. 2613–2617

[45]

J. S. Brumberg, A. Nieto-Castanon, P. R. Kennedy, and F. H. Guenther, “Brain-computer interfaces for speech communication,” Speech Communication, vol. 52, no. 4, pp. 367–379, 2010

[46]

S. Martin, P. Brunner, C. Holdgraf, H.-J. Heinze, N. E. Crone, J. Rieger, G. Schalk, R. T. Knight, and B. N. Pasley, “Decoding spectrotemporal features of overt and covert speech from the human cortex,” Frontiers in Neuroengineering, vol. 7, p. 14, 2014

[47]

G. K. Anumanchipalli, J. Chartier, and E. F. Chang, “Speech synthesis from neural decoding of spoken sentences,” Nature, vol. 568, no. 7753, pp. 493–498, 2019

[48]

S. L. Metzger, K. T. Littlejohn, A. B. Silva, D. A. Moses, M. P. Seaton, R. Wang, M. E. Dougherty, J. R. Liu, P. Wu, M. A. Berger, I. Zhuravleva, A. Tu-Chan, K. Ganguly, G. K. Anumanchipalli, and E. F. Chang, “A high-performance neuroprosthesis for speech decoding and avatar control,” Nature, vol. 620, no. 7976, pp. 1037–1046, 2023

[49]

W. Chan, N. Jaitly, Q. Le, and O. Vinyals, “Listen, attend and spell: A neural network for large vocabulary conversational speech recognition,” in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2016, pp. 4960–4964

[50]

B. Jarosiewicz, A. A. Sarma, D. Bacher, N. Y. Masse, J. D. Simeral, B. Sorice, E. M. Oakley, C. Blabe, C. Pandarinath, V. Gilja, S. S. Cash, E. N. Eskandar, G. Friehs, J. M. Henderson, K. V. Shenoy, J. P. Donoghue, and L. R. Hochberg, “Virtual typing by people with tetraplegia using a self-calibrating intracortical brain-computer interface,” Science...

[51]

A. D. Degenhart, W. E. Bishop, E. R. Oby, E. C. Tyler-Kabara, S. M. Chase, A. P. Batista, and B. M. Yu, “Stabilization of a brain-computer interface via the alignment of low-dimensional spaces of neural activity,” Nature Biomedical Engineering, vol. 4, no. 7, pp. 672–685, 2020

[52]

A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, and N. Houlsby, “An image is worth 16x16 words: Transformers for image recognition at scale,” in Proc. International Conference on Learning Representations, 2021

[53]

Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin, and B. Guo, “Swin transformer: Hierarchical vision transformer using shifted windows,” in Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2021, pp. 10 012–10 022