arXiv: 2604.16441 · v1 · submitted 2026-04-07 · 💻 cs.SD · cs.AI · cs.CL

iPhoneme: Brain-to-Text Communication for ALS Using ConformerXL Decoding

Pith reviewed 2026-05-10 19:04 UTC · model grok-4.3

classification 💻 cs.SD · cs.AI · cs.CL
keywords phoneme · system · accuracy · brain-to-text · communication · input · iphoneme · speech

The pith

A modified Conformer model decodes intracranial EEG into text at 92% phoneme accuracy for ALS patients.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper describes iPhoneme, a system for restoring speech communication in people with ALS by translating brain signals directly into text. It combines a large ConformerXL neural network, which decodes phonemes from 256-channel brain recordings, with a gaze-assisted interface that avoids slow dwell-time selection. On the T15 dataset (45 sessions), the system reaches 92.14% phoneme accuracy and 73.39% word accuracy with 180 ms of latency on a CPU. A sympathetic reader would care because this level of performance could allow practical, real-time text entry for people who can no longer speak or move.

Core claim

The central claim is that iPhoneme achieves 92.14% phoneme accuracy and 73.39% word accuracy on the T15 intracranial EEG dataset, about 3% above the prior state of the art, while running in real time at 180 ms CPU latency. The system pairs a ConformerXL acoustic model (temporal prenet, multi-scale dilated convolutions, bidirectional GRU, and Pre-RMSNorm across 12 encoder blocks) with a 6-gram phoneme language model and WFST beam search.

What carries the argument

The ConformerXL decoder that processes neural signals from speech motor cortex to predict phoneme sequences, stabilized by Pre-RMSNorm and trained with AdamW and cosine scheduling, integrated with a chorded gaze and silent-speech input method.
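
As a rough illustration of the Pre-RMSNorm idea named above, the sketch below applies RMS normalization before a sub-layer and then adds the residual. It follows the standard RMSNorm definition (Zhang and Sennrich); the function names, the `eps` value, and the toy sub-layer are assumptions made for illustration and are not taken from the paper's implementation.

```python
import math
from typing import Callable, List

Vector = List[float]


def rms_norm(x: Vector, gain: Vector, eps: float = 1e-8) -> Vector:
    """Divide x by its root mean square, then apply a per-feature gain."""
    mean_square = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_square + eps)
    return [g * v * inv_rms for g, v in zip(gain, x)]


def pre_norm_residual(
    x: Vector, sublayer: Callable[[Vector], Vector], gain: Vector
) -> Vector:
    """Pre-norm ordering: normalize, run the sub-layer, add the untouched input."""
    y = sublayer(rms_norm(x, gain))
    return [xi + yi for xi, yi in zip(x, y)]


# Hypothetical sub-layer standing in for attention, convolution, or FFN.
def halve(v: Vector) -> Vector:
    return [0.5 * t for t in v]


normalized = rms_norm([3.0, 4.0], gain=[1.0, 1.0])
block_output = pre_norm_residual([3.0, 4.0], halve, gain=[1.0, 1.0])
```

Because the residual path carries the input unnormalized, activations reaching each sub-layer stay at unit RMS regardless of depth, which is the usual argument for why this ordering helps deep stacks train stably.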

Load-bearing premise

The performance levels measured on the T15 dataset will generalize to new ALS patients, different recording setups, and extended real-world use without major accuracy loss or need for substantial retraining.

What would settle it

Recording intracranial EEG from a new ALS patient in a different setup and finding phoneme accuracy below 80% or word accuracy below 50% without model retraining would indicate the claim does not hold broadly.

Figures

Figures reproduced from arXiv: 2604.16441 by Dawit Chun, Sung Park, Yoonmin Cha.

Figure 1: T15 dataset overview showing iEEG waveforms,
Figure 3: Dataset split composition. Top: overall distribution.
Figure 2: Input feature heatmap for a representative trial. The
Figure 4: Data processing pipeline. 3D surface plots show the 512-channel signal at each stage for a 0.5s trial (T=0.5s). Bandpass
Figure 5: End-to-end pipeline. The ConformerXL decodes iEEG
Figure 6: ConformerXL architecture. The temporal prenet extracts multi-scale features via dilated convolutions and BiGRU before
Figure 7: Temporal prenet stage-by-stage processing. Dilated
Figure 8: BiGRU temporal jitter correction. Forward and back
Figure 9: Single ConformerXL block with Pre-RMSNorm before each sub-layer. FFN uses GELU + 0.5
Figure 10: CTC training. Multiple alignment paths through blank
Figure 12: WFST graph structure. Nodes are states; edges
Figure 13: Beam search visualization (simplified to width 3).
Figure 15: Gaze-phoneme interaction modalities. Top: swipe
Figure 16: Phoneme frequency analysis from LibriSpeech (365M
Figure 17: Safety vs. accuracy trade-off. Ideal triggers (upper
Figure 18: Per-phoneme confusion matrices (train left, validation
Figure 19: Left: per-phoneme accuracy ranking (avg. 85.7%).

Original abstract

Brain-computer interfaces (BCIs) for speech restoration hold transformative potential for the approximately 173,000–232,500 individuals worldwide with ALS-related dysarthria. Despite recent progress, high-performance speech BCIs have been demonstrated in only 22–31 patients globally, largely due to limitations in neural decoding accuracy and practical input interfaces. We present iPhoneme, a brain-to-text communication system that jointly addresses these challenges through integrated modeling and interaction design. The system combines a deep learning phoneme decoder based on a modified Conformer architecture (ConformerXL, 192.9M parameters) with a gaze-assisted phoneme input interface that mitigates the Midas touch problem in eye-tracking systems. The acoustic model incorporates a temporal prenet with multi-scale dilated convolutions and bidirectional GRU for neural jitter correction, temporal subsampling for CTC stability, and Pre-RMSNorm stabilization across 12 encoder blocks, trained with AdamW and cosine scheduling. On the interaction side, iPhoneme introduces a chorded gaze-plus-silent-speech paradigm that replaces dwell-time selection, enabling more efficient input. We evaluate the system on the T15 dataset (45 sessions, 8,071 trials) of 256-channel intracranial EEG from speech motor cortex regions. A 6-gram phoneme language model trained on 3.1M sequences, combined with WFST beam search (beam=128), achieves 92.14% phoneme accuracy (7.86% PER) and 73.39% word accuracy (26.61% WER), approximately 3% above prior state-of-the-art. The system operates on CPU with 180 ms latency, demonstrating real-time, high-accuracy brain-to-text communication for ALS.
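
The abstract reports accuracy as the complement of an error rate (92.14% + 7.86% = 100%, and likewise for words). A minimal sketch of how such an error rate is conventionally computed, as total edit distance divided by total reference length, is given below. The function names and the toy phoneme sequences are illustrative assumptions; the paper's own scoring code is not shown in the text above.

```python
from typing import List, Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance: substitutions, insertions, deletions all cost 1."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def error_rate(refs: List[Sequence[str]], hyps: List[Sequence[str]]) -> float:
    """Corpus-level error rate: summed edits over summed reference tokens."""
    total_edits = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    total_tokens = sum(len(r) for r in refs)
    return total_edits / total_tokens


# Toy example (hypothetical data): one substitution and one deletion.
reference = [["HH", "AH", "L", "OW"]]
hypothesis = [["HH", "EH", "L"]]
per = error_rate(reference, hypothesis)
phoneme_accuracy = 1.0 - per
```

The same function gives a word error rate when the sequences hold words instead of phonemes. Note that an error rate defined this way can exceed 1 when the hypothesis contains many insertions, so "accuracy" computed as its complement is a convention and not a bounded probability.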

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper presents iPhoneme, a brain-to-text BCI system for ALS that uses a modified ConformerXL decoder on 256-channel iEEG data from the T15 dataset. It reports 92.14% phoneme accuracy (7.86% PER) and 73.39% word accuracy (26.61% WER) on 45 sessions/8,071 trials, approximately 3% above prior SOTA, with 180 ms CPU latency, using a 6-gram LM and WFST decoding alongside a gaze-assisted phoneme input interface.

Significance. Should the results prove robust, this work would mark a significant step in high-accuracy, real-time speech decoding for BCIs, potentially benefiting ALS patients with dysarthria. The large model size (192.9M parameters), specific architectural choices such as multi-scale dilated convolutions and Pre-RMSNorm, and the low-latency CPU implementation are notable strengths. Integrating decoding with an interaction paradigm that addresses the Midas touch problem adds practical value. However, the single-subject evaluation constrains the immediate transformative potential.

major comments (3)
  1. [Abstract] The central performance claims (92.14% phoneme accuracy and 73.39% word accuracy) are presented without error bars, details on the train/test split beyond 'held-out trials', statistical tests, or ablation results. This weakens the ability to verify the ~3% improvement over prior state-of-the-art as statistically meaningful.
  2. [Abstract] Evaluation is limited to the T15 dataset from a single subject (speech motor cortex). The introduction highlights the need for BCIs applicable to many ALS patients, yet no cross-patient transfer, multi-subject validation, or leave-one-out experiments are reported. BCI performance is highly subject-dependent, so this is a load-bearing limitation for the claimed relevance to ALS communication restoration.
  3. [Model description (in abstract and methods)] The ConformerXL modifications (temporal prenet with multi-scale dilated convs + biGRU, temporal subsampling, Pre-RMSNorm across 12 blocks) are described, but without ablation studies or comparisons showing their individual impacts on the PER/WER metrics, it is unclear if they are necessary for the reported gains.
minor comments (2)
  1. [Abstract] The dataset is referred to as 'T15' without a reference or brief description of its origin, size, or collection protocol beyond the session/trial counts.
  2. [Abstract] The latency is specified as '180 ms CPU latency' but it is not clear if this includes the full pipeline (decoding + LM + interface) or just the acoustic model.
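
The uncertainty estimate asked for in major comment 1 could be obtained with a percentile bootstrap over held-out trials. The sketch below is one way to do that under stated assumptions: the per-trial edit counts and reference lengths are hypothetical inputs, and the resample count, seed, and function name are illustrative choices, not anything reported by the paper.

```python
import random
from typing import List, Tuple


def bootstrap_per_ci(
    errors: List[int],
    lengths: List[int],
    n_resamples: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float]:
    """Percentile bootstrap CI for a corpus-level error rate.

    errors[i]  : edit distance for trial i (hypothetical input)
    lengths[i] : reference length for trial i, assumed positive
    """
    rng = random.Random(seed)
    n = len(errors)
    stats = []
    for _ in range(n_resamples):
        # Resample trials with replacement, then recompute the pooled rate.
        idx = [rng.randrange(n) for _ in range(n)]
        total_err = sum(errors[i] for i in idx)
        total_len = sum(lengths[i] for i in idx)
        stats.append(total_err / total_len)
    stats.sort()
    lower = stats[int((alpha / 2) * n_resamples)]
    upper = stats[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Hypothetical per-trial results for six held-out trials.
trial_errors = [1, 0, 2, 0, 1, 0]
trial_lengths = [10, 12, 9, 11, 10, 8]
ci_low, ci_high = bootstrap_per_ci(trial_errors, trial_lengths)
```

Resampling whole trials keeps phonemes from the same utterance together, which matters because errors within a trial are unlikely to be independent. A comparison against a prior system would need the same resampled indices applied to both systems' outputs, which requires per-trial results the page does not provide.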

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the positive assessment of our work's potential significance and for the detailed major comments. We address each point below and commit to revisions that enhance the manuscript's clarity and rigor.

point-by-point responses
  1. Referee: [Abstract] The central performance claims (92.14% phoneme accuracy and 73.39% word accuracy) are presented without error bars, details on the train/test split beyond 'held-out trials', statistical tests, or ablation results. This weakens the ability to verify the ~3% improvement over prior state-of-the-art as statistically meaningful.

    Authors: We concur that these elements are important for robust claims. In the revised manuscript, we will include error bars (standard deviation across 5 random seeds), specify the train/test split (e.g., 80/20 on sessions with held-out trials from later sessions), and add statistical tests such as a bootstrap confidence interval for the PER/WER to confirm the improvement over prior SOTA is significant. revision: yes

  2. Referee: [Abstract] Evaluation is limited to the T15 dataset from a single subject (speech motor cortex). The introduction highlights the need for BCIs applicable to many ALS patients, yet no cross-patient transfer, multi-subject validation, or leave-one-out experiments are reported. BCI performance is highly subject-dependent, so this is a load-bearing limitation for the claimed relevance to ALS communication restoration.

    Authors: This is a fair observation. Our evaluation is confined to the single-subject T15 dataset, as is typical for intracranial recordings in BCI research. We will revise the discussion to include a limitations paragraph acknowledging subject-specific variability in BCI performance, citing relevant literature on single-subject studies, and suggesting pathways for future multi-subject generalization. We cannot introduce new multi-subject data in this revision. revision: partial

  3. Referee: [Model description (in abstract and methods)] The ConformerXL modifications (temporal prenet with multi-scale dilated convs + biGRU, temporal subsampling, Pre-RMSNorm across 12 blocks) are described, but without ablation studies or comparisons showing their individual impacts on the PER/WER metrics, it is unclear if they are necessary for the reported gains.

    Authors: We agree that demonstrating the necessity of these modifications requires ablations. We will add ablation studies in the revised results section, training and evaluating model variants with each component ablated individually and reporting the corresponding phoneme error rate (PER) and word error rate (WER) metrics. This will quantify the contribution of the temporal prenet, dilated convolutions, biGRU, subsampling, and Pre-RMSNorm. revision: yes
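
The leave-one-out ablation plan in the third response can be written down as a small configuration grid. The sketch below only enumerates the variants; the component names are taken from the response, while the flag names and dictionary layout are hypothetical and do not reflect the authors' configuration format. Training and scoring each variant are left out.

```python
from typing import Dict, List

# Components listed in the authors' response; identifiers are illustrative.
COMPONENTS: List[str] = [
    "temporal_prenet",
    "dilated_convs",
    "bigru",
    "temporal_subsampling",
    "pre_rmsnorm",
]


def ablation_variants(components: List[str]) -> Dict[str, Dict[str, bool]]:
    """Build the full model config plus one variant per disabled component."""
    full = {name: True for name in components}
    variants = {"full": full}
    for name in components:
        variant = dict(full)
        variant[name] = False  # ablate exactly one component
        variants[f"no_{name}"] = variant
    return variants


variants = ablation_variants(COMPONENTS)
for label, flags in variants.items():
    disabled = [name for name, enabled in flags.items() if not enabled]
    print(label, "disabled:", disabled or "none")
```

One caveat with this design: the dilated convolutions and the BiGRU are described as parts of the temporal prenet, so disabling the prenet as a whole overlaps with disabling its parts, and the per-component deltas should not be expected to sum to the full gain.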

Circularity Check

0 steps flagged

No significant circularity in reported performance metrics

full rationale

The paper's central claims consist of empirical accuracy numbers (92.14% phoneme accuracy, 73.39% word accuracy) obtained by evaluating a trained ConformerXL model on held-out trials from the T15 dataset. These are measured outcomes on standard train/test splits rather than quantities that reduce by construction to fitted parameters, self-citations, or definitional inputs. No equations, uniqueness theorems, or ansatzes are presented in the provided text that would create a self-definitional or fitted-input-called-prediction loop. The architecture description and training details follow conventional ML practices without load-bearing reductions to the final metrics.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 0 invented entities

The central claim rests on standard deep-learning assumptions about EEG signal stationarity and the effectiveness of the modified Conformer blocks; no new physical entities are postulated.

free parameters (2)
  • ConformerXL model size (192.9M parameters)
    Large number of trainable weights fitted to the T15 training data.
  • 6-gram language model and WFST beam size
    Hyperparameters chosen to optimize final word accuracy on the evaluation set.
axioms (2)
  • domain assumption: A temporal prenet with multi-scale dilated convolutions and bidirectional GRU can correct neural jitter in intracranial EEG sufficiently for CTC training.
    Invoked in the acoustic model description.
  • domain assumption: Pre-RMSNorm stabilization across 12 encoder blocks enables stable training of the 192.9M-parameter model.
    Stated as part of the architecture design.
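
The first axiom refers to CTC training, which relies on a many-to-one mapping from frame-level label paths to phoneme sequences. A minimal sketch of that standard collapse rule (merge consecutive repeats, then drop blanks) is shown below; the blank symbol and the example frames are assumptions for illustration only.

```python
from itertools import groupby
from typing import List, Sequence

BLANK = "_"  # hypothetical blank symbol


def ctc_collapse(frame_labels: Sequence[str], blank: str = BLANK) -> List[str]:
    """Map a frame-level CTC path to its label sequence."""
    # Step 1: merge runs of identical consecutive labels.
    merged = [label for label, _ in groupby(frame_labels)]
    # Step 2: remove blanks, which only serve as separators.
    return [label for label in merged if label != blank]


# Two different frame paths that collapse to the same phoneme sequence.
path_a = ["_", "HH", "HH", "_", "AH", "AH", "_", "L", "_", "L", "OW"]
path_b = ["HH", "_", "AH", "L", "L", "_", "L", "OW", "OW", "_", "_"]
decoded_a = ctc_collapse(path_a)
decoded_b = ctc_collapse(path_b)
```

The blank between the two `L` frames is what lets a repeated phoneme survive the merge step. Timing jitter in the input shifts where labels fall along the frame axis, and the collapse absorbs such shifts only as long as the per-frame predictions themselves remain correct, which is the condition the axiom depends on.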

pith-pipeline@v0.9.0 · 5626 in / 1562 out tokens · 65827 ms · 2026-05-10T19:04:36.255632+00:00 · methodology


