TokenSE: a Mamba-based discrete token speech enhancement framework for cochlear implants
Pith reviewed 2026-05-10 16:16 UTC · model grok-4.3
The pith
TokenSE uses a Mamba model to predict clean neural audio codec tokens from noisy inputs and improves speech intelligibility for cochlear implant users.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We propose TokenSE, a Mamba-based discrete token speech enhancement framework that predicts clean codec token indices from degraded inputs to restore speech quality and intelligibility for cochlear implant users, with linear computational complexity enabling practical use in hearing devices.
What carries the argument
Mamba-based predictor of clean neural audio codec token indices from degraded speech, which uses input-dependent selection to achieve linear complexity instead of quadratic self-attention for long token sequences.
If this is right
- Outperforms baseline methods on both in-domain and out-of-domain datasets according to objective metrics.
- Delivers measurable gains in speech intelligibility for cochlear implant users in adverse noisy and reverberant environments based on subjective tests.
- Provides a linear-complexity alternative to Transformer models suitable for real-time processing on hearing devices.
- Extends potential use to hearing-aid applications where computational efficiency matters.
Where Pith is reading between the lines
- The discrete token representation may allow easier combination with other codec-based audio models for additional processing stages.
- Efficiency advantages could support deployment on battery-constrained devices beyond the tested cochlear implants.
- The approach might generalize to multi-speaker or more varied acoustic scenes not covered in the reported experiments.
Load-bearing premise
That accurate prediction of clean codec token indices from degraded inputs will produce decoded audio that cochlear implant users perceive as more intelligible in real noisy and reverberant conditions.
What would settle it
A listening test in which cochlear implant users show no statistically significant gain in word recognition scores with TokenSE-processed signals versus baseline-enhanced or unprocessed signals under matched noisy conditions.
Figures
read the original abstract
Speech enhancement (SE) is critical for improving speech intelligibility and quality in real-world environments, particularly for cochlear implant (CI) users who experience severe degradations in speech understanding under noisy and reverberant conditions. In this study, we propose TokenSE, a discrete token-based SE framework operating in the neural audio codec space, which predicts clean codec token indices from degraded speech using a Mamba-based model. Unlike the earlier Transformer architecture, whose self-attention mechanism has a computational complexity that grows quadratically with sequence length, the input-dependent selection mechanism of Mamba achieves linear complexity, making it a compelling alternative to Transformers, especially for CI and hearing-aid (HA) applications. Objective evaluations show that TokenSE consistently outperforms baseline methods on both in-domain and out-of-domain datasets. Moreover, subjective listening experiments with CI users indicate clear benefit in speech intelligibility under adverse noisy and reverberant environments.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces TokenSE, a Mamba-based framework for speech enhancement that predicts clean neural audio codec token indices from degraded speech inputs. Targeted at cochlear implant (CI) users, it replaces Transformer self-attention with Mamba's input-dependent selection for linear complexity. The central claims are that TokenSE consistently outperforms baselines on objective metrics (in-domain and out-of-domain datasets) and yields clear speech-intelligibility benefits for CI users in subjective listening tests under noisy and reverberant conditions.
Significance. If the empirical claims hold with adequate controls and statistics, the work would be significant for CI and hearing-aid applications: discrete-token modeling aligns naturally with CI front-ends, and Mamba's linear scaling addresses real-time constraints that quadratic attention cannot. The emphasis on out-of-domain generalization and actual CI-user testing is a strength relative to purely objective-metric papers. No parameter-free derivations or machine-checked proofs are present, so significance rests entirely on the quality of the reported experiments.
major comments (1)
- [Abstract / Results] Abstract and Results section: The claim that 'subjective listening experiments with CI users indicate clear benefit in speech intelligibility under adverse noisy and reverberant environments' is load-bearing for the perceptual contribution yet supplies no participant count, test materials (e.g., sentence lists), presentation levels, CI processor settings, counter-balancing, or statistical tests (p-values, effect sizes). Because objective metrics such as PESQ/STOI are known to correlate only loosely with CI intelligibility, this omission prevents verification that token-index prediction translates into real-world benefit.
minor comments (2)
- [Methods] Methods section: The specific neural audio codec (e.g., EnCodec, SoundStream) and its token rate / vocabulary size are not stated in the provided abstract and should be given explicitly for reproducibility.
- [Results] The abstract refers to 'baseline methods' without naming them; the results tables or text should list the exact competing systems and their configurations.
Simulated Author's Rebuttal
We thank the referee for their careful review and for emphasizing the need for detailed reporting of the subjective listening experiments, which are central to validating the perceptual benefits of TokenSE for CI users. We address the major comment below.
read point-by-point responses
-
Referee: [Abstract / Results] Abstract and Results section: The claim that 'subjective listening experiments with CI users indicate clear benefit in speech intelligibility under adverse noisy and reverberant environments' is load-bearing for the perceptual contribution yet supplies no participant count, test materials (e.g., sentence lists), presentation levels, CI processor settings, counter-balancing, or statistical tests (p-values, effect sizes). Because objective metrics such as PESQ/STOI are known to correlate only loosely with CI intelligibility, this omission prevents verification that token-index prediction translates into real-world benefit.
Authors: We agree that the abstract and Results section must supply these experimental details to substantiate the claim and enable verification, given the known loose correlation between objective metrics and CI intelligibility. The current manuscript version does not include participant count, test materials, presentation levels, CI processor settings, counter-balancing, or statistical tests in the abstract or the summarized Results. In the revised manuscript we will expand the Results section with a full description of the subjective protocol (number of CI participants, sentence lists or other materials, presentation levels, processor settings, counter-balancing, and statistical analysis with p-values and effect sizes) and will revise the abstract to reference these details. This directly addresses the concern and strengthens the perceptual contribution. revision: yes
Circularity Check
No significant circularity in derivation chain
full rationale
The paper presents TokenSE as a standard supervised sequence modeling task: a Mamba network is trained on paired degraded-to-clean neural codec token indices and evaluated on held-out in-domain and out-of-domain data plus separate subjective CI listening tests. No equations or claims reduce the reported performance gains to a fitted parameter, self-definition, or self-citation chain; the objective metrics and subjective intelligibility results are external benchmarks rather than tautological outputs of the model definition itself. The approach follows conventional ML training and evaluation practices without any of the enumerated circularity patterns.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
Stimuli and procedure The stimuli used in the subjective evaluation were selected from the OOD dataset. The root- mean-square (RMS) level of all utterances was normalized to approximately 65 dB, and all audio files were sampled at 16 kHz. The listening tests were conducted using a graphical user interface (GUI) implemented in Python. CI participants perfo...
work page 2080
-
[2]
Subjective evaluation a. Without reverberation. Figure 3 presents the mean WRR for CI recipients under the without-reverberation (noisy-only) condition at 0 dB and 5 dB SNR. At 0 dB SNR, the unprocessed mixture results in a very low WRR, indicating severe intelligibility degradation under highly adverse noise conditions. While Log-MMSE provides a moderate...
work page 2023
-
[3]
The misquote was retracted with an apology
Computational complexity We analyze the computational complexity of backbone architectures used in TokenSE, including Transformer and Mamba (Bi), by estimating the number of floating-point operations (FLOPs). Figure 7 illustrates the GFLOPs of both models at different sequence lengths. As shown in the figure, Mamba (Bi) consistently performs less GFLOPs c...
work page 2022
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.