MindVoice: Reconstructing Intelligible Speech from Non-invasive Neural Signals with Pretrained Priors
Pith reviewed 2026-06-28 21:06 UTC · model grok-4.3
The pith
MindVoice reconstructs intelligible speech from EEG and MEG by recovering semantic content and acoustic attributes separately with pretrained priors before fusing them into generated waveforms.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
MindVoice disentangles reconstruction into two complementary pathways: one recovers high-level semantic content while the other estimates fine-grained acoustic attributes. These inferred representations are fused with powerful speech generation models and in-context voice cloning to synthesize natural and intelligible utterances. Extensive experiments on EEG and MEG demonstrate that MindVoice substantially outperforms existing methods on various metrics, showing that pretrained priors provide a principled way to bridge the gap between noisy neural recordings and natural speech.
What carries the argument
Dual complementary pathways that separately recover semantic content and acoustic attributes from neural signals using pretrained models, then fuse the results with speech generation models.
If this is right
- MindVoice substantially outperforms existing methods on various metrics when tested on EEG and MEG recordings.
- Pretrained priors supply the missing semantic and acoustic information that direct neural-to-speech mappings lack.
- The resulting outputs are natural and intelligible utterances rather than spectral-similar but unintelligible waveforms.
- The framework offers a route to scalable non-invasive speech brain-computer interfaces.
Where Pith is reading between the lines
- The same separation of semantic and acoustic recovery could be tested on other non-invasive signals such as fNIRS to check whether the performance gain generalizes.
- If the fusion step can be made faster, the approach might support real-time speech synthesis from ongoing brain recordings.
- The method supplies an indirect way to measure how much semantic versus acoustic detail survives in different neural recording modalities.
Load-bearing premise
Pretrained models can accurately recover high-level semantic content and fine-grained acoustic attributes from the incomplete and noisy information present in non-invasive neural recordings.
What would settle it
An ablation test in which removing the pretrained priors from both pathways drops word error rate or intelligibility scores to levels indistinguishable from earlier direct-mapping baselines on the same EEG and MEG datasets.
Figures
read the original abstract
Reconstructing continuous speech from non-invasive neural recordings is a fundamental problem for probing human auditory perception and building safe, scalable speech brain-computer interfaces. Despite recent progress, intelligible reconstruction remains elusive, as non-invasive recordings are inherently noisy, spatially blurred, and only partially preserve information about perceived speech. Existing methods directly map neural activity to entangled speech representations before synthesizing waveforms with neural vocoders, resulting in spectral-similar but unintelligible results. To overcome these limitations, we introduce MindVoice, a neuro-to-speech reconstruction framework that uses pretrained models to compensate for the incomplete semantic and acoustic information in neural recordings. MindVoice disentangles reconstruction into two complementary pathways: one recovers high-level semantic content, while the other estimates fine-grained acoustic attributes. These inferred representations are then fused with powerful speech generation models and in-context voice cloning to synthesize natural and intelligible utterances. Extensive experiments on EEG and MEG demonstrate that MindVoice substantially outperforms existing methods on various metrics. These results show that pretrained priors provide a principled way to bridge the gap between noisy neural recordings and natural speech, highlighting a promising attempt for auditory neuroscience research and non-invasive speech brain-computer interfaces.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces MindVoice, a neuro-to-speech reconstruction framework that disentangles the task into two complementary pathways—one recovering high-level semantic content from neural signals and the other estimating fine-grained acoustic attributes—then fuses these with pretrained speech generation models and in-context voice cloning to produce intelligible utterances from noisy, non-invasive EEG/MEG recordings. It claims this substantially outperforms existing direct-mapping methods on various metrics and shows that pretrained priors provide a principled bridge to natural speech.
Significance. If the empirical results hold after proper validation, the work would be significant for non-invasive speech BCIs and auditory neuroscience. The disentangled pathways plus fusion with external generative priors is a clear methodological step beyond direct neural-to-representation mapping, and the approach avoids circularity by relying on independent pretrained models.
major comments (2)
- [Abstract] Abstract: the central claim of substantial outperformance on EEG/MEG is stated without any quantitative metrics, error bars, dataset sizes, or statistical tests. This absence prevents verification of whether the reported gains are load-bearing or affected by post-hoc choices.
- [Methods/Results] Methods/Results (pathway fusion description): no ablation is reported that isolates the contribution of the neural-derived semantic and acoustic representations versus the pretrained speech generators alone. Without this test, the claim that neural signals meaningfully drive intelligible output (rather than the generators synthesizing from weak cues) remains unverified and is load-bearing for the central thesis.
minor comments (1)
- [Methods] Notation for the two pathways and their fusion step could be formalized with equations to improve reproducibility.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address each major comment point by point below, agreeing where revisions are warranted to strengthen the manuscript.
read point-by-point responses
-
Referee: [Abstract] Abstract: the central claim of substantial outperformance on EEG/MEG is stated without any quantitative metrics, error bars, dataset sizes, or statistical tests. This absence prevents verification of whether the reported gains are load-bearing or affected by post-hoc choices.
Authors: We agree that the abstract would be strengthened by including key quantitative details. In the revised version, we will incorporate specific metrics (e.g., WER improvements, dataset sizes, and statistical tests) drawn from the results section to allow immediate assessment of the claims. revision: yes
-
Referee: [Methods/Results] Methods/Results (pathway fusion description): no ablation is reported that isolates the contribution of the neural-derived semantic and acoustic representations versus the pretrained speech generators alone. Without this test, the claim that neural signals meaningfully drive intelligible output (rather than the generators synthesizing from weak cues) remains unverified and is load-bearing for the central thesis.
Authors: This point is well-taken and directly addresses a load-bearing aspect of the central thesis. We will add a dedicated ablation experiment in the revised manuscript that compares the full pipeline against variants using only the pretrained generators (with neural inputs replaced by noise or null signals) to quantify the specific contribution of the neural-derived representations. revision: yes
Circularity Check
No significant circularity; derivation relies on external pretrained models
full rationale
The paper presents MindVoice as a framework that disentangles neural-to-representation pathways and fuses outputs with independent pretrained speech generation models and voice cloning. No equations, fitted parameters, or self-citations are shown that reduce any claimed prediction or result to the input neural data by construction. The pretrained priors are described as external and independent of the EEG/MEG recordings, satisfying the condition for self-contained derivation against external benchmarks. The skeptic concern about ablation is a question of empirical validation, not circularity in the derivation chain.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Pretrained models can recover high-level semantic content and fine-grained acoustic attributes from incomplete neural signals.
Reference graph
Works this paper leans on
-
[1]
The cortical organization of speech processing.Nature reviews neuroscience, 8(5):393–402, 2007
Gregory Hickok and David Poeppel. The cortical organization of speech processing.Nature reviews neuroscience, 8(5):393–402, 2007
2007
-
[2]
Maps and streams in the auditory cortex: nonhuman primates illuminate human speech processing.Nature neuroscience, 12(6):718–724, 2009
Josef P Rauschecker and Sophie K Scott. Maps and streams in the auditory cortex: nonhuman primates illuminate human speech processing.Nature neuroscience, 12(6):718–724, 2009
2009
-
[3]
Hierarchical processing in spoken language comprehension
Matthew H Davis and Ingrid S Johnsrude. Hierarchical processing in spoken language comprehension. Journal of Neuroscience, 23(8):3423–3431, 2003
2003
-
[4]
Phonetic feature encoding in human superior temporal gyrus.Science, 343(6174):1006–1010, 2014
Nima Mesgarani, Connie Cheung, Keith Johnson, and Edward F Chang. Phonetic feature encoding in human superior temporal gyrus.Science, 343(6174):1006–1010, 2014
2014
-
[5]
An instantaneous voice-synthesis neuroprosthesis.Nature, 644(8075):145–152, 2025
Maitreyee Wairagkar, Nicholas S Card, Tyler Singer-Clark, Xianda Hou, Carrina Iacobacci, Lee M Miller, Leigh R Hochberg, David M Brandman, and Sergey D Stavisky. An instantaneous voice-synthesis neuroprosthesis.Nature, 644(8075):145–152, 2025
2025
-
[6]
Speech synthesis from neural decoding of spoken sentences.Nature, 568(7753):493–498, 2019
Gopala K Anumanchipalli, Josh Chartier, and Edward F Chang. Speech synthesis from neural decoding of spoken sentences.Nature, 568(7753):493–498, 2019
2019
-
[7]
Inner speech in motor cortex and implications for speech neuroprostheses.Cell, 188(17):4658–4673, 2025
Erin M Kunz, Benyamin Abramovich Krasa, Foram Kamdar, Donald T Avansino, Nick Hahn, Seonghyun Yoon, Akansha Singh, Samuel R Nason-Tomaszewski, Nicholas S Card, Justin J Jude, et al. Inner speech in motor cortex and implications for speech neuroprostheses.Cell, 188(17):4658–4673, 2025
2025
-
[8]
A streaming brain- to-voice neuroprosthesis to restore naturalistic communication.Nature neuroscience, 28(4):902–912, 2025
Kaylo T Littlejohn, Cheol Jun Cho, Jessie R Liu, Alexander B Silva, Bohan Yu, Vanessa R Anderson, Cady M Kurtz-Miott, Samantha Brosler, Anshul P Kashyap, Irina P Hallinan, et al. A streaming brain- to-voice neuroprosthesis to restore naturalistic communication.Nature neuroscience, 28(4):902–912, 2025
2025
-
[9]
A high-performance neuroprosthesis for speech decoding and avatar control.Nature, 620(7976):1037–1046, 2023
Sean L Metzger, Kaylo T Littlejohn, Alexander B Silva, David A Moses, Margaret P Seaton, Ran Wang, Maximilian E Dougherty, Jessie R Liu, Peter Wu, Michael A Berger, et al. A high-performance neuroprosthesis for speech decoding and avatar control.Nature, 620(7976):1037–1046, 2023
2023
-
[10]
Neural responses to uninterrupted natural speech can be extracted with precise temporal resolution.European journal of neuroscience, 31(1):189–193, 2010
Edmund C Lalor and John J Foxe. Neural responses to uninterrupted natural speech can be extracted with precise temporal resolution.European journal of neuroscience, 31(1):189–193, 2010
2010
-
[11]
Comparing the potential of meg and eeg to uncover brain tracking of speech temporal envelope.Neuroimage, 184:201–213, 2019
Florian Destoky, Morgane Philippe, Julie Bertels, Marie Verhasselt, Nicolas Coquelet, Marc Vander Ghinst, Vincent Wens, Xavier De Tiège, and Mathieu Bourguignon. Comparing the potential of meg and eeg to uncover brain tracking of speech temporal envelope.Neuroimage, 184:201–213, 2019
2019
-
[12]
Attentional selection in a cocktail party environment can be decoded from single-trial eeg.Cerebral cortex, 25(7):1697–1706, 2015
James A O’sullivan, Alan J Power, Nima Mesgarani, Siddharth Rajaram, John J Foxe, Barbara G Shinn- Cunningham, Malcolm Slaney, Shihab A Shamma, and Edmund C Lalor. Attentional selection in a cocktail party environment can be decoded from single-trial eeg.Cerebral cortex, 25(7):1697–1706, 2015. 10
2015
-
[13]
Decoding selective auditory attention with eeg using a transformer model.Methods, 204:410–417, 2022
Zihao Xu, Yanru Bai, Ran Zhao, Hongmei Hu, Guangjian Ni, and Dong Ming. Decoding selective auditory attention with eeg using a transformer model.Methods, 204:410–417, 2022
2022
-
[14]
Decoding of the speech envelope from eeg using the vlaai deep neural network.Scientific Reports, 13(1):812, 2023
Bernd Accou, Jonas Vanthornhout, Hugo Van hamme, and Tom Francart. Decoding of the speech envelope from eeg using the vlaai deep neural network.Scientific Reports, 13(1):812, 2023
2023
-
[15]
sound of silence
Rini A Sharon and Hema A Murthy. The “sound of silence” in eeg—cognitive voice activity detection. In Proceedings of Interspeech, pages 2767–2771, 2020
2020
-
[16]
Miran Özdogan, Gilad Landau, Gereon Elvers, Dulhan Jayalath, Pratik Somaiya, Francesco Mantegna, Mark Woolrich, and Oiwi Parker Jones. Libribrain: Over 50 hours of within-subject meg to improve speech decoding methods at scale.arXiv preprint arXiv:2506.02098, 2025
-
[17]
Gilad Landau, Miran Özdogan, Gereon Elvers, Francesco Mantegna, Pratik Somaiya, Dulhan Jayalath, Luisa Kurth, Teyun Kwon, Brendan Shillingford, Greg Farquhar, et al. The 2025 pnpl competition: Speech detection and phoneme classification in the libribrain dataset.arXiv preprint arXiv:2506.10165, 2025
-
[18]
Xabier de Zuazo, Ibon Saratxaga, and Eva Navas. Megconformer: Conformer-based meg decoder for robust speech and phoneme classification.arXiv preprint arXiv:2512.01443, 2025
-
[19]
Xiran Xu, Yujie Yan, Xihong Wu, and Jing Chen. Shine: Sequential hierarchical integration network for eeg and meg.arXiv preprint arXiv:2602.23960, 2026
-
[20]
Decoding speech perception from non-invasive brain recordings.Nature Machine Intelligence, 5(10):1097–1107, 2023
Alexandre Défossez, Charlotte Caucheteux, Jérémy Rapin, Ori Kabeli, and Jean-Rémi King. Decoding speech perception from non-invasive brain recordings.Nature Machine Intelligence, 5(10):1097–1107, 2023
2023
-
[21]
Cross-attention-guided wavenet for eeg-to-mel spectrogram reconstruction
Hao Li, Yuan Fang, Xueliang Zhang, Fei Chen, and Guanglai Gao. Cross-attention-guided wavenet for eeg-to-mel spectrogram reconstruction. InProceedings of Interspeech, 2024
2024
-
[22]
Towards decoding individual words from non-invasive brain recordings.Nature Communications, 16(1):10521, 2025
Stéphane d’Ascoli, Corentin Bel, Jérémy Rapin, Hubert Banville, Yohann Benchetrit, Christophe Pallier, and Jean-Rémi King. Towards decoding individual words from non-invasive brain recordings.Nature Communications, 16(1):10521, 2025
2025
-
[23]
Decoding the temporal dynamics of spoken word and nonword processing from eeg
Bob McMurray, McCall E Sarrett, Samantha Chiu, Alexis K Black, Alice Wang, Rebecca Canale, and Richard N Aslin. Decoding the temporal dynamics of spoken word and nonword processing from eeg. NeuroImage, 260:119457, 2022
2022
-
[24]
Toward fully-end-to-end listened speech decoding from eeg signals
Jihwan Lee, Aditya Kommineni, Tiantian Feng, Kleanthis Avramidis, Xuan Shi, Sudarsana Reddy Kadiri, and Shrikanth Narayanan. Toward fully-end-to-end listened speech decoding from eeg signals. In Interspeech, pages 1500–1504, 2024
2024
-
[25]
En- hancing listened speech decoding from eeg via parallel phoneme sequence prediction
Jihwan Lee, Tiantian Feng, Aditya Kommineni, Sudarsana Reddy Kadiri, and Shrikanth Narayanan. En- hancing listened speech decoding from eeg via parallel phoneme sequence prediction. InIEEE International Conference on Acoustics, Speech and Signal Processing, pages 1–5. IEEE, 2025
2025
-
[26]
Robust speech recognition via large-scale weak supervision
Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, and Ilya Sutskever. Robust speech recognition via large-scale weak supervision. InInternational conference on machine learning, pages 28492–28518. PMLR, 2023
2023
-
[27]
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova
Loïc Barrault, Yu-An Chung, Mariano Cora Meglioli, David Dale, Ning Dong, Paul-Ambroise Duquenne, Hady Elsahar, Hongyu Gong, Kevin Heffernan, John Hoffman, et al. Seamlessm4t: Massively multilingual & multimodal machine translation.arXiv preprint arXiv:2308.11596, 2023
-
[28]
Available: https://arxiv.org/abs/2312.05187
Loïc Barrault, Yu-An Chung, Mariano Coria Meglioli, David Dale, Ning Dong, Mark Duppenthaler, Paul-Ambroise Duquenne, Brian Ellis, Hady Elsahar, Justin Haaheim, et al. Seamless: Multilingual expressive and streaming speech translation.arXiv preprint arXiv:2312.05187, 2023
-
[29]
Funasr: A fundamental end-to-end speech recognition toolkit
Zhifu Gao, Zerui Li, Jiaming Wang, Haoneng Luo, Xian Shi, Mengzhe Chen, Yabin Li, Lingyun Zuo, Zhihao Du, and Shiliang Zhang. Funasr: A fundamental end-to-end speech recognition toolkit. In Proceedings of Interspeech, pages 1593–1597, 2023
2023
-
[30]
Xian Shi, Xiong Wang, Zhifang Guo, Yongqi Wang, Pei Zhang, Xinyu Zhang, Zishan Guo, Hongkun Hao, Yu Xi, Baosong Yang, et al. Qwen3-asr technical report.arXiv preprint arXiv:2601.21337, 2026
work page internal anchor Pith review Pith/arXiv arXiv 2026
-
[31]
CosyVoice 2: Scalable Streaming Speech Synthesis with Large Language Models
Zhihao Du, Yuxuan Wang, Qian Chen, Xian Shi, Xiang Lv, Tianyu Zhao, Zhifu Gao, Yexin Yang, Changfeng Gao, Hui Wang, et al. Cosyvoice 2: Scalable streaming speech synthesis with large language models.arXiv preprint arXiv:2412.10117, 2024. 11
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[32]
F5-tts: A fairytaler that fakes fluent and faithful speech with flow matching
Yushen Chen, Zhikang Niu, Ziyang Ma, Keqi Deng, Chunhui Wang, JianZhao JianZhao, Kai Yu, and Xie Chen. F5-tts: A fairytaler that fakes fluent and faithful speech with flow matching. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics, pages 6255–6271, 2025
2025
-
[33]
Shijia Liao, Yuxuan Wang, Tianyu Li, Yifan Cheng, Ruoyi Zhang, Rongzhi Zhou, and Yijin Xing. Fish- speech: Leveraging large language models for advanced multilingual text-to-speech synthesis.arXiv preprint arXiv:2411.01156, 2024
-
[34]
Hangrui Hu, Xinfa Zhu, Ting He, Dake Guo, Bin Zhang, Xiong Wang, Zhifang Guo, Ziyue Jiang, Hongkun Hao, Zishan Guo, et al. Qwen3-tts technical report.arXiv preprint arXiv:2601.15621, 2026
work page internal anchor Pith review Pith/arXiv arXiv 2026
-
[35]
An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report.arXiv preprint arXiv:2505.09388, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[36]
Deepseek-r1 incentivizes reasoning in llms through reinforcement learning
Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, et al. Deepseek-r1 incentivizes reasoning in llms through reinforcement learning. Nature, 645(8081):633–638, 2025
2025
-
[37]
Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al- Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[38]
Brain decoding: toward real-time reconstruction of visual perception
Yohann Benchetrit, Hubert Banville, and Jean-Remi King. Brain decoding: toward real-time reconstruction of visual perception. InThe Twelfth International Conference on Learning Representations
-
[39]
Seeing beyond the brain: Conditional diffusion model with sparse masked modeling for vision decoding
Zijiao Chen, Jiaxin Qing, Tiange Xiang, Wan Lin Yue, and Juan Helen Zhou. Seeing beyond the brain: Conditional diffusion model with sparse masked modeling for vision decoding. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 22710–22720, 2023
2023
-
[40]
Reconstructing the mind’s eye: fmri-to- image with contrastive learning and diffusion priors.Advances in Neural Information Processing Systems, 36:24705–24728, 2023
Paul Scotti, Atmadeep Banerjee, Jimmie Goode, Stepan Shabalin, Alex Nguyen, Aidan Dempster, Nathalie Verlinde, Elad Yundler, David Weisberg, Kenneth Norman, et al. Reconstructing the mind’s eye: fmri-to- image with contrastive learning and diffusion priors.Advances in Neural Information Processing Systems, 36:24705–24728, 2023
2023
-
[41]
Mindeye2: shared-subject models enable fmri-to-image with 1 hour of data
Paul S Scotti, Mihir Tripathy, Cesare Kadir Torrico Villanueva, Reese Kneeland, Tong Chen, Ashutosh Narang, Charan Santhirasegaran, Jonathan Xu, Thomas Naselaris, Kenneth A Norman, et al. Mindeye2: shared-subject models enable fmri-to-image with 1 hour of data. InProceedings of the 41st International Conference on Machine Learning, pages 44038–44059, 2024
2024
-
[42]
Visual decoding and reconstruction via eeg embeddings with guided diffusion
Dongyang Li, Chen Wei, Shiying Li, Jiachen Zou, and Quanying Liu. Visual decoding and reconstruction via eeg embeddings with guided diffusion. InProceedings of the 38th International Conference on Neural Information Processing Systems, pages 102822–102864, 2024
2024
-
[43]
Cinematic mindscapes: High-quality video reconstruction from brain activity.Advances in Neural Information Processing Systems, 36:24841–24858, 2023
Zijiao Chen, Jiaxin Qing, and Juan Helen Zhou. Cinematic mindscapes: High-quality video reconstruction from brain activity.Advances in Neural Information Processing Systems, 36:24841–24858, 2023
2023
-
[44]
Neuroclips: Towards high-fidelity and smooth fmri-to-video reconstruction.Advances in Neural Information Processing Systems, 37:51655–51683, 2024
Zixuan Gong, Guangyin Bao, Qi Zhang, Zhongwei Wan, Duoqian Miao, Shoujin Wang, Lei Zhu, Chang- wei Wang, Rongtao Xu, Liang Hu, et al. Neuroclips: Towards high-fidelity and smooth fmri-to-video reconstruction.Advances in Neural Information Processing Systems, 37:51655–51683, 2024
2024
-
[45]
Eeg2video: Towards decoding dynamic visual perception from eeg signals.Advances in Neural Information Processing Systems, 37:72245–72273, 2024
Xuan-Hao Liu, Yan-Kai Liu, Yansen Wang, Kan Ren, Hanwen Shi, Zilong Wang, Dongsheng Li, Bao- Liang Lu, and Wei-Long Zheng. Eeg2video: Towards decoding dynamic visual perception from eeg signals.Advances in Neural Information Processing Systems, 37:72245–72273, 2024
2024
-
[46]
Animate your thoughts: Reconstruction of dynamic natural vision from human brain activity
Yizhuo Lu, Changde Du, Chong Wang, Xuanliu Zhu, Liuyun Jiang, Xujin Li, and Huiguang He. Animate your thoughts: Reconstruction of dynamic natural vision from human brain activity. InThe Thirteenth International Conference on Learning Representations, 2025
2025
-
[47]
Reanimating images using neural representations of dynamic stimuli
Jacob Yeung, Andrew F Luo, Gabriel Sarch, Margaret M Henderson, Deva Ramanan, and Michael J Tarr. Reanimating images using neural representations of dynamic stimuli. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5331–5343, 2025
2025
-
[48]
Hierarchical structure guides rapid linguistic predictions during naturalistic listening.PloS one, 14(1):e0207741, 2019
Jonathan R Brennan and John T Hale. Hierarchical structure guides rapid linguistic predictions during naturalistic listening.PloS one, 14(1):e0207741, 2019
2019
-
[49]
Introducing meg-masc a high-quality magneto-encephalography dataset for evaluating natural speech processing.Scientific data, 10(1):862, 2023
Laura Gwilliams, Graham Flick, Alec Marantz, Liina Pylkkänen, David Poeppel, and Jean-Rémi King. Introducing meg-masc a high-quality magneto-encephalography dataset for evaluating natural speech processing.Scientific data, 10(1):862, 2023. 12
2023
-
[50]
A high-performance speech neuroprosthesis.Nature, 620(7976):1031–1036, 2023
Francis R Willett, Erin M Kunz, Chaofei Fan, Donald T Avansino, Guy H Wilson, Eun Young Choi, Foram Kamdar, Matthew F Glasser, Leigh R Hochberg, Shaul Druckmann, et al. A high-performance speech neuroprosthesis.Nature, 620(7976):1031–1036, 2023
2023
-
[51]
Decoding intended speech with an intracortical brain-computer interface in a person with long-standing anarthria and locked-in syndrome.Cell Reports, 45(4), 2026
Justin J Jude, Stephanie Haro, Hadar Levi-Aharoni, Hiroaki Hashimoto, Alexander J Acosta, Nicholas S Card, Maitreyee Wairagkar, David M Brandman, Sergey D Stavisky, Ziv M Williams, et al. Decoding intended speech with an intracortical brain-computer interface in a person with long-standing anarthria and locked-in syndrome.Cell Reports, 45(4), 2026
2026
-
[52]
Towards voice reconstruction from eeg during imagined speech
Young-Eun Lee, Seo-Hyun Lee, Sang-Ho Kim, and Seong-Whan Lee. Towards voice reconstruction from eeg during imagined speech. InProceedings of the AAAI conference on artificial intelligence, volume 37, pages 6030–6038, 2023
2023
-
[53]
Imagined speech classification using eeg and deep learning.Bioengineering, 10(6):649, 2023
Mokhles M Abdulghani, Wilbur L Walters, and Khalid H Abed. Imagined speech classification using eeg and deep learning.Bioengineering, 10(6):649, 2023
2023
-
[54]
Motoshige Sato, Kenichi Tomeoka, Ilya Horiguchi, Kai Arulkumaran, Ryota Kanai, and Shuntaro Sasai. Scaling law in neural data: Non-invasive speech decoding with 175 hours of eeg data.arXiv preprint arXiv:2407.07595, 2024
-
[55]
Mad: Multi-alignment meg-to-text decoding.arXiv preprint arXiv:2406.01512, 2024
Yiqian Yang, Hyejeong Jo, Yiqun Duan, Qiang Zhang, Jinni Zhou, Xuming Hu, Won Hee Lee, Renjing Xu, and Hui Xiong. Mad: Multi-alignment meg-to-text decoding.arXiv preprint arXiv:2406.01512, 2024
-
[56]
Brainecho: Semantic brain signal decoding through vector-quantized spectrogram reconstruction for whisper-enhanced text generation
Jilong Li, Zhenxi Song, Jiaqi Wang, Meishan Zhang, Honghai Liu, Min Zhang, and Zhiguo Zhang. Brainecho: Semantic brain signal decoding through vector-quantized spectrogram reconstruction for whisper-enhanced text generation. InFindings of the Association for Computational Linguistics: ACL 2025, pages 2762–2778, 2025
2025
-
[57]
Neuspeech: Decode neural signal as speech
Yiqian Yang, Yiqun Duan, Qiang Zhang, Hyejeong Jo, Jinni Zhou, Won Hee Lee, Renjing Xu, and Hui Xiong. Neuspeech: Decode neural signal as speech. InIEEE International Conference on Acoustics, Speech and Signal Processing, pages 6636–6640. IEEE, 2026
2026
-
[58]
Towards sentence level imagined speech generation from eeg signals
Sparsh Rastogi, Harsh Dadwal, Khushboo Modi, Jatin Bedi, and Jasmeet Singh. Towards sentence level imagined speech generation from eeg signals. InProc. Interspeech 2025, pages 5558–5562, 2025
2025
-
[59]
Are eeg-to-text models working?arXiv preprint arXiv:2405.06459, 2024
Hyejeong Jo, Yiqian Yang, Juhyeok Han, Yiqun Duan, Hui Xiong, and Won Hee Lee. Are eeg-to-text models working?arXiv preprint arXiv:2405.06459, 2024
-
[60]
Deep speech 2: End-to-end speech recognition in english and mandarin
Dario Amodei, Sundaram Ananthanarayanan, Rishita Anubhai, Jingliang Bai, Eric Battenberg, Carl Case, Jared Casper, Bryan Catanzaro, Qiang Cheng, Guoliang Chen, et al. Deep speech 2: End-to-end speech recognition in english and mandarin. InInternational conference on machine learning, pages 173–182. PMLR, 2016
2016
-
[61]
Conformer: Convolution-augmented transformer for speech recognition
Anmol Gulati, James Qin, Chung-Cheng Chiu, Niki Parmar, Yu Zhang, Jiahui Yu, Wei Han, Shibo Wang, Zhengdong Zhang, Yonghui Wu, et al. Conformer: Convolution-augmented transformer for speech recognition. InProc. Interspeech 2020, pages 5036–5040, 2020
2020
-
[62]
wav2vec 2.0: A framework for self-supervised learning of speech representations.Advances in neural information processing systems, 33:12449–12460, 2020
Alexei Baevski, Yuhao Zhou, Abdelrahman Mohamed, and Michael Auli. wav2vec 2.0: A framework for self-supervised learning of speech representations.Advances in neural information processing systems, 33:12449–12460, 2020
2020
-
[63]
Deep voice: Real-time neural text-to-speech
Sercan Ö Arık, Mike Chrzanowski, Adam Coates, Gregory Diamos, Andrew Gibiansky, Yongguo Kang, Xian Li, John Miller, Andrew Ng, Jonathan Raiman, et al. Deep voice: Real-time neural text-to-speech. In International conference on machine learning, pages 195–204. PMLR, 2017
2017
-
[64]
Fastspeech: Fast, robust and controllable text to speech.Advances in neural information processing systems, 32, 2019
Yi Ren, Yangjun Ruan, Xu Tan, Tao Qin, Sheng Zhao, Zhou Zhao, and Tie-Yan Liu. Fastspeech: Fast, robust and controllable text to speech.Advances in neural information processing systems, 32, 2019
2019
-
[65]
Fastspeech 2: Fast and high-quality end-to-end text to speech
Yi Ren, Chenxu Hu, Xu Tan, Tao Qin, Sheng Zhao, Zhou Zhao, and Tie-Yan Liu. Fastspeech 2: Fast and high-quality end-to-end text to speech. InInternational Conference on Learning Representations
-
[66]
Glow-tts: A generative flow for text-to- speech via monotonic alignment search.Advances in Neural Information Processing Systems, 33:8067– 8077, 2020
Jaehyeon Kim, Sungwon Kim, Jungil Kong, and Sungroh Yoon. Glow-tts: A generative flow for text-to- speech via monotonic alignment search.Advances in Neural Information Processing Systems, 33:8067– 8077, 2020
2020
-
[67]
Conditional variational autoencoder with adversarial learning for end-to-end text-to-speech
Jaehyeon Kim, Jungil Kong, and Juhee Son. Conditional variational autoencoder with adversarial learning for end-to-end text-to-speech. InInternational conference on machine learning, pages 5530–5540. PMLR, 2021. 13
2021
-
[68]
Bigvgan: A universal neural vocoder with large-scale training
Sang-gil Lee, Wei Ping, Boris Ginsburg, Bryan Catanzaro, and Sungroh Yoon. Bigvgan: A universal neural vocoder with large-scale training. InThe Eleventh International Conference on Learning Representations, 2023
2023
-
[69]
Hifi-gan: Generative adversarial networks for efficient and high fidelity speech synthesis.Advances in neural information processing systems, 33:17022–17033, 2020
Jungil Kong, Jaehyeon Kim, and Jaekyoung Bae. Hifi-gan: Generative adversarial networks for efficient and high fidelity speech synthesis.Advances in neural information processing systems, 33:17022–17033, 2020
2020
-
[70]
Mel-cepstral distance measure for objective speech quality assessment
Robert Kubichek. Mel-cepstral distance measure for objective speech quality assessment. InProceedings of IEEE pacific rim conference on communications computers and signal processing, volume 1, pages 125–128. IEEE, 1993
1993
-
[71]
Hubert: Self-supervised speech representation learning by masked prediction of hidden units.IEEE/ACM transactions on audio, speech, and language processing, 29:3451–3460, 2021
Wei-Ning Hsu, Benjamin Bolte, Yao-Hung Hubert Tsai, Kushal Lakhotia, Ruslan Salakhutdinov, and Abdelrahman Mohamed. Hubert: Self-supervised speech representation learning by masked prediction of hidden units.IEEE/ACM transactions on audio, speech, and language processing, 29:3451–3460, 2021
2021
-
[72]
Bertscore: Evaluating text generation with bert
Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q Weinberger, and Yoav Artzi. Bertscore: Evaluating text generation with bert. InInternational Conference on Learning Representations, 2020
2020
-
[73]
Wavlm: Large-scale self-supervised pre-training for full stack speech processing.IEEE Journal of Selected Topics in Signal Processing, 16(6):1505–1518, 2022
Sanyuan Chen, Chengyi Wang, Zhengyang Chen, Yu Wu, Shujie Liu, Zhuo Chen, Jinyu Li, Naoyuki Kanda, Takuya Yoshioka, Xiong Xiao, et al. Wavlm: Large-scale self-supervised pre-training for full stack speech processing.IEEE Journal of Selected Topics in Signal Processing, 16(6):1505–1518, 2022
2022
-
[74]
Utmos: Utokyo-sarulab system for voicemos challenge 2022.Interspeech, 2022
Takaaki Saeki, Detai Xin, Wataru Nakata, Tomoki Koriyama, Shinnosuke Takamichi, and Hiroshi Saruwatari. Utmos: Utokyo-sarulab system for voicemos challenge 2022.Interspeech, 2022. 14 A Additional Experiment Settings A.1 Datasets and Preprocessing Brennan EEG dataset.Brennan EEG dataset is a naturalistic listening EEG dataset introduced by Brennan and Hale...
2022
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.