pith. sign in

arxiv: 2605.31173 · v1 · pith:YEETSPKVnew · submitted 2026-05-29 · 💻 cs.SD · cs.AI

MindVoice: Reconstructing Intelligible Speech from Non-invasive Neural Signals with Pretrained Priors

Pith reviewed 2026-06-28 21:06 UTC · model grok-4.3

classification 💻 cs.SD cs.AI
keywords speech reconstructionEEGMEGpretrained modelsbrain-computer interfaceneural signalssemantic recoveryacoustic estimation
0
0 comments X

The pith

MindVoice reconstructs intelligible speech from EEG and MEG by recovering semantic content and acoustic attributes separately with pretrained priors before fusing them into generated waveforms.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tries to establish that existing direct-mapping approaches produce spectral-similar but unintelligible speech because they cannot handle the incomplete and noisy information in non-invasive recordings. It shows that splitting the task into two pathways—one for high-level semantics and one for fine-grained acoustics—then combining the outputs with pretrained speech generation models and voice cloning yields natural, intelligible results. A sympathetic reader would care because this offers a route to practical non-invasive speech brain-computer interfaces and better tools for studying auditory perception. The experiments on EEG and MEG data demonstrate clear gains over prior methods. The central mechanism is the use of pretrained priors to fill gaps that raw neural signals leave open.

Core claim

MindVoice disentangles reconstruction into two complementary pathways: one recovers high-level semantic content while the other estimates fine-grained acoustic attributes. These inferred representations are fused with powerful speech generation models and in-context voice cloning to synthesize natural and intelligible utterances. Extensive experiments on EEG and MEG demonstrate that MindVoice substantially outperforms existing methods on various metrics, showing that pretrained priors provide a principled way to bridge the gap between noisy neural recordings and natural speech.

What carries the argument

Dual complementary pathways that separately recover semantic content and acoustic attributes from neural signals using pretrained models, then fuse the results with speech generation models.

If this is right

  • MindVoice substantially outperforms existing methods on various metrics when tested on EEG and MEG recordings.
  • Pretrained priors supply the missing semantic and acoustic information that direct neural-to-speech mappings lack.
  • The resulting outputs are natural and intelligible utterances rather than spectral-similar but unintelligible waveforms.
  • The framework offers a route to scalable non-invasive speech brain-computer interfaces.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same separation of semantic and acoustic recovery could be tested on other non-invasive signals such as fNIRS to check whether the performance gain generalizes.
  • If the fusion step can be made faster, the approach might support real-time speech synthesis from ongoing brain recordings.
  • The method supplies an indirect way to measure how much semantic versus acoustic detail survives in different neural recording modalities.

Load-bearing premise

Pretrained models can accurately recover high-level semantic content and fine-grained acoustic attributes from the incomplete and noisy information present in non-invasive neural recordings.

What would settle it

An ablation test in which removing the pretrained priors from both pathways drops word error rate or intelligibility scores to levels indistinguishable from earlier direct-mapping baselines on the same EEG and MEG datasets.

Figures

Figures reproduced from arXiv: 2605.31173 by Guangyin Bao, Jianfeng Feng, Taiping Zeng, Xiangyang Xue.

Figure 1
Figure 1. Figure 1: Decoding of auditory-evoked non-invasive neural recodings. a, most extensive researches focus on decoding paraspeech information, whereas our goal is to directly reconstruct continuous and intelligible speech waveforms. b, existing EEG-to-speech methods align EEG with entangled speech representations and reconstruct speech using neural vocoders. c, our framework adopts a dual-stream architecture comprising… view at source ↗
Figure 2
Figure 2. Figure 2: Overview of the proposed MindVoice framework. embedder, a speech vector-quantized autoencoder, and a neuro-to-semantic aligner, which together generate the reconstructed text tokens Tˆ. Neural Signal Embedder. We treat EEG/MEG signals with temporal and channel dimensions as single-channel images, i.e., Xi ∈ R 1×t×c . Under this formulation, cascaded CNNs progressively compress the input and learn represent… view at source ↗
Figure 3
Figure 3. Figure 3: Mel spectrogram and transcription results. a, comparison of mel spectrograms obtained by different methods with the ground-truth. b, transcription results of ground-truth speech, FESDE reconstructed speech, and our reconstructed speech using the Qwen3-ASR. More results can be found in Appendix B.1 [PITH_FULL_IMAGE:figures/full_fig_p008_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Analyses of model interpretability and preference. a, channel heatmap for semantic- and acoustic￾level stream. Each pixel represents a neural signal channel. b, sentence length regression analysis between reconstructions and ground truth. c, baseline-scaled BERTScore-F1 across sentence length groups. could preserve the main sentence structure, partial keywords, and grammatical patterns, although it may sti… view at source ↗
Figure 5
Figure 5. Figure 5: Additional qualitative results on the Brennan EEG dataset. Randomly select. [PITH_FULL_IMAGE:figures/full_fig_p019_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Additional qualitative results on the Gwilliams MEG dataset. Randomly select. [PITH_FULL_IMAGE:figures/full_fig_p020_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Fine-grained linguistic preference analysis of MEG-to-text reconstruction. a, ground￾truth and reconstructed sentence-length distributions. b, precision and recall of discourse/relation markers. c, POS-specific precision, recall, and F1. d, zipf frequency distributions of content words in five groups: GT, Recon, Hit, Missed, and Hallucinated. e–g, associations between BERTScore and function-word recall, ha… view at source ↗
read the original abstract

Reconstructing continuous speech from non-invasive neural recordings is a fundamental problem for probing human auditory perception and building safe, scalable speech brain-computer interfaces. Despite recent progress, intelligible reconstruction remains elusive, as non-invasive recordings are inherently noisy, spatially blurred, and only partially preserve information about perceived speech. Existing methods directly map neural activity to entangled speech representations before synthesizing waveforms with neural vocoders, resulting in spectral-similar but unintelligible results. To overcome these limitations, we introduce MindVoice, a neuro-to-speech reconstruction framework that uses pretrained models to compensate for the incomplete semantic and acoustic information in neural recordings. MindVoice disentangles reconstruction into two complementary pathways: one recovers high-level semantic content, while the other estimates fine-grained acoustic attributes. These inferred representations are then fused with powerful speech generation models and in-context voice cloning to synthesize natural and intelligible utterances. Extensive experiments on EEG and MEG demonstrate that MindVoice substantially outperforms existing methods on various metrics. These results show that pretrained priors provide a principled way to bridge the gap between noisy neural recordings and natural speech, highlighting a promising attempt for auditory neuroscience research and non-invasive speech brain-computer interfaces.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces MindVoice, a neuro-to-speech reconstruction framework that disentangles the task into two complementary pathways—one recovering high-level semantic content from neural signals and the other estimating fine-grained acoustic attributes—then fuses these with pretrained speech generation models and in-context voice cloning to produce intelligible utterances from noisy, non-invasive EEG/MEG recordings. It claims this substantially outperforms existing direct-mapping methods on various metrics and shows that pretrained priors provide a principled bridge to natural speech.

Significance. If the empirical results hold after proper validation, the work would be significant for non-invasive speech BCIs and auditory neuroscience. The disentangled pathways plus fusion with external generative priors is a clear methodological step beyond direct neural-to-representation mapping, and the approach avoids circularity by relying on independent pretrained models.

major comments (2)
  1. [Abstract] Abstract: the central claim of substantial outperformance on EEG/MEG is stated without any quantitative metrics, error bars, dataset sizes, or statistical tests. This absence prevents verification of whether the reported gains are load-bearing or affected by post-hoc choices.
  2. [Methods/Results] Methods/Results (pathway fusion description): no ablation is reported that isolates the contribution of the neural-derived semantic and acoustic representations versus the pretrained speech generators alone. Without this test, the claim that neural signals meaningfully drive intelligible output (rather than the generators synthesizing from weak cues) remains unverified and is load-bearing for the central thesis.
minor comments (1)
  1. [Methods] Notation for the two pathways and their fusion step could be formalized with equations to improve reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment point by point below, agreeing where revisions are warranted to strengthen the manuscript.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the central claim of substantial outperformance on EEG/MEG is stated without any quantitative metrics, error bars, dataset sizes, or statistical tests. This absence prevents verification of whether the reported gains are load-bearing or affected by post-hoc choices.

    Authors: We agree that the abstract would be strengthened by including key quantitative details. In the revised version, we will incorporate specific metrics (e.g., WER improvements, dataset sizes, and statistical tests) drawn from the results section to allow immediate assessment of the claims. revision: yes

  2. Referee: [Methods/Results] Methods/Results (pathway fusion description): no ablation is reported that isolates the contribution of the neural-derived semantic and acoustic representations versus the pretrained speech generators alone. Without this test, the claim that neural signals meaningfully drive intelligible output (rather than the generators synthesizing from weak cues) remains unverified and is load-bearing for the central thesis.

    Authors: This point is well-taken and directly addresses a load-bearing aspect of the central thesis. We will add a dedicated ablation experiment in the revised manuscript that compares the full pipeline against variants using only the pretrained generators (with neural inputs replaced by noise or null signals) to quantify the specific contribution of the neural-derived representations. revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation relies on external pretrained models

full rationale

The paper presents MindVoice as a framework that disentangles neural-to-representation pathways and fuses outputs with independent pretrained speech generation models and voice cloning. No equations, fitted parameters, or self-citations are shown that reduce any claimed prediction or result to the input neural data by construction. The pretrained priors are described as external and independent of the EEG/MEG recordings, satisfying the condition for self-contained derivation against external benchmarks. The skeptic concern about ablation is a question of empirical validation, not circularity in the derivation chain.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the domain assumption that pretrained speech and language models contain sufficient semantic and acoustic knowledge to compensate for information lost in non-invasive neural recordings. No free parameters or invented entities are named in the abstract.

axioms (1)
  • domain assumption Pretrained models can recover high-level semantic content and fine-grained acoustic attributes from incomplete neural signals.
    This premise is required for the two-pathway design to succeed and is invoked when the abstract states that the pathways 'recover' and 'estimate' the missing information.

pith-pipeline@v0.9.1-grok · 5737 in / 1440 out tokens · 23321 ms · 2026-06-28T21:06:32.174568+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

74 extracted references · 15 canonical work pages · 5 internal anchors

  1. [1]

    The cortical organization of speech processing.Nature reviews neuroscience, 8(5):393–402, 2007

    Gregory Hickok and David Poeppel. The cortical organization of speech processing.Nature reviews neuroscience, 8(5):393–402, 2007

  2. [2]

    Maps and streams in the auditory cortex: nonhuman primates illuminate human speech processing.Nature neuroscience, 12(6):718–724, 2009

    Josef P Rauschecker and Sophie K Scott. Maps and streams in the auditory cortex: nonhuman primates illuminate human speech processing.Nature neuroscience, 12(6):718–724, 2009

  3. [3]

    Hierarchical processing in spoken language comprehension

    Matthew H Davis and Ingrid S Johnsrude. Hierarchical processing in spoken language comprehension. Journal of Neuroscience, 23(8):3423–3431, 2003

  4. [4]

    Phonetic feature encoding in human superior temporal gyrus.Science, 343(6174):1006–1010, 2014

    Nima Mesgarani, Connie Cheung, Keith Johnson, and Edward F Chang. Phonetic feature encoding in human superior temporal gyrus.Science, 343(6174):1006–1010, 2014

  5. [5]

    An instantaneous voice-synthesis neuroprosthesis.Nature, 644(8075):145–152, 2025

    Maitreyee Wairagkar, Nicholas S Card, Tyler Singer-Clark, Xianda Hou, Carrina Iacobacci, Lee M Miller, Leigh R Hochberg, David M Brandman, and Sergey D Stavisky. An instantaneous voice-synthesis neuroprosthesis.Nature, 644(8075):145–152, 2025

  6. [6]

    Speech synthesis from neural decoding of spoken sentences.Nature, 568(7753):493–498, 2019

    Gopala K Anumanchipalli, Josh Chartier, and Edward F Chang. Speech synthesis from neural decoding of spoken sentences.Nature, 568(7753):493–498, 2019

  7. [7]

    Inner speech in motor cortex and implications for speech neuroprostheses.Cell, 188(17):4658–4673, 2025

    Erin M Kunz, Benyamin Abramovich Krasa, Foram Kamdar, Donald T Avansino, Nick Hahn, Seonghyun Yoon, Akansha Singh, Samuel R Nason-Tomaszewski, Nicholas S Card, Justin J Jude, et al. Inner speech in motor cortex and implications for speech neuroprostheses.Cell, 188(17):4658–4673, 2025

  8. [8]

    A streaming brain- to-voice neuroprosthesis to restore naturalistic communication.Nature neuroscience, 28(4):902–912, 2025

    Kaylo T Littlejohn, Cheol Jun Cho, Jessie R Liu, Alexander B Silva, Bohan Yu, Vanessa R Anderson, Cady M Kurtz-Miott, Samantha Brosler, Anshul P Kashyap, Irina P Hallinan, et al. A streaming brain- to-voice neuroprosthesis to restore naturalistic communication.Nature neuroscience, 28(4):902–912, 2025

  9. [9]

    A high-performance neuroprosthesis for speech decoding and avatar control.Nature, 620(7976):1037–1046, 2023

    Sean L Metzger, Kaylo T Littlejohn, Alexander B Silva, David A Moses, Margaret P Seaton, Ran Wang, Maximilian E Dougherty, Jessie R Liu, Peter Wu, Michael A Berger, et al. A high-performance neuroprosthesis for speech decoding and avatar control.Nature, 620(7976):1037–1046, 2023

  10. [10]

    Neural responses to uninterrupted natural speech can be extracted with precise temporal resolution.European journal of neuroscience, 31(1):189–193, 2010

    Edmund C Lalor and John J Foxe. Neural responses to uninterrupted natural speech can be extracted with precise temporal resolution.European journal of neuroscience, 31(1):189–193, 2010

  11. [11]

    Comparing the potential of meg and eeg to uncover brain tracking of speech temporal envelope.Neuroimage, 184:201–213, 2019

    Florian Destoky, Morgane Philippe, Julie Bertels, Marie Verhasselt, Nicolas Coquelet, Marc Vander Ghinst, Vincent Wens, Xavier De Tiège, and Mathieu Bourguignon. Comparing the potential of meg and eeg to uncover brain tracking of speech temporal envelope.Neuroimage, 184:201–213, 2019

  12. [12]

    Attentional selection in a cocktail party environment can be decoded from single-trial eeg.Cerebral cortex, 25(7):1697–1706, 2015

    James A O’sullivan, Alan J Power, Nima Mesgarani, Siddharth Rajaram, John J Foxe, Barbara G Shinn- Cunningham, Malcolm Slaney, Shihab A Shamma, and Edmund C Lalor. Attentional selection in a cocktail party environment can be decoded from single-trial eeg.Cerebral cortex, 25(7):1697–1706, 2015. 10

  13. [13]

    Decoding selective auditory attention with eeg using a transformer model.Methods, 204:410–417, 2022

    Zihao Xu, Yanru Bai, Ran Zhao, Hongmei Hu, Guangjian Ni, and Dong Ming. Decoding selective auditory attention with eeg using a transformer model.Methods, 204:410–417, 2022

  14. [14]

    Decoding of the speech envelope from eeg using the vlaai deep neural network.Scientific Reports, 13(1):812, 2023

    Bernd Accou, Jonas Vanthornhout, Hugo Van hamme, and Tom Francart. Decoding of the speech envelope from eeg using the vlaai deep neural network.Scientific Reports, 13(1):812, 2023

  15. [15]

    sound of silence

    Rini A Sharon and Hema A Murthy. The “sound of silence” in eeg—cognitive voice activity detection. In Proceedings of Interspeech, pages 2767–2771, 2020

  16. [16]

    Libribrain: Over 50 hours of within-subject meg to improve speech decoding methods at scale.arXiv preprint arXiv:2506.02098, 2025

    Miran Özdogan, Gilad Landau, Gereon Elvers, Dulhan Jayalath, Pratik Somaiya, Francesco Mantegna, Mark Woolrich, and Oiwi Parker Jones. Libribrain: Over 50 hours of within-subject meg to improve speech decoding methods at scale.arXiv preprint arXiv:2506.02098, 2025

  17. [17]

    The 2025 pnpl competition: Speech detection and phoneme classification in the libribrain dataset.arXiv preprint arXiv:2506.10165, 2025

    Gilad Landau, Miran Özdogan, Gereon Elvers, Francesco Mantegna, Pratik Somaiya, Dulhan Jayalath, Luisa Kurth, Teyun Kwon, Brendan Shillingford, Greg Farquhar, et al. The 2025 pnpl competition: Speech detection and phoneme classification in the libribrain dataset.arXiv preprint arXiv:2506.10165, 2025

  18. [18]

    Megconformer: Conformer-based meg decoder for robust speech and phoneme classification.arXiv preprint arXiv:2512.01443, 2025

    Xabier de Zuazo, Ibon Saratxaga, and Eva Navas. Megconformer: Conformer-based meg decoder for robust speech and phoneme classification.arXiv preprint arXiv:2512.01443, 2025

  19. [19]

    Shine: Sequential hierarchical integration network for eeg and meg.arXiv preprint arXiv:2602.23960, 2026

    Xiran Xu, Yujie Yan, Xihong Wu, and Jing Chen. Shine: Sequential hierarchical integration network for eeg and meg.arXiv preprint arXiv:2602.23960, 2026

  20. [20]

    Decoding speech perception from non-invasive brain recordings.Nature Machine Intelligence, 5(10):1097–1107, 2023

    Alexandre Défossez, Charlotte Caucheteux, Jérémy Rapin, Ori Kabeli, and Jean-Rémi King. Decoding speech perception from non-invasive brain recordings.Nature Machine Intelligence, 5(10):1097–1107, 2023

  21. [21]

    Cross-attention-guided wavenet for eeg-to-mel spectrogram reconstruction

    Hao Li, Yuan Fang, Xueliang Zhang, Fei Chen, and Guanglai Gao. Cross-attention-guided wavenet for eeg-to-mel spectrogram reconstruction. InProceedings of Interspeech, 2024

  22. [22]

    Towards decoding individual words from non-invasive brain recordings.Nature Communications, 16(1):10521, 2025

    Stéphane d’Ascoli, Corentin Bel, Jérémy Rapin, Hubert Banville, Yohann Benchetrit, Christophe Pallier, and Jean-Rémi King. Towards decoding individual words from non-invasive brain recordings.Nature Communications, 16(1):10521, 2025

  23. [23]

    Decoding the temporal dynamics of spoken word and nonword processing from eeg

    Bob McMurray, McCall E Sarrett, Samantha Chiu, Alexis K Black, Alice Wang, Rebecca Canale, and Richard N Aslin. Decoding the temporal dynamics of spoken word and nonword processing from eeg. NeuroImage, 260:119457, 2022

  24. [24]

    Toward fully-end-to-end listened speech decoding from eeg signals

    Jihwan Lee, Aditya Kommineni, Tiantian Feng, Kleanthis Avramidis, Xuan Shi, Sudarsana Reddy Kadiri, and Shrikanth Narayanan. Toward fully-end-to-end listened speech decoding from eeg signals. In Interspeech, pages 1500–1504, 2024

  25. [25]

    En- hancing listened speech decoding from eeg via parallel phoneme sequence prediction

    Jihwan Lee, Tiantian Feng, Aditya Kommineni, Sudarsana Reddy Kadiri, and Shrikanth Narayanan. En- hancing listened speech decoding from eeg via parallel phoneme sequence prediction. InIEEE International Conference on Acoustics, Speech and Signal Processing, pages 1–5. IEEE, 2025

  26. [26]

    Robust speech recognition via large-scale weak supervision

    Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, and Ilya Sutskever. Robust speech recognition via large-scale weak supervision. InInternational conference on machine learning, pages 28492–28518. PMLR, 2023

  27. [27]

    Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova

    Loïc Barrault, Yu-An Chung, Mariano Cora Meglioli, David Dale, Ning Dong, Paul-Ambroise Duquenne, Hady Elsahar, Hongyu Gong, Kevin Heffernan, John Hoffman, et al. Seamlessm4t: Massively multilingual & multimodal machine translation.arXiv preprint arXiv:2308.11596, 2023

  28. [28]

    Available: https://arxiv.org/abs/2312.05187

    Loïc Barrault, Yu-An Chung, Mariano Coria Meglioli, David Dale, Ning Dong, Mark Duppenthaler, Paul-Ambroise Duquenne, Brian Ellis, Hady Elsahar, Justin Haaheim, et al. Seamless: Multilingual expressive and streaming speech translation.arXiv preprint arXiv:2312.05187, 2023

  29. [29]

    Funasr: A fundamental end-to-end speech recognition toolkit

    Zhifu Gao, Zerui Li, Jiaming Wang, Haoneng Luo, Xian Shi, Mengzhe Chen, Yabin Li, Lingyun Zuo, Zhihao Du, and Shiliang Zhang. Funasr: A fundamental end-to-end speech recognition toolkit. In Proceedings of Interspeech, pages 1593–1597, 2023

  30. [30]

    Qwen3-ASR Technical Report

    Xian Shi, Xiong Wang, Zhifang Guo, Yongqi Wang, Pei Zhang, Xinyu Zhang, Zishan Guo, Hongkun Hao, Yu Xi, Baosong Yang, et al. Qwen3-asr technical report.arXiv preprint arXiv:2601.21337, 2026

  31. [31]

    CosyVoice 2: Scalable Streaming Speech Synthesis with Large Language Models

    Zhihao Du, Yuxuan Wang, Qian Chen, Xian Shi, Xiang Lv, Tianyu Zhao, Zhifu Gao, Yexin Yang, Changfeng Gao, Hui Wang, et al. Cosyvoice 2: Scalable streaming speech synthesis with large language models.arXiv preprint arXiv:2412.10117, 2024. 11

  32. [32]

    F5-tts: A fairytaler that fakes fluent and faithful speech with flow matching

    Yushen Chen, Zhikang Niu, Ziyang Ma, Keqi Deng, Chunhui Wang, JianZhao JianZhao, Kai Yu, and Xie Chen. F5-tts: A fairytaler that fakes fluent and faithful speech with flow matching. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics, pages 6255–6271, 2025

  33. [33]

    Fish-speech: Leveraging large language models for advanced multilingual text-to-speech synthesis.arXiv preprint arXiv:2411.01156, 2024

    Shijia Liao, Yuxuan Wang, Tianyu Li, Yifan Cheng, Ruoyi Zhang, Rongzhi Zhou, and Yijin Xing. Fish- speech: Leveraging large language models for advanced multilingual text-to-speech synthesis.arXiv preprint arXiv:2411.01156, 2024

  34. [34]

    Qwen3-TTS Technical Report

    Hangrui Hu, Xinfa Zhu, Ting He, Dake Guo, Bin Zhang, Xiong Wang, Zhifang Guo, Ziyue Jiang, Hongkun Hao, Zishan Guo, et al. Qwen3-tts technical report.arXiv preprint arXiv:2601.15621, 2026

  35. [35]

    Qwen3 Technical Report

    An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report.arXiv preprint arXiv:2505.09388, 2025

  36. [36]

    Deepseek-r1 incentivizes reasoning in llms through reinforcement learning

    Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, et al. Deepseek-r1 incentivizes reasoning in llms through reinforcement learning. Nature, 645(8081):633–638, 2025

  37. [37]

    The Llama 3 Herd of Models

    Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al- Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024

  38. [38]

    Brain decoding: toward real-time reconstruction of visual perception

    Yohann Benchetrit, Hubert Banville, and Jean-Remi King. Brain decoding: toward real-time reconstruction of visual perception. InThe Twelfth International Conference on Learning Representations

  39. [39]

    Seeing beyond the brain: Conditional diffusion model with sparse masked modeling for vision decoding

    Zijiao Chen, Jiaxin Qing, Tiange Xiang, Wan Lin Yue, and Juan Helen Zhou. Seeing beyond the brain: Conditional diffusion model with sparse masked modeling for vision decoding. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 22710–22720, 2023

  40. [40]

    Reconstructing the mind’s eye: fmri-to- image with contrastive learning and diffusion priors.Advances in Neural Information Processing Systems, 36:24705–24728, 2023

    Paul Scotti, Atmadeep Banerjee, Jimmie Goode, Stepan Shabalin, Alex Nguyen, Aidan Dempster, Nathalie Verlinde, Elad Yundler, David Weisberg, Kenneth Norman, et al. Reconstructing the mind’s eye: fmri-to- image with contrastive learning and diffusion priors.Advances in Neural Information Processing Systems, 36:24705–24728, 2023

  41. [41]

    Mindeye2: shared-subject models enable fmri-to-image with 1 hour of data

    Paul S Scotti, Mihir Tripathy, Cesare Kadir Torrico Villanueva, Reese Kneeland, Tong Chen, Ashutosh Narang, Charan Santhirasegaran, Jonathan Xu, Thomas Naselaris, Kenneth A Norman, et al. Mindeye2: shared-subject models enable fmri-to-image with 1 hour of data. InProceedings of the 41st International Conference on Machine Learning, pages 44038–44059, 2024

  42. [42]

    Visual decoding and reconstruction via eeg embeddings with guided diffusion

    Dongyang Li, Chen Wei, Shiying Li, Jiachen Zou, and Quanying Liu. Visual decoding and reconstruction via eeg embeddings with guided diffusion. InProceedings of the 38th International Conference on Neural Information Processing Systems, pages 102822–102864, 2024

  43. [43]

    Cinematic mindscapes: High-quality video reconstruction from brain activity.Advances in Neural Information Processing Systems, 36:24841–24858, 2023

    Zijiao Chen, Jiaxin Qing, and Juan Helen Zhou. Cinematic mindscapes: High-quality video reconstruction from brain activity.Advances in Neural Information Processing Systems, 36:24841–24858, 2023

  44. [44]

    Neuroclips: Towards high-fidelity and smooth fmri-to-video reconstruction.Advances in Neural Information Processing Systems, 37:51655–51683, 2024

    Zixuan Gong, Guangyin Bao, Qi Zhang, Zhongwei Wan, Duoqian Miao, Shoujin Wang, Lei Zhu, Chang- wei Wang, Rongtao Xu, Liang Hu, et al. Neuroclips: Towards high-fidelity and smooth fmri-to-video reconstruction.Advances in Neural Information Processing Systems, 37:51655–51683, 2024

  45. [45]

    Eeg2video: Towards decoding dynamic visual perception from eeg signals.Advances in Neural Information Processing Systems, 37:72245–72273, 2024

    Xuan-Hao Liu, Yan-Kai Liu, Yansen Wang, Kan Ren, Hanwen Shi, Zilong Wang, Dongsheng Li, Bao- Liang Lu, and Wei-Long Zheng. Eeg2video: Towards decoding dynamic visual perception from eeg signals.Advances in Neural Information Processing Systems, 37:72245–72273, 2024

  46. [46]

    Animate your thoughts: Reconstruction of dynamic natural vision from human brain activity

    Yizhuo Lu, Changde Du, Chong Wang, Xuanliu Zhu, Liuyun Jiang, Xujin Li, and Huiguang He. Animate your thoughts: Reconstruction of dynamic natural vision from human brain activity. InThe Thirteenth International Conference on Learning Representations, 2025

  47. [47]

    Reanimating images using neural representations of dynamic stimuli

    Jacob Yeung, Andrew F Luo, Gabriel Sarch, Margaret M Henderson, Deva Ramanan, and Michael J Tarr. Reanimating images using neural representations of dynamic stimuli. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5331–5343, 2025

  48. [48]

    Hierarchical structure guides rapid linguistic predictions during naturalistic listening.PloS one, 14(1):e0207741, 2019

    Jonathan R Brennan and John T Hale. Hierarchical structure guides rapid linguistic predictions during naturalistic listening.PloS one, 14(1):e0207741, 2019

  49. [49]

    Introducing meg-masc a high-quality magneto-encephalography dataset for evaluating natural speech processing.Scientific data, 10(1):862, 2023

    Laura Gwilliams, Graham Flick, Alec Marantz, Liina Pylkkänen, David Poeppel, and Jean-Rémi King. Introducing meg-masc a high-quality magneto-encephalography dataset for evaluating natural speech processing.Scientific data, 10(1):862, 2023. 12

  50. [50]

    A high-performance speech neuroprosthesis.Nature, 620(7976):1031–1036, 2023

    Francis R Willett, Erin M Kunz, Chaofei Fan, Donald T Avansino, Guy H Wilson, Eun Young Choi, Foram Kamdar, Matthew F Glasser, Leigh R Hochberg, Shaul Druckmann, et al. A high-performance speech neuroprosthesis.Nature, 620(7976):1031–1036, 2023

  51. [51]

    Decoding intended speech with an intracortical brain-computer interface in a person with long-standing anarthria and locked-in syndrome.Cell Reports, 45(4), 2026

    Justin J Jude, Stephanie Haro, Hadar Levi-Aharoni, Hiroaki Hashimoto, Alexander J Acosta, Nicholas S Card, Maitreyee Wairagkar, David M Brandman, Sergey D Stavisky, Ziv M Williams, et al. Decoding intended speech with an intracortical brain-computer interface in a person with long-standing anarthria and locked-in syndrome.Cell Reports, 45(4), 2026

  52. [52]

    Towards voice reconstruction from eeg during imagined speech

    Young-Eun Lee, Seo-Hyun Lee, Sang-Ho Kim, and Seong-Whan Lee. Towards voice reconstruction from eeg during imagined speech. InProceedings of the AAAI conference on artificial intelligence, volume 37, pages 6030–6038, 2023

  53. [53]

    Imagined speech classification using eeg and deep learning.Bioengineering, 10(6):649, 2023

    Mokhles M Abdulghani, Wilbur L Walters, and Khalid H Abed. Imagined speech classification using eeg and deep learning.Bioengineering, 10(6):649, 2023

  54. [54]

    Scaling law in neural data: Non-invasive speech decoding with 175 hours of eeg data.arXiv preprint arXiv:2407.07595, 2024

    Motoshige Sato, Kenichi Tomeoka, Ilya Horiguchi, Kai Arulkumaran, Ryota Kanai, and Shuntaro Sasai. Scaling law in neural data: Non-invasive speech decoding with 175 hours of eeg data.arXiv preprint arXiv:2407.07595, 2024

  55. [55]

    Mad: Multi-alignment meg-to-text decoding.arXiv preprint arXiv:2406.01512, 2024

    Yiqian Yang, Hyejeong Jo, Yiqun Duan, Qiang Zhang, Jinni Zhou, Xuming Hu, Won Hee Lee, Renjing Xu, and Hui Xiong. Mad: Multi-alignment meg-to-text decoding.arXiv preprint arXiv:2406.01512, 2024

  56. [56]

    Brainecho: Semantic brain signal decoding through vector-quantized spectrogram reconstruction for whisper-enhanced text generation

    Jilong Li, Zhenxi Song, Jiaqi Wang, Meishan Zhang, Honghai Liu, Min Zhang, and Zhiguo Zhang. Brainecho: Semantic brain signal decoding through vector-quantized spectrogram reconstruction for whisper-enhanced text generation. InFindings of the Association for Computational Linguistics: ACL 2025, pages 2762–2778, 2025

  57. [57]

    Neuspeech: Decode neural signal as speech

    Yiqian Yang, Yiqun Duan, Qiang Zhang, Hyejeong Jo, Jinni Zhou, Won Hee Lee, Renjing Xu, and Hui Xiong. Neuspeech: Decode neural signal as speech. InIEEE International Conference on Acoustics, Speech and Signal Processing, pages 6636–6640. IEEE, 2026

  58. [58]

    Towards sentence level imagined speech generation from eeg signals

    Sparsh Rastogi, Harsh Dadwal, Khushboo Modi, Jatin Bedi, and Jasmeet Singh. Towards sentence level imagined speech generation from eeg signals. InProc. Interspeech 2025, pages 5558–5562, 2025

  59. [59]

    Are eeg-to-text models working?arXiv preprint arXiv:2405.06459, 2024

    Hyejeong Jo, Yiqian Yang, Juhyeok Han, Yiqun Duan, Hui Xiong, and Won Hee Lee. Are eeg-to-text models working?arXiv preprint arXiv:2405.06459, 2024

  60. [60]

    Deep speech 2: End-to-end speech recognition in english and mandarin

    Dario Amodei, Sundaram Ananthanarayanan, Rishita Anubhai, Jingliang Bai, Eric Battenberg, Carl Case, Jared Casper, Bryan Catanzaro, Qiang Cheng, Guoliang Chen, et al. Deep speech 2: End-to-end speech recognition in english and mandarin. InInternational conference on machine learning, pages 173–182. PMLR, 2016

  61. [61]

    Conformer: Convolution-augmented transformer for speech recognition

    Anmol Gulati, James Qin, Chung-Cheng Chiu, Niki Parmar, Yu Zhang, Jiahui Yu, Wei Han, Shibo Wang, Zhengdong Zhang, Yonghui Wu, et al. Conformer: Convolution-augmented transformer for speech recognition. InProc. Interspeech 2020, pages 5036–5040, 2020

  62. [62]

    wav2vec 2.0: A framework for self-supervised learning of speech representations.Advances in neural information processing systems, 33:12449–12460, 2020

    Alexei Baevski, Yuhao Zhou, Abdelrahman Mohamed, and Michael Auli. wav2vec 2.0: A framework for self-supervised learning of speech representations.Advances in neural information processing systems, 33:12449–12460, 2020

  63. [63]

    Deep voice: Real-time neural text-to-speech

    Sercan Ö Arık, Mike Chrzanowski, Adam Coates, Gregory Diamos, Andrew Gibiansky, Yongguo Kang, Xian Li, John Miller, Andrew Ng, Jonathan Raiman, et al. Deep voice: Real-time neural text-to-speech. In International conference on machine learning, pages 195–204. PMLR, 2017

  64. [64]

    Fastspeech: Fast, robust and controllable text to speech.Advances in neural information processing systems, 32, 2019

    Yi Ren, Yangjun Ruan, Xu Tan, Tao Qin, Sheng Zhao, Zhou Zhao, and Tie-Yan Liu. Fastspeech: Fast, robust and controllable text to speech.Advances in neural information processing systems, 32, 2019

  65. [65]

    Fastspeech 2: Fast and high-quality end-to-end text to speech

    Yi Ren, Chenxu Hu, Xu Tan, Tao Qin, Sheng Zhao, Zhou Zhao, and Tie-Yan Liu. Fastspeech 2: Fast and high-quality end-to-end text to speech. InInternational Conference on Learning Representations

  66. [66]

    Glow-tts: A generative flow for text-to- speech via monotonic alignment search.Advances in Neural Information Processing Systems, 33:8067– 8077, 2020

    Jaehyeon Kim, Sungwon Kim, Jungil Kong, and Sungroh Yoon. Glow-tts: A generative flow for text-to- speech via monotonic alignment search.Advances in Neural Information Processing Systems, 33:8067– 8077, 2020

  67. [67]

    Conditional variational autoencoder with adversarial learning for end-to-end text-to-speech

    Jaehyeon Kim, Jungil Kong, and Juhee Son. Conditional variational autoencoder with adversarial learning for end-to-end text-to-speech. InInternational conference on machine learning, pages 5530–5540. PMLR, 2021. 13

  68. [68]

    Bigvgan: A universal neural vocoder with large-scale training

    Sang-gil Lee, Wei Ping, Boris Ginsburg, Bryan Catanzaro, and Sungroh Yoon. Bigvgan: A universal neural vocoder with large-scale training. InThe Eleventh International Conference on Learning Representations, 2023

  69. [69]

    Hifi-gan: Generative adversarial networks for efficient and high fidelity speech synthesis.Advances in neural information processing systems, 33:17022–17033, 2020

    Jungil Kong, Jaehyeon Kim, and Jaekyoung Bae. Hifi-gan: Generative adversarial networks for efficient and high fidelity speech synthesis.Advances in neural information processing systems, 33:17022–17033, 2020

  70. [70]

    Mel-cepstral distance measure for objective speech quality assessment

    Robert Kubichek. Mel-cepstral distance measure for objective speech quality assessment. InProceedings of IEEE pacific rim conference on communications computers and signal processing, volume 1, pages 125–128. IEEE, 1993

  71. [71]

    Hubert: Self-supervised speech representation learning by masked prediction of hidden units.IEEE/ACM transactions on audio, speech, and language processing, 29:3451–3460, 2021

    Wei-Ning Hsu, Benjamin Bolte, Yao-Hung Hubert Tsai, Kushal Lakhotia, Ruslan Salakhutdinov, and Abdelrahman Mohamed. Hubert: Self-supervised speech representation learning by masked prediction of hidden units.IEEE/ACM transactions on audio, speech, and language processing, 29:3451–3460, 2021

  72. [72]

    Bertscore: Evaluating text generation with bert

    Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q Weinberger, and Yoav Artzi. Bertscore: Evaluating text generation with bert. InInternational Conference on Learning Representations, 2020

  73. [73]

    Wavlm: Large-scale self-supervised pre-training for full stack speech processing.IEEE Journal of Selected Topics in Signal Processing, 16(6):1505–1518, 2022

    Sanyuan Chen, Chengyi Wang, Zhengyang Chen, Yu Wu, Shujie Liu, Zhuo Chen, Jinyu Li, Naoyuki Kanda, Takuya Yoshioka, Xiong Xiao, et al. Wavlm: Large-scale self-supervised pre-training for full stack speech processing.IEEE Journal of Selected Topics in Signal Processing, 16(6):1505–1518, 2022

  74. [74]

    Utmos: Utokyo-sarulab system for voicemos challenge 2022.Interspeech, 2022

    Takaaki Saeki, Detai Xin, Wataru Nakata, Tomoki Koriyama, Shinnosuke Takamichi, and Hiroshi Saruwatari. Utmos: Utokyo-sarulab system for voicemos challenge 2022.Interspeech, 2022. 14 A Additional Experiment Settings A.1 Datasets and Preprocessing Brennan EEG dataset.Brennan EEG dataset is a naturalistic listening EEG dataset introduced by Brennan and Hale...