MindVoice: Reconstructing Intelligible Speech from Non-invasive Neural Signals with Pretrained Priors

Guangyin Bao; Jianfeng Feng; Taiping Zeng; Xiangyang Xue

arxiv: 2605.31173 · v1 · pith:YEETSPKVnew · submitted 2026-05-29 · 💻 cs.SD · cs.AI


Pith reviewed 2026-06-28 21:06 UTC · model grok-4.3

classification 💻 cs.SD cs.AI

keywords: speech reconstruction · EEG · MEG · pretrained models · brain-computer interface · neural signals · semantic recovery · acoustic estimation

The pith

MindVoice reconstructs intelligible speech from EEG and MEG by recovering semantic content and acoustic attributes separately with pretrained priors before fusing them into generated waveforms.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that existing direct-mapping approaches produce spectrally similar but unintelligible speech because they cannot handle the incomplete and noisy information in non-invasive recordings. It proposes splitting the task into two pathways, one for high-level semantics and one for fine-grained acoustics, then combining the outputs with pretrained speech generation models and voice cloning to yield natural, intelligible results. A sympathetic reader would care because this offers a route to practical non-invasive speech brain-computer interfaces and better tools for studying auditory perception. The abstract reports that experiments on EEG and MEG data show substantial gains over prior methods, though it does not quantify them. The central mechanism is the use of pretrained priors to fill gaps that raw neural signals leave open.

Core claim

MindVoice disentangles reconstruction into two complementary pathways: one recovers high-level semantic content while the other estimates fine-grained acoustic attributes. These inferred representations are fused with powerful speech generation models and in-context voice cloning to synthesize natural and intelligible utterances. Extensive experiments on EEG and MEG demonstrate that MindVoice substantially outperforms existing methods on various metrics, showing that pretrained priors provide a principled way to bridge the gap between noisy neural recordings and natural speech.

What carries the argument

Dual complementary pathways that separately recover semantic content and acoustic attributes from neural signals using pretrained models, then fuse the results with speech generation models.
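
The dataflow described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: every name here (`reconstruct`, `AcousticAttributes`, `Reconstruction`, the `toy_*` stand-ins, the 100 Hz sample rate) is hypothetical, and the toy functions decode nothing; they exist only so the wiring runs end to end.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

# All names below are hypothetical; the source does not specify any interfaces.
NeuralSignal = Sequence[Sequence[float]]  # time samples x channels


@dataclass
class AcousticAttributes:
    # Placeholder fields standing in for fine-grained acoustic estimates.
    mean_energy: float
    duration_s: float


@dataclass
class Reconstruction:
    text_tokens: List[str]
    acoustics: AcousticAttributes
    waveform: List[float]


def reconstruct(
    signal: NeuralSignal,
    semantic_pathway: Callable[[NeuralSignal], List[str]],
    acoustic_pathway: Callable[[NeuralSignal], AcousticAttributes],
    generator: Callable[[List[str], AcousticAttributes], List[float]],
) -> Reconstruction:
    """Run both pathways on the same recording, then fuse them in the generator."""
    tokens = semantic_pathway(signal)        # high-level semantic content
    acoustics = acoustic_pathway(signal)     # fine-grained acoustic attributes
    waveform = generator(tokens, acoustics)  # pretrained speech generation step
    return Reconstruction(tokens, acoustics, waveform)


# Toy stand-ins so the sketch runs; they do not decode anything real.
def toy_semantic(signal: NeuralSignal) -> List[str]:
    return ["hello", "world"]


def toy_acoustic(signal: NeuralSignal, sample_rate_hz: float = 100.0) -> AcousticAttributes:
    count = sum(len(row) for row in signal)
    total = sum(x * x for row in signal for x in row)
    energy = total / count if count else 0.0
    return AcousticAttributes(mean_energy=energy, duration_s=len(signal) / sample_rate_hz)


def toy_generator(tokens: List[str], acoustics: AcousticAttributes) -> List[float]:
    # One silent "sample" per token, purely to show the fused call signature.
    return [0.0 for _ in tokens]


demo = reconstruct([[0.5, -0.5], [1.0, 0.0]], toy_semantic, toy_acoustic, toy_generator)
```

Passing the pathways and the generator as callables keeps the two estimates separate until the final step, which is the property the summary attributes to the framework.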

If this is right

MindVoice substantially outperforms existing methods on various metrics when tested on EEG and MEG recordings.
Pretrained priors supply the missing semantic and acoustic information that direct neural-to-speech mappings lack.
The resulting outputs are natural and intelligible utterances rather than spectral-similar but unintelligible waveforms.
The framework offers a route to scalable non-invasive speech brain-computer interfaces.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same separation of semantic and acoustic recovery could be tested on other non-invasive signals such as fNIRS to check whether the performance gain generalizes.
If the fusion step can be made faster, the approach might support real-time speech synthesis from ongoing brain recordings.
The method supplies an indirect way to measure how much semantic versus acoustic detail survives in different neural recording modalities.

Load-bearing premise

Pretrained models can accurately recover high-level semantic content and fine-grained acoustic attributes from the incomplete and noisy information present in non-invasive neural recordings.

What would settle it

An ablation test in which removing the pretrained priors from both pathways degrades word error rate or intelligibility scores to levels indistinguishable from earlier direct-mapping baselines on the same EEG and MEG datasets.
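
For readers unfamiliar with the metric named above, word error rate can be computed as the word-level edit distance between a reference transcript and a hypothesis, divided by the reference length. The sketch below is a generic implementation; the normalization (lowercasing, whitespace splitting) and the function name `word_error_rate` are assumptions, and the paper's own evaluation protocol may differ.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


reference = "the cat sat on the mat"
perfect = word_error_rate(reference, "the cat sat on the mat")
one_substitution = word_error_rate(reference, "the cat sat on a mat")
unrelated = word_error_rate(reference, "a dog")
```

Because insertions count as errors, the value can exceed 1.0 when the hypothesis is much longer than the reference.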

Figures

Figures reproduced from arXiv: 2605.31173 by Guangyin Bao, Jianfeng Feng, Taiping Zeng, Xiangyang Xue.

**Figure 1.** Decoding of auditory-evoked non-invasive neural recordings. a, most research focuses on decoding paraspeech information, whereas our goal is to directly reconstruct continuous and intelligible speech waveforms. b, existing EEG-to-speech methods align EEG with entangled speech representations and reconstruct speech using neural vocoders. c, our framework adopts a dual-stream architecture comprising…

**Figure 2.** Overview of the proposed MindVoice framework. … embedder, a speech vector-quantized autoencoder, and a neuro-to-semantic aligner, which together generate the reconstructed text tokens T̂. Neural Signal Embedder. We treat EEG/MEG signals with temporal and channel dimensions as single-channel images, i.e., X_i ∈ ℝ^{1×t×c}. Under this formulation, cascaded CNNs progressively compress the input and learn represent…

**Figure 3.** Mel spectrogram and transcription results. a, comparison of mel spectrograms obtained by different methods with the ground truth. b, transcription results of ground-truth speech, FESDE reconstructed speech, and our reconstructed speech using Qwen3-ASR. More results can be found in Appendix B.1.

**Figure 4.** Analyses of model interpretability and preference. a, channel heatmap for semantic- and acoustic-level stream. Each pixel represents a neural signal channel. b, sentence length regression analysis between reconstructions and ground truth. c, baseline-scaled BERTScore-F1 across sentence length groups. … could preserve the main sentence structure, partial keywords, and grammatical patterns, although it may sti…

**Figure 5.** Additional qualitative results on the Brennan EEG dataset. Randomly selected.

**Figure 6.** Additional qualitative results on the Gwilliams MEG dataset. Randomly selected.

**Figure 7.** Fine-grained linguistic preference analysis of MEG-to-text reconstruction. a, ground-truth and reconstructed sentence-length distributions. b, precision and recall of discourse/relation markers. c, POS-specific precision, recall, and F1. d, Zipf frequency distributions of content words in five groups: GT, Recon, Hit, Missed, and Hallucinated. e–g, associations between BERTScore and function-word recall, ha…
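
The Figure 2 text describes treating a recording as a single-channel image of shape 1×t×c that cascaded CNNs progressively compress. The sketch below illustrates only that tensor layout and the shrinking shapes, using a plain strided 2D convolution with a fixed averaging kernel as a stand-in for learned weights. The sizes (t = 8, c = 4), the kernel, the stride, and all function names are assumptions chosen for illustration, not values from the paper.

```python
from typing import List, Tuple

Image = List[List[float]]  # shape (t, c): time samples x sensor channels


def as_single_channel_image(signal: Image) -> List[Image]:
    """Add a leading axis of size 1, giving shape (1, t, c)."""
    return [signal]


def strided_conv2d(image: Image, kernel: Image, stride: int) -> Image:
    """Valid 2D cross-correlation with the same stride along both axes."""
    kh, kw = len(kernel), len(kernel[0])
    h, w = len(image), len(image[0])
    out = []
    for i in range(0, h - kh + 1, stride):
        row = []
        for j in range(0, w - kw + 1, stride):
            row.append(sum(kernel[a][b] * image[i + a][j + b]
                           for a in range(kh) for b in range(kw)))
        out.append(row)
    return out


def shape(x) -> Tuple[int, ...]:
    """Shape of a rectangular nested list."""
    dims = []
    while isinstance(x, list):
        dims.append(len(x))
        x = x[0]
    return tuple(dims)


# Toy recording: 8 time samples, 4 channels (illustrative sizes only).
t, c = 8, 4
signal = [[float(i + j) for j in range(c)] for i in range(t)]
x = as_single_channel_image(signal)

# Two cascaded stages; the fixed 2x2 averaging kernel is a stand-in for learned filters.
avg = [[0.25, 0.25], [0.25, 0.25]]
stage1 = strided_conv2d(x[0], avg, stride=2)
stage2 = strided_conv2d(stage1, avg, stride=2)
```

Each stage halves both the time and channel axes here; a real embedder would use learned multi-filter convolutions and could compress the two axes at different rates.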

Original abstract

Reconstructing continuous speech from non-invasive neural recordings is a fundamental problem for probing human auditory perception and building safe, scalable speech brain-computer interfaces. Despite recent progress, intelligible reconstruction remains elusive, as non-invasive recordings are inherently noisy, spatially blurred, and only partially preserve information about perceived speech. Existing methods directly map neural activity to entangled speech representations before synthesizing waveforms with neural vocoders, resulting in spectral-similar but unintelligible results. To overcome these limitations, we introduce MindVoice, a neuro-to-speech reconstruction framework that uses pretrained models to compensate for the incomplete semantic and acoustic information in neural recordings. MindVoice disentangles reconstruction into two complementary pathways: one recovers high-level semantic content, while the other estimates fine-grained acoustic attributes. These inferred representations are then fused with powerful speech generation models and in-context voice cloning to synthesize natural and intelligible utterances. Extensive experiments on EEG and MEG demonstrate that MindVoice substantially outperforms existing methods on various metrics. These results show that pretrained priors provide a principled way to bridge the gap between noisy neural recordings and natural speech, highlighting a promising attempt for auditory neuroscience research and non-invasive speech brain-computer interfaces.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

MindVoice claims better intelligible speech from EEG/MEG by splitting semantic and acoustic recovery then fusing with pretrained generators, but the abstract gives no numbers or ablations so the neural contribution remains unproven.


The core idea is a two-pathway setup that pulls high-level semantics from one neural route and acoustic details from another, then combines them with off-the-shelf speech models plus in-context cloning. That framing is new enough compared with the direct-mapping baselines mentioned.

The paper does a clean job naming the core failure mode of prior work: spectral similarity without intelligibility. Using pretrained priors to fill gaps in noisy non-invasive signals is a reasonable direction and avoids some of the circularity that would come from training everything end-to-end on the same small neural datasets.

The soft spot is exactly the one in the stress-test note. The abstract asserts outperformance on various metrics but supplies no tables, error bars, dataset sizes, or ablation results that isolate what the neural pathways actually contribute versus what the generative priors synthesize on their own. Without those controls it is impossible to tell whether the neural signals are driving the output or merely providing weak conditioning that the pretrained models largely ignore.

This is aimed at the non-invasive BCI and auditory neuroscience crowd. A reader already working on EEG speech decoding would get value from the architectural sketch even if the empirical claims need verification.

I would send it to peer review so the experiments can be checked for proper ablations and statistical reporting; the idea is worth testing but the current evidence level is too low to judge yet.

Referee Report

2 major / 1 minor

Summary. The paper introduces MindVoice, a neuro-to-speech reconstruction framework that disentangles the task into two complementary pathways—one recovering high-level semantic content from neural signals and the other estimating fine-grained acoustic attributes—then fuses these with pretrained speech generation models and in-context voice cloning to produce intelligible utterances from noisy, non-invasive EEG/MEG recordings. It claims this substantially outperforms existing direct-mapping methods on various metrics and shows that pretrained priors provide a principled bridge to natural speech.

Significance. If the empirical results hold after proper validation, the work would be significant for non-invasive speech BCIs and auditory neuroscience. The disentangled pathways plus fusion with external generative priors is a clear methodological step beyond direct neural-to-representation mapping, and the approach avoids circularity by relying on independent pretrained models.

major comments (2)

[Abstract] The central claim of substantial outperformance on EEG/MEG is stated without any quantitative metrics, error bars, dataset sizes, or statistical tests. This absence prevents verification of whether the reported gains are load-bearing or affected by post-hoc choices.
[Methods/Results] In the pathway fusion description, no ablation is reported that isolates the contribution of the neural-derived semantic and acoustic representations versus the pretrained speech generators alone. Without this test, the claim that neural signals meaningfully drive intelligible output (rather than the generators synthesizing from weak cues) remains unverified and is load-bearing for the central thesis.

minor comments (1)

[Methods] Notation for the two pathways and their fusion step could be formalized with equations to improve reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment point by point below, agreeing where revisions are warranted to strengthen the manuscript.


Referee: [Abstract] The central claim of substantial outperformance on EEG/MEG is stated without any quantitative metrics, error bars, dataset sizes, or statistical tests. This absence prevents verification of whether the reported gains are load-bearing or affected by post-hoc choices.

Authors: We agree that the abstract would be strengthened by including key quantitative details. In the revised version, we will incorporate specific metrics (e.g., WER improvements, dataset sizes, and statistical tests) drawn from the results section to allow immediate assessment of the claims. Revision: yes.

Referee: [Methods/Results] In the pathway fusion description, no ablation is reported that isolates the contribution of the neural-derived semantic and acoustic representations versus the pretrained speech generators alone. Without this test, the claim that neural signals meaningfully drive intelligible output (rather than the generators synthesizing from weak cues) remains unverified and is load-bearing for the central thesis.

Authors: This point is well-taken and directly addresses a load-bearing aspect of the central thesis. We will add a dedicated ablation experiment in the revised manuscript that compares the full pipeline against variants using only the pretrained generators (with neural inputs replaced by noise or null signals) to quantify the specific contribution of the neural-derived representations. Revision: yes.
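
The control proposed in this exchange, replacing neural inputs with noise or null signals and rerunning the same pipeline, can be sketched as below. This is a hypothetical harness, not the authors' experiment: `null_control`, `noise_control`, `ablation_table`, and the toy pipeline and metric are all illustrative names, and matching the noise to the recording's overall mean and standard deviation is one assumed design choice among several.

```python
import random
import statistics
from typing import Callable, Dict, List

Signal = List[List[float]]  # time samples x channels


def null_control(signal: Signal) -> Signal:
    """All-zero input with the same shape as the recording."""
    return [[0.0 for _ in row] for row in signal]


def noise_control(signal: Signal, seed: int = 0) -> Signal:
    """Gaussian noise matched to the recording's overall mean and standard deviation."""
    values = [x for row in signal for x in row]
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)
    rng = random.Random(seed)
    return [[rng.gauss(mu, sigma) for _ in row] for row in signal]


def ablation_table(
    pipeline: Callable[[Signal], str],
    metric: Callable[[str, str], float],
    recordings: List[Signal],
    references: List[str],
    seed: int = 0,
) -> Dict[str, float]:
    """Mean metric for real, noise, and null inputs sent through the same pipeline."""
    conditions = {
        "neural": lambda s: s,
        "noise": lambda s: noise_control(s, seed),
        "null": null_control,
    }
    table = {}
    for name, transform in conditions.items():
        scores = [metric(ref, pipeline(transform(sig)))
                  for sig, ref in zip(recordings, references)]
        table[name] = statistics.fmean(scores)
    return table


# Toy pipeline and metric (hypothetical): output depends only on the input sum.
def toy_pipeline(signal: Signal) -> str:
    return "loud" if sum(x for row in signal for x in row) > 0 else "quiet"


def toy_error(reference: str, hypothesis: str) -> float:
    return 0.0 if reference == hypothesis else 1.0


recordings = [[[1.0, 2.0], [3.0, 4.0]]]
table = ablation_table(toy_pipeline, toy_error, recordings, ["loud"])
```

If the "noise" and "null" rows score close to the "neural" row on a real system, the pretrained generators rather than the recordings would be doing most of the work, which is the concern the referee raises.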

Circularity Check

0 steps flagged

No significant circularity; derivation relies on external pretrained models


The paper presents MindVoice as a framework that disentangles neural-to-representation pathways and fuses outputs with independent pretrained speech generation models and voice cloning. No equations, fitted parameters, or self-citations are shown that reduce any claimed prediction or result to the input neural data by construction. The pretrained priors are described as external and independent of the EEG/MEG recordings, satisfying the condition for self-contained derivation against external benchmarks. The skeptic's concern about ablation is a question of empirical validation, not circularity in the derivation chain.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that pretrained speech and language models contain sufficient semantic and acoustic knowledge to compensate for information lost in non-invasive neural recordings. No free parameters or invented entities are named in the abstract.

axioms (1)

Domain assumption: Pretrained models can recover high-level semantic content and fine-grained acoustic attributes from incomplete neural signals.
This premise is required for the two-pathway design to succeed and is invoked when the abstract states that the pathways 'recover' and 'estimate' the missing information.


Reference graph

Works this paper leans on

74 extracted references · 15 canonical work pages · 5 internal anchors

[1]

The cortical organization of speech processing.Nature reviews neuroscience, 8(5):393–402, 2007

Gregory Hickok and David Poeppel. The cortical organization of speech processing.Nature reviews neuroscience, 8(5):393–402, 2007

2007
[2]

Maps and streams in the auditory cortex: nonhuman primates illuminate human speech processing.Nature neuroscience, 12(6):718–724, 2009

Josef P Rauschecker and Sophie K Scott. Maps and streams in the auditory cortex: nonhuman primates illuminate human speech processing.Nature neuroscience, 12(6):718–724, 2009

2009
[3]

Hierarchical processing in spoken language comprehension

Matthew H Davis and Ingrid S Johnsrude. Hierarchical processing in spoken language comprehension. Journal of Neuroscience, 23(8):3423–3431, 2003

2003
[4]

Phonetic feature encoding in human superior temporal gyrus.Science, 343(6174):1006–1010, 2014

Nima Mesgarani, Connie Cheung, Keith Johnson, and Edward F Chang. Phonetic feature encoding in human superior temporal gyrus.Science, 343(6174):1006–1010, 2014

2014
[5]

An instantaneous voice-synthesis neuroprosthesis.Nature, 644(8075):145–152, 2025

Maitreyee Wairagkar, Nicholas S Card, Tyler Singer-Clark, Xianda Hou, Carrina Iacobacci, Lee M Miller, Leigh R Hochberg, David M Brandman, and Sergey D Stavisky. An instantaneous voice-synthesis neuroprosthesis.Nature, 644(8075):145–152, 2025

2025
[6]

Speech synthesis from neural decoding of spoken sentences.Nature, 568(7753):493–498, 2019

Gopala K Anumanchipalli, Josh Chartier, and Edward F Chang. Speech synthesis from neural decoding of spoken sentences.Nature, 568(7753):493–498, 2019

2019
[7]

Inner speech in motor cortex and implications for speech neuroprostheses.Cell, 188(17):4658–4673, 2025

Erin M Kunz, Benyamin Abramovich Krasa, Foram Kamdar, Donald T Avansino, Nick Hahn, Seonghyun Yoon, Akansha Singh, Samuel R Nason-Tomaszewski, Nicholas S Card, Justin J Jude, et al. Inner speech in motor cortex and implications for speech neuroprostheses.Cell, 188(17):4658–4673, 2025

2025
[8]

A streaming brain- to-voice neuroprosthesis to restore naturalistic communication.Nature neuroscience, 28(4):902–912, 2025

Kaylo T Littlejohn, Cheol Jun Cho, Jessie R Liu, Alexander B Silva, Bohan Yu, Vanessa R Anderson, Cady M Kurtz-Miott, Samantha Brosler, Anshul P Kashyap, Irina P Hallinan, et al. A streaming brain- to-voice neuroprosthesis to restore naturalistic communication.Nature neuroscience, 28(4):902–912, 2025

2025
[9]

A high-performance neuroprosthesis for speech decoding and avatar control.Nature, 620(7976):1037–1046, 2023

Sean L Metzger, Kaylo T Littlejohn, Alexander B Silva, David A Moses, Margaret P Seaton, Ran Wang, Maximilian E Dougherty, Jessie R Liu, Peter Wu, Michael A Berger, et al. A high-performance neuroprosthesis for speech decoding and avatar control.Nature, 620(7976):1037–1046, 2023

2023
[10]

Neural responses to uninterrupted natural speech can be extracted with precise temporal resolution.European journal of neuroscience, 31(1):189–193, 2010

Edmund C Lalor and John J Foxe. Neural responses to uninterrupted natural speech can be extracted with precise temporal resolution.European journal of neuroscience, 31(1):189–193, 2010

2010
[11]

Comparing the potential of meg and eeg to uncover brain tracking of speech temporal envelope.Neuroimage, 184:201–213, 2019

Florian Destoky, Morgane Philippe, Julie Bertels, Marie Verhasselt, Nicolas Coquelet, Marc Vander Ghinst, Vincent Wens, Xavier De Tiège, and Mathieu Bourguignon. Comparing the potential of meg and eeg to uncover brain tracking of speech temporal envelope.Neuroimage, 184:201–213, 2019

2019
[12]

Attentional selection in a cocktail party environment can be decoded from single-trial eeg.Cerebral cortex, 25(7):1697–1706, 2015

James A O’sullivan, Alan J Power, Nima Mesgarani, Siddharth Rajaram, John J Foxe, Barbara G Shinn- Cunningham, Malcolm Slaney, Shihab A Shamma, and Edmund C Lalor. Attentional selection in a cocktail party environment can be decoded from single-trial eeg.Cerebral cortex, 25(7):1697–1706, 2015. 10

2015
[13]

Decoding selective auditory attention with eeg using a transformer model.Methods, 204:410–417, 2022

Zihao Xu, Yanru Bai, Ran Zhao, Hongmei Hu, Guangjian Ni, and Dong Ming. Decoding selective auditory attention with eeg using a transformer model.Methods, 204:410–417, 2022

2022
[14]

Decoding of the speech envelope from eeg using the vlaai deep neural network.Scientific Reports, 13(1):812, 2023

Bernd Accou, Jonas Vanthornhout, Hugo Van hamme, and Tom Francart. Decoding of the speech envelope from eeg using the vlaai deep neural network.Scientific Reports, 13(1):812, 2023

2023
[15]

sound of silence

Rini A Sharon and Hema A Murthy. The “sound of silence” in eeg—cognitive voice activity detection. In Proceedings of Interspeech, pages 2767–2771, 2020

2020
[16]

Libribrain: Over 50 hours of within-subject meg to improve speech decoding methods at scale.arXiv preprint arXiv:2506.02098, 2025

Miran Özdogan, Gilad Landau, Gereon Elvers, Dulhan Jayalath, Pratik Somaiya, Francesco Mantegna, Mark Woolrich, and Oiwi Parker Jones. Libribrain: Over 50 hours of within-subject meg to improve speech decoding methods at scale.arXiv preprint arXiv:2506.02098, 2025

work page arXiv 2025
[17]

The 2025 pnpl competition: Speech detection and phoneme classification in the libribrain dataset.arXiv preprint arXiv:2506.10165, 2025

Gilad Landau, Miran Özdogan, Gereon Elvers, Francesco Mantegna, Pratik Somaiya, Dulhan Jayalath, Luisa Kurth, Teyun Kwon, Brendan Shillingford, Greg Farquhar, et al. The 2025 pnpl competition: Speech detection and phoneme classification in the libribrain dataset.arXiv preprint arXiv:2506.10165, 2025

work page arXiv 2025
[18]

Megconformer: Conformer-based meg decoder for robust speech and phoneme classification.arXiv preprint arXiv:2512.01443, 2025

Xabier de Zuazo, Ibon Saratxaga, and Eva Navas. Megconformer: Conformer-based meg decoder for robust speech and phoneme classification.arXiv preprint arXiv:2512.01443, 2025

work page arXiv 2025
[19]

Shine: Sequential hierarchical integration network for eeg and meg.arXiv preprint arXiv:2602.23960, 2026

Xiran Xu, Yujie Yan, Xihong Wu, and Jing Chen. Shine: Sequential hierarchical integration network for eeg and meg.arXiv preprint arXiv:2602.23960, 2026

work page arXiv 2026
[20]

Decoding speech perception from non-invasive brain recordings.Nature Machine Intelligence, 5(10):1097–1107, 2023

Alexandre Défossez, Charlotte Caucheteux, Jérémy Rapin, Ori Kabeli, and Jean-Rémi King. Decoding speech perception from non-invasive brain recordings.Nature Machine Intelligence, 5(10):1097–1107, 2023

2023
[21]

Cross-attention-guided wavenet for eeg-to-mel spectrogram reconstruction

Hao Li, Yuan Fang, Xueliang Zhang, Fei Chen, and Guanglai Gao. Cross-attention-guided wavenet for eeg-to-mel spectrogram reconstruction. InProceedings of Interspeech, 2024

2024
[22]

Towards decoding individual words from non-invasive brain recordings.Nature Communications, 16(1):10521, 2025

Stéphane d’Ascoli, Corentin Bel, Jérémy Rapin, Hubert Banville, Yohann Benchetrit, Christophe Pallier, and Jean-Rémi King. Towards decoding individual words from non-invasive brain recordings.Nature Communications, 16(1):10521, 2025

2025
[23]

Decoding the temporal dynamics of spoken word and nonword processing from eeg

Bob McMurray, McCall E Sarrett, Samantha Chiu, Alexis K Black, Alice Wang, Rebecca Canale, and Richard N Aslin. Decoding the temporal dynamics of spoken word and nonword processing from eeg. NeuroImage, 260:119457, 2022

2022
[24]

Toward fully-end-to-end listened speech decoding from eeg signals

Jihwan Lee, Aditya Kommineni, Tiantian Feng, Kleanthis Avramidis, Xuan Shi, Sudarsana Reddy Kadiri, and Shrikanth Narayanan. Toward fully-end-to-end listened speech decoding from eeg signals. In Interspeech, pages 1500–1504, 2024

2024
[25]

En- hancing listened speech decoding from eeg via parallel phoneme sequence prediction

Jihwan Lee, Tiantian Feng, Aditya Kommineni, Sudarsana Reddy Kadiri, and Shrikanth Narayanan. En- hancing listened speech decoding from eeg via parallel phoneme sequence prediction. InIEEE International Conference on Acoustics, Speech and Signal Processing, pages 1–5. IEEE, 2025

2025
[26]

Robust speech recognition via large-scale weak supervision

Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, and Ilya Sutskever. Robust speech recognition via large-scale weak supervision. InInternational conference on machine learning, pages 28492–28518. PMLR, 2023

2023
[27]

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova

Loïc Barrault, Yu-An Chung, Mariano Cora Meglioli, David Dale, Ning Dong, Paul-Ambroise Duquenne, Hady Elsahar, Hongyu Gong, Kevin Heffernan, John Hoffman, et al. Seamlessm4t: Massively multilingual & multimodal machine translation.arXiv preprint arXiv:2308.11596, 2023

work page arXiv 2023
[28]

Available: https://arxiv.org/abs/2312.05187

Loïc Barrault, Yu-An Chung, Mariano Coria Meglioli, David Dale, Ning Dong, Mark Duppenthaler, Paul-Ambroise Duquenne, Brian Ellis, Hady Elsahar, Justin Haaheim, et al. Seamless: Multilingual expressive and streaming speech translation.arXiv preprint arXiv:2312.05187, 2023

work page arXiv 2023
[29]

Funasr: A fundamental end-to-end speech recognition toolkit

Zhifu Gao, Zerui Li, Jiaming Wang, Haoneng Luo, Xian Shi, Mengzhe Chen, Yabin Li, Lingyun Zuo, Zhihao Du, and Shiliang Zhang. Funasr: A fundamental end-to-end speech recognition toolkit. In Proceedings of Interspeech, pages 1593–1597, 2023

2023
[30]

Qwen3-ASR Technical Report

Xian Shi, Xiong Wang, Zhifang Guo, Yongqi Wang, Pei Zhang, Xinyu Zhang, Zishan Guo, Hongkun Hao, Yu Xi, Baosong Yang, et al. Qwen3-asr technical report.arXiv preprint arXiv:2601.21337, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[31]

CosyVoice 2: Scalable Streaming Speech Synthesis with Large Language Models

Zhihao Du, Yuxuan Wang, Qian Chen, Xian Shi, Xiang Lv, Tianyu Zhao, Zhifu Gao, Yexin Yang, Changfeng Gao, Hui Wang, et al. Cosyvoice 2: Scalable streaming speech synthesis with large language models.arXiv preprint arXiv:2412.10117, 2024. 11

work page internal anchor Pith review Pith/arXiv arXiv 2024
[32]

F5-tts: A fairytaler that fakes fluent and faithful speech with flow matching

Yushen Chen, Zhikang Niu, Ziyang Ma, Keqi Deng, Chunhui Wang, JianZhao JianZhao, Kai Yu, and Xie Chen. F5-tts: A fairytaler that fakes fluent and faithful speech with flow matching. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics, pages 6255–6271, 2025

2025
[33]

Fish-speech: Leveraging large language models for advanced multilingual text-to-speech synthesis.arXiv preprint arXiv:2411.01156, 2024

Shijia Liao, Yuxuan Wang, Tianyu Li, Yifan Cheng, Ruoyi Zhang, Rongzhi Zhou, and Yijin Xing. Fish- speech: Leveraging large language models for advanced multilingual text-to-speech synthesis.arXiv preprint arXiv:2411.01156, 2024

work page arXiv 2024
[34]

Qwen3-TTS Technical Report

Hangrui Hu, Xinfa Zhu, Ting He, Dake Guo, Bin Zhang, Xiong Wang, Zhifang Guo, Ziyue Jiang, Hongkun Hao, Zishan Guo, et al. Qwen3-tts technical report.arXiv preprint arXiv:2601.15621, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[35]

Qwen3 Technical Report

An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report.arXiv preprint arXiv:2505.09388, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[36]

Deepseek-r1 incentivizes reasoning in llms through reinforcement learning

Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, et al. Deepseek-r1 incentivizes reasoning in llms through reinforcement learning. Nature, 645(8081):633–638, 2025

2025
[37]

The Llama 3 Herd of Models

Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al- Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[38]

Brain decoding: toward real-time reconstruction of visual perception

Yohann Benchetrit, Hubert Banville, and Jean-Remi King. Brain decoding: toward real-time reconstruction of visual perception. InThe Twelfth International Conference on Learning Representations
[39]

Seeing beyond the brain: Conditional diffusion model with sparse masked modeling for vision decoding

Zijiao Chen, Jiaxin Qing, Tiange Xiang, Wan Lin Yue, and Juan Helen Zhou. Seeing beyond the brain: Conditional diffusion model with sparse masked modeling for vision decoding. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 22710–22720, 2023

2023
[40]

Reconstructing the mind’s eye: fmri-to- image with contrastive learning and diffusion priors.Advances in Neural Information Processing Systems, 36:24705–24728, 2023

Paul Scotti, Atmadeep Banerjee, Jimmie Goode, Stepan Shabalin, Alex Nguyen, Aidan Dempster, Nathalie Verlinde, Elad Yundler, David Weisberg, Kenneth Norman, et al. Reconstructing the mind’s eye: fmri-to- image with contrastive learning and diffusion priors.Advances in Neural Information Processing Systems, 36:24705–24728, 2023

2023
[41]

Mindeye2: shared-subject models enable fmri-to-image with 1 hour of data

Paul S Scotti, Mihir Tripathy, Cesare Kadir Torrico Villanueva, Reese Kneeland, Tong Chen, Ashutosh Narang, Charan Santhirasegaran, Jonathan Xu, Thomas Naselaris, Kenneth A Norman, et al. Mindeye2: shared-subject models enable fmri-to-image with 1 hour of data. InProceedings of the 41st International Conference on Machine Learning, pages 44038–44059, 2024

2024
[42]

Visual decoding and reconstruction via eeg embeddings with guided diffusion

Dongyang Li, Chen Wei, Shiying Li, Jiachen Zou, and Quanying Liu. Visual decoding and reconstruction via eeg embeddings with guided diffusion. InProceedings of the 38th International Conference on Neural Information Processing Systems, pages 102822–102864, 2024

2024
[43]

Cinematic mindscapes: High-quality video reconstruction from brain activity.Advances in Neural Information Processing Systems, 36:24841–24858, 2023

Zijiao Chen, Jiaxin Qing, and Juan Helen Zhou. Cinematic mindscapes: High-quality video reconstruction from brain activity.Advances in Neural Information Processing Systems, 36:24841–24858, 2023

2023
[44]

Neuroclips: Towards high-fidelity and smooth fmri-to-video reconstruction.Advances in Neural Information Processing Systems, 37:51655–51683, 2024

Zixuan Gong, Guangyin Bao, Qi Zhang, Zhongwei Wan, Duoqian Miao, Shoujin Wang, Lei Zhu, Chang- wei Wang, Rongtao Xu, Liang Hu, et al. Neuroclips: Towards high-fidelity and smooth fmri-to-video reconstruction.Advances in Neural Information Processing Systems, 37:51655–51683, 2024

2024
[45]

Eeg2video: Towards decoding dynamic visual perception from eeg signals.Advances in Neural Information Processing Systems, 37:72245–72273, 2024

Xuan-Hao Liu, Yan-Kai Liu, Yansen Wang, Kan Ren, Hanwen Shi, Zilong Wang, Dongsheng Li, Bao- Liang Lu, and Wei-Long Zheng. Eeg2video: Towards decoding dynamic visual perception from eeg signals.Advances in Neural Information Processing Systems, 37:72245–72273, 2024

2024
[46]

Animate your thoughts: Reconstruction of dynamic natural vision from human brain activity

Yizhuo Lu, Changde Du, Chong Wang, Xuanliu Zhu, Liuyun Jiang, Xujin Li, and Huiguang He. Animate your thoughts: Reconstruction of dynamic natural vision from human brain activity. InThe Thirteenth International Conference on Learning Representations, 2025

2025
[47]

Reanimating images using neural representations of dynamic stimuli

Jacob Yeung, Andrew F Luo, Gabriel Sarch, Margaret M Henderson, Deva Ramanan, and Michael J Tarr. Reanimating images using neural representations of dynamic stimuli. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5331–5343, 2025

2025
[48] Jonathan R Brennan and John T Hale. Hierarchical structure guides rapid linguistic predictions during naturalistic listening. PloS one, 14(1):e0207741, 2019.
[49] Laura Gwilliams, Graham Flick, Alec Marantz, Liina Pylkkänen, David Poeppel, and Jean-Rémi King. Introducing meg-masc a high-quality magneto-encephalography dataset for evaluating natural speech processing. Scientific data, 10(1):862, 2023.
[50] Francis R Willett, Erin M Kunz, Chaofei Fan, Donald T Avansino, Guy H Wilson, Eun Young Choi, Foram Kamdar, Matthew F Glasser, Leigh R Hochberg, Shaul Druckmann, et al. A high-performance speech neuroprosthesis. Nature, 620(7976):1031–1036, 2023.
[51] Justin J Jude, Stephanie Haro, Hadar Levi-Aharoni, Hiroaki Hashimoto, Alexander J Acosta, Nicholas S Card, Maitreyee Wairagkar, David M Brandman, Sergey D Stavisky, Ziv M Williams, et al. Decoding intended speech with an intracortical brain-computer interface in a person with long-standing anarthria and locked-in syndrome. Cell Reports, 45(4), 2026.
[52] Young-Eun Lee, Seo-Hyun Lee, Sang-Ho Kim, and Seong-Whan Lee. Towards voice reconstruction from eeg during imagined speech. In Proceedings of the AAAI conference on artificial intelligence, volume 37, pages 6030–6038, 2023.
[53] Mokhles M Abdulghani, Wilbur L Walters, and Khalid H Abed. Imagined speech classification using eeg and deep learning. Bioengineering, 10(6):649, 2023.
[54] Motoshige Sato, Kenichi Tomeoka, Ilya Horiguchi, Kai Arulkumaran, Ryota Kanai, and Shuntaro Sasai. Scaling law in neural data: Non-invasive speech decoding with 175 hours of eeg data. arXiv preprint arXiv:2407.07595, 2024.
[55] Yiqian Yang, Hyejeong Jo, Yiqun Duan, Qiang Zhang, Jinni Zhou, Xuming Hu, Won Hee Lee, Renjing Xu, and Hui Xiong. Mad: Multi-alignment meg-to-text decoding. arXiv preprint arXiv:2406.01512, 2024.
[56] Jilong Li, Zhenxi Song, Jiaqi Wang, Meishan Zhang, Honghai Liu, Min Zhang, and Zhiguo Zhang. Brainecho: Semantic brain signal decoding through vector-quantized spectrogram reconstruction for whisper-enhanced text generation. In Findings of the Association for Computational Linguistics: ACL 2025, pages 2762–2778, 2025.
[57] Yiqian Yang, Yiqun Duan, Qiang Zhang, Hyejeong Jo, Jinni Zhou, Won Hee Lee, Renjing Xu, and Hui Xiong. Neuspeech: Decode neural signal as speech. In IEEE International Conference on Acoustics, Speech and Signal Processing, pages 6636–6640. IEEE, 2026.
[58] Sparsh Rastogi, Harsh Dadwal, Khushboo Modi, Jatin Bedi, and Jasmeet Singh. Towards sentence level imagined speech generation from eeg signals. In Proc. Interspeech 2025, pages 5558–5562, 2025.
[59] Hyejeong Jo, Yiqian Yang, Juhyeok Han, Yiqun Duan, Hui Xiong, and Won Hee Lee. Are eeg-to-text models working? arXiv preprint arXiv:2405.06459, 2024.
[60] Dario Amodei, Sundaram Ananthanarayanan, Rishita Anubhai, Jingliang Bai, Eric Battenberg, Carl Case, Jared Casper, Bryan Catanzaro, Qiang Cheng, Guoliang Chen, et al. Deep speech 2: End-to-end speech recognition in english and mandarin. In International conference on machine learning, pages 173–182. PMLR, 2016.
[61] Anmol Gulati, James Qin, Chung-Cheng Chiu, Niki Parmar, Yu Zhang, Jiahui Yu, Wei Han, Shibo Wang, Zhengdong Zhang, Yonghui Wu, et al. Conformer: Convolution-augmented transformer for speech recognition. In Proc. Interspeech 2020, pages 5036–5040, 2020.
[62] Alexei Baevski, Yuhao Zhou, Abdelrahman Mohamed, and Michael Auli. wav2vec 2.0: A framework for self-supervised learning of speech representations. Advances in neural information processing systems, 33:12449–12460, 2020.
[63] Sercan Ö Arık, Mike Chrzanowski, Adam Coates, Gregory Diamos, Andrew Gibiansky, Yongguo Kang, Xian Li, John Miller, Andrew Ng, Jonathan Raiman, et al. Deep voice: Real-time neural text-to-speech. In International conference on machine learning, pages 195–204. PMLR, 2017.
[64] Yi Ren, Yangjun Ruan, Xu Tan, Tao Qin, Sheng Zhao, Zhou Zhao, and Tie-Yan Liu. Fastspeech: Fast, robust and controllable text to speech. Advances in neural information processing systems, 32, 2019.
[65] Yi Ren, Chenxu Hu, Xu Tan, Tao Qin, Sheng Zhao, Zhou Zhao, and Tie-Yan Liu. Fastspeech 2: Fast and high-quality end-to-end text to speech. In International Conference on Learning Representations.
[66] Jaehyeon Kim, Sungwon Kim, Jungil Kong, and Sungroh Yoon. Glow-tts: A generative flow for text-to-speech via monotonic alignment search. Advances in Neural Information Processing Systems, 33:8067–8077, 2020.
[67] Jaehyeon Kim, Jungil Kong, and Juhee Son. Conditional variational autoencoder with adversarial learning for end-to-end text-to-speech. In International conference on machine learning, pages 5530–5540. PMLR, 2021.
[68] Sang-gil Lee, Wei Ping, Boris Ginsburg, Bryan Catanzaro, and Sungroh Yoon. Bigvgan: A universal neural vocoder with large-scale training. In The Eleventh International Conference on Learning Representations, 2023.
[69] Jungil Kong, Jaehyeon Kim, and Jaekyoung Bae. Hifi-gan: Generative adversarial networks for efficient and high fidelity speech synthesis. Advances in neural information processing systems, 33:17022–17033, 2020.
[70] Robert Kubichek. Mel-cepstral distance measure for objective speech quality assessment. In Proceedings of IEEE pacific rim conference on communications computers and signal processing, volume 1, pages 125–128. IEEE, 1993.
[71] Wei-Ning Hsu, Benjamin Bolte, Yao-Hung Hubert Tsai, Kushal Lakhotia, Ruslan Salakhutdinov, and Abdelrahman Mohamed. Hubert: Self-supervised speech representation learning by masked prediction of hidden units. IEEE/ACM transactions on audio, speech, and language processing, 29:3451–3460, 2021.
[72] Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q Weinberger, and Yoav Artzi. Bertscore: Evaluating text generation with bert. In International Conference on Learning Representations, 2020.
[73] Sanyuan Chen, Chengyi Wang, Zhengyang Chen, Yu Wu, Shujie Liu, Zhuo Chen, Jinyu Li, Naoyuki Kanda, Takuya Yoshioka, Xiong Xiao, et al. Wavlm: Large-scale self-supervised pre-training for full stack speech processing. IEEE Journal of Selected Topics in Signal Processing, 16(6):1505–1518, 2022.
[74] Takaaki Saeki, Detai Xin, Wataru Nakata, Tomoki Koriyama, Shinnosuke Takamichi, and Hiroshi Saruwatari. Utmos: Utokyo-sarulab system for voicemos challenge 2022. Interspeech, 2022.

A Additional Experiment Settings

A.1 Datasets and Preprocessing

Brennan EEG dataset. Brennan EEG dataset is a naturalistic listening EEG dataset introduced by Brennan and Hale...

[1] Gregory Hickok and David Poeppel. The cortical organization of speech processing. Nature reviews neuroscience, 8(5):393–402, 2007.

[2] Josef P Rauschecker and Sophie K Scott. Maps and streams in the auditory cortex: nonhuman primates illuminate human speech processing. Nature neuroscience, 12(6):718–724, 2009.

[3] Matthew H Davis and Ingrid S Johnsrude. Hierarchical processing in spoken language comprehension. Journal of Neuroscience, 23(8):3423–3431, 2003.

[4] Nima Mesgarani, Connie Cheung, Keith Johnson, and Edward F Chang. Phonetic feature encoding in human superior temporal gyrus. Science, 343(6174):1006–1010, 2014.

[5] Maitreyee Wairagkar, Nicholas S Card, Tyler Singer-Clark, Xianda Hou, Carrina Iacobacci, Lee M Miller, Leigh R Hochberg, David M Brandman, and Sergey D Stavisky. An instantaneous voice-synthesis neuroprosthesis. Nature, 644(8075):145–152, 2025.

[6] Gopala K Anumanchipalli, Josh Chartier, and Edward F Chang. Speech synthesis from neural decoding of spoken sentences. Nature, 568(7753):493–498, 2019.

[7] Erin M Kunz, Benyamin Abramovich Krasa, Foram Kamdar, Donald T Avansino, Nick Hahn, Seonghyun Yoon, Akansha Singh, Samuel R Nason-Tomaszewski, Nicholas S Card, Justin J Jude, et al. Inner speech in motor cortex and implications for speech neuroprostheses. Cell, 188(17):4658–4673, 2025.

[8] Kaylo T Littlejohn, Cheol Jun Cho, Jessie R Liu, Alexander B Silva, Bohan Yu, Vanessa R Anderson, Cady M Kurtz-Miott, Samantha Brosler, Anshul P Kashyap, Irina P Hallinan, et al. A streaming brain-to-voice neuroprosthesis to restore naturalistic communication. Nature neuroscience, 28(4):902–912, 2025.

[9] Sean L Metzger, Kaylo T Littlejohn, Alexander B Silva, David A Moses, Margaret P Seaton, Ran Wang, Maximilian E Dougherty, Jessie R Liu, Peter Wu, Michael A Berger, et al. A high-performance neuroprosthesis for speech decoding and avatar control. Nature, 620(7976):1037–1046, 2023.

[10] Edmund C Lalor and John J Foxe. Neural responses to uninterrupted natural speech can be extracted with precise temporal resolution. European journal of neuroscience, 31(1):189–193, 2010.

[11] Florian Destoky, Morgane Philippe, Julie Bertels, Marie Verhasselt, Nicolas Coquelet, Marc Vander Ghinst, Vincent Wens, Xavier De Tiège, and Mathieu Bourguignon. Comparing the potential of meg and eeg to uncover brain tracking of speech temporal envelope. Neuroimage, 184:201–213, 2019.

[12] James A O’Sullivan, Alan J Power, Nima Mesgarani, Siddharth Rajaram, John J Foxe, Barbara G Shinn-Cunningham, Malcolm Slaney, Shihab A Shamma, and Edmund C Lalor. Attentional selection in a cocktail party environment can be decoded from single-trial eeg. Cerebral cortex, 25(7):1697–1706, 2015.

[13] Zihao Xu, Yanru Bai, Ran Zhao, Hongmei Hu, Guangjian Ni, and Dong Ming. Decoding selective auditory attention with eeg using a transformer model. Methods, 204:410–417, 2022.

[14] Bernd Accou, Jonas Vanthornhout, Hugo Van hamme, and Tom Francart. Decoding of the speech envelope from eeg using the vlaai deep neural network. Scientific Reports, 13(1):812, 2023.

[15]
Rini A Sharon and Hema A Murthy. The “sound of silence” in eeg—cognitive voice activity detection. In Proceedings of Interspeech, pages 2767–2771, 2020

[16] Miran Özdogan, Gilad Landau, Gereon Elvers, Dulhan Jayalath, Pratik Somaiya, Francesco Mantegna, Mark Woolrich, and Oiwi Parker Jones. Libribrain: Over 50 hours of within-subject meg to improve speech decoding methods at scale. arXiv preprint arXiv:2506.02098, 2025.

[17] Gilad Landau, Miran Özdogan, Gereon Elvers, Francesco Mantegna, Pratik Somaiya, Dulhan Jayalath, Luisa Kurth, Teyun Kwon, Brendan Shillingford, Greg Farquhar, et al. The 2025 pnpl competition: Speech detection and phoneme classification in the libribrain dataset. arXiv preprint arXiv:2506.10165, 2025.

[18] Xabier de Zuazo, Ibon Saratxaga, and Eva Navas. Megconformer: Conformer-based meg decoder for robust speech and phoneme classification. arXiv preprint arXiv:2512.01443, 2025.

[19] Xiran Xu, Yujie Yan, Xihong Wu, and Jing Chen. Shine: Sequential hierarchical integration network for eeg and meg. arXiv preprint arXiv:2602.23960, 2026.

[20] Alexandre Défossez, Charlotte Caucheteux, Jérémy Rapin, Ori Kabeli, and Jean-Rémi King. Decoding speech perception from non-invasive brain recordings. Nature Machine Intelligence, 5(10):1097–1107, 2023.

[21] Hao Li, Yuan Fang, Xueliang Zhang, Fei Chen, and Guanglai Gao. Cross-attention-guided wavenet for eeg-to-mel spectrogram reconstruction. In Proceedings of Interspeech, 2024.

[22] Stéphane d’Ascoli, Corentin Bel, Jérémy Rapin, Hubert Banville, Yohann Benchetrit, Christophe Pallier, and Jean-Rémi King. Towards decoding individual words from non-invasive brain recordings. Nature Communications, 16(1):10521, 2025.

[23] Bob McMurray, McCall E Sarrett, Samantha Chiu, Alexis K Black, Alice Wang, Rebecca Canale, and Richard N Aslin. Decoding the temporal dynamics of spoken word and nonword processing from eeg. NeuroImage, 260:119457, 2022.

[24] Jihwan Lee, Aditya Kommineni, Tiantian Feng, Kleanthis Avramidis, Xuan Shi, Sudarsana Reddy Kadiri, and Shrikanth Narayanan. Toward fully-end-to-end listened speech decoding from eeg signals. In Interspeech, pages 1500–1504, 2024.

[25] Jihwan Lee, Tiantian Feng, Aditya Kommineni, Sudarsana Reddy Kadiri, and Shrikanth Narayanan. Enhancing listened speech decoding from eeg via parallel phoneme sequence prediction. In IEEE International Conference on Acoustics, Speech and Signal Processing, pages 1–5. IEEE, 2025.

[26] Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, and Ilya Sutskever. Robust speech recognition via large-scale weak supervision. In International conference on machine learning, pages 28492–28518. PMLR, 2023.

[27] Loïc Barrault, Yu-An Chung, Mariano Cora Meglioli, David Dale, Ning Dong, Paul-Ambroise Duquenne, Hady Elsahar, Hongyu Gong, Kevin Heffernan, John Hoffman, et al. Seamlessm4t: Massively multilingual & multimodal machine translation. arXiv preprint arXiv:2308.11596, 2023.

[28] Loïc Barrault, Yu-An Chung, Mariano Coria Meglioli, David Dale, Ning Dong, Mark Duppenthaler, Paul-Ambroise Duquenne, Brian Ellis, Hady Elsahar, Justin Haaheim, et al. Seamless: Multilingual expressive and streaming speech translation. arXiv preprint arXiv:2312.05187, 2023.

[29] Zhifu Gao, Zerui Li, Jiaming Wang, Haoneng Luo, Xian Shi, Mengzhe Chen, Yabin Li, Lingyun Zuo, Zhihao Du, and Shiliang Zhang. Funasr: A fundamental end-to-end speech recognition toolkit. In Proceedings of Interspeech, pages 1593–1597, 2023.

[30] Xian Shi, Xiong Wang, Zhifang Guo, Yongqi Wang, Pei Zhang, Xinyu Zhang, Zishan Guo, Hongkun Hao, Yu Xi, Baosong Yang, et al. Qwen3-asr technical report. arXiv preprint arXiv:2601.21337, 2026.

[31] Zhihao Du, Yuxuan Wang, Qian Chen, Xian Shi, Xiang Lv, Tianyu Zhao, Zhifu Gao, Yexin Yang, Changfeng Gao, Hui Wang, et al. Cosyvoice 2: Scalable streaming speech synthesis with large language models. arXiv preprint arXiv:2412.10117, 2024.

[32] Yushen Chen, Zhikang Niu, Ziyang Ma, Keqi Deng, Chunhui Wang, JianZhao JianZhao, Kai Yu, and Xie Chen. F5-tts: A fairytaler that fakes fluent and faithful speech with flow matching. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics, pages 6255–6271, 2025.

[33] Shijia Liao, Yuxuan Wang, Tianyu Li, Yifan Cheng, Ruoyi Zhang, Rongzhi Zhou, and Yijin Xing. Fish-speech: Leveraging large language models for advanced multilingual text-to-speech synthesis. arXiv preprint arXiv:2411.01156, 2024.

[34] Hangrui Hu, Xinfa Zhu, Ting He, Dake Guo, Bin Zhang, Xiong Wang, Zhifang Guo, Ziyue Jiang, Hongkun Hao, Zishan Guo, et al. Qwen3-tts technical report. arXiv preprint arXiv:2601.15621, 2026.

[35] An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025.

[36] Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, et al. Deepseek-r1 incentivizes reasoning in llms through reinforcement learning. Nature, 645(8081):633–638, 2025.

[37] Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024.

[38] Yohann Benchetrit, Hubert Banville, and Jean-Remi King. Brain decoding: toward real-time reconstruction of visual perception. In The Twelfth International Conference on Learning Representations.

[39] Zijiao Chen, Jiaxin Qing, Tiange Xiang, Wan Lin Yue, and Juan Helen Zhou. Seeing beyond the brain: Conditional diffusion model with sparse masked modeling for vision decoding. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 22710–22720, 2023.

[40] Paul Scotti, Atmadeep Banerjee, Jimmie Goode, Stepan Shabalin, Alex Nguyen, Aidan Dempster, Nathalie Verlinde, Elad Yundler, David Weisberg, Kenneth Norman, et al. Reconstructing the mind’s eye: fmri-to-image with contrastive learning and diffusion priors. Advances in Neural Information Processing Systems, 36:24705–24728, 2023.

[41] Paul S Scotti, Mihir Tripathy, Cesare Kadir Torrico Villanueva, Reese Kneeland, Tong Chen, Ashutosh Narang, Charan Santhirasegaran, Jonathan Xu, Thomas Naselaris, Kenneth A Norman, et al. Mindeye2: shared-subject models enable fmri-to-image with 1 hour of data. In Proceedings of the 41st International Conference on Machine Learning, pages 44038–44059, 2024.

[42] Dongyang Li, Chen Wei, Shiying Li, Jiachen Zou, and Quanying Liu. Visual decoding and reconstruction via eeg embeddings with guided diffusion. In Proceedings of the 38th International Conference on Neural Information Processing Systems, pages 102822–102864, 2024.

[43] Zijiao Chen, Jiaxin Qing, and Juan Helen Zhou. Cinematic mindscapes: High-quality video reconstruction from brain activity. Advances in Neural Information Processing Systems, 36:24841–24858, 2023.

[44] [44]

Neuroclips: Towards high-fidelity and smooth fmri-to-video reconstruction.Advances in Neural Information Processing Systems, 37:51655–51683, 2024

Zixuan Gong, Guangyin Bao, Qi Zhang, Zhongwei Wan, Duoqian Miao, Shoujin Wang, Lei Zhu, Chang- wei Wang, Rongtao Xu, Liang Hu, et al. Neuroclips: Towards high-fidelity and smooth fmri-to-video reconstruction.Advances in Neural Information Processing Systems, 37:51655–51683, 2024

2024

[45] [45]

Eeg2video: Towards decoding dynamic visual perception from eeg signals.Advances in Neural Information Processing Systems, 37:72245–72273, 2024

Xuan-Hao Liu, Yan-Kai Liu, Yansen Wang, Kan Ren, Hanwen Shi, Zilong Wang, Dongsheng Li, Bao- Liang Lu, and Wei-Long Zheng. Eeg2video: Towards decoding dynamic visual perception from eeg signals.Advances in Neural Information Processing Systems, 37:72245–72273, 2024

2024

[46] [46]

Animate your thoughts: Reconstruction of dynamic natural vision from human brain activity

Yizhuo Lu, Changde Du, Chong Wang, Xuanliu Zhu, Liuyun Jiang, Xujin Li, and Huiguang He. Animate your thoughts: Reconstruction of dynamic natural vision from human brain activity. InThe Thirteenth International Conference on Learning Representations, 2025

2025

[47] [47]

Reanimating images using neural representations of dynamic stimuli

Jacob Yeung, Andrew F Luo, Gabriel Sarch, Margaret M Henderson, Deva Ramanan, and Michael J Tarr. Reanimating images using neural representations of dynamic stimuli. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5331–5343, 2025

2025

[48] [48]

Hierarchical structure guides rapid linguistic predictions during naturalistic listening.PloS one, 14(1):e0207741, 2019

Jonathan R Brennan and John T Hale. Hierarchical structure guides rapid linguistic predictions during naturalistic listening.PloS one, 14(1):e0207741, 2019

2019

[49] [49]

Introducing meg-masc a high-quality magneto-encephalography dataset for evaluating natural speech processing.Scientific data, 10(1):862, 2023

Laura Gwilliams, Graham Flick, Alec Marantz, Liina Pylkkänen, David Poeppel, and Jean-Rémi King. Introducing meg-masc a high-quality magneto-encephalography dataset for evaluating natural speech processing.Scientific data, 10(1):862, 2023. 12

2023

[50] [50]

A high-performance speech neuroprosthesis.Nature, 620(7976):1031–1036, 2023

Francis R Willett, Erin M Kunz, Chaofei Fan, Donald T Avansino, Guy H Wilson, Eun Young Choi, Foram Kamdar, Matthew F Glasser, Leigh R Hochberg, Shaul Druckmann, et al. A high-performance speech neuroprosthesis.Nature, 620(7976):1031–1036, 2023

2023

[51] [51]

Decoding intended speech with an intracortical brain-computer interface in a person with long-standing anarthria and locked-in syndrome.Cell Reports, 45(4), 2026

Justin J Jude, Stephanie Haro, Hadar Levi-Aharoni, Hiroaki Hashimoto, Alexander J Acosta, Nicholas S Card, Maitreyee Wairagkar, David M Brandman, Sergey D Stavisky, Ziv M Williams, et al. Decoding intended speech with an intracortical brain-computer interface in a person with long-standing anarthria and locked-in syndrome.Cell Reports, 45(4), 2026

2026

[52] [52]

Towards voice reconstruction from eeg during imagined speech

Young-Eun Lee, Seo-Hyun Lee, Sang-Ho Kim, and Seong-Whan Lee. Towards voice reconstruction from eeg during imagined speech. InProceedings of the AAAI conference on artificial intelligence, volume 37, pages 6030–6038, 2023

2023

[53] [53]

Imagined speech classification using eeg and deep learning.Bioengineering, 10(6):649, 2023

Mokhles M Abdulghani, Wilbur L Walters, and Khalid H Abed. Imagined speech classification using eeg and deep learning.Bioengineering, 10(6):649, 2023

2023

[54] [54]

Scaling law in neural data: Non-invasive speech decoding with 175 hours of eeg data.arXiv preprint arXiv:2407.07595, 2024

Motoshige Sato, Kenichi Tomeoka, Ilya Horiguchi, Kai Arulkumaran, Ryota Kanai, and Shuntaro Sasai. Scaling law in neural data: Non-invasive speech decoding with 175 hours of eeg data.arXiv preprint arXiv:2407.07595, 2024

work page arXiv 2024

[55] [55]

Mad: Multi-alignment meg-to-text decoding.arXiv preprint arXiv:2406.01512, 2024

Yiqian Yang, Hyejeong Jo, Yiqun Duan, Qiang Zhang, Jinni Zhou, Xuming Hu, Won Hee Lee, Renjing Xu, and Hui Xiong. Mad: Multi-alignment meg-to-text decoding.arXiv preprint arXiv:2406.01512, 2024

work page arXiv 2024

[56] [56]

Brainecho: Semantic brain signal decoding through vector-quantized spectrogram reconstruction for whisper-enhanced text generation

Jilong Li, Zhenxi Song, Jiaqi Wang, Meishan Zhang, Honghai Liu, Min Zhang, and Zhiguo Zhang. Brainecho: Semantic brain signal decoding through vector-quantized spectrogram reconstruction for whisper-enhanced text generation. InFindings of the Association for Computational Linguistics: ACL 2025, pages 2762–2778, 2025

2025

[57] [57]

Neuspeech: Decode neural signal as speech

Yiqian Yang, Yiqun Duan, Qiang Zhang, Hyejeong Jo, Jinni Zhou, Won Hee Lee, Renjing Xu, and Hui Xiong. Neuspeech: Decode neural signal as speech. InIEEE International Conference on Acoustics, Speech and Signal Processing, pages 6636–6640. IEEE, 2026

2026

[58] [58]

Towards sentence level imagined speech generation from eeg signals

Sparsh Rastogi, Harsh Dadwal, Khushboo Modi, Jatin Bedi, and Jasmeet Singh. Towards sentence level imagined speech generation from eeg signals. InProc. Interspeech 2025, pages 5558–5562, 2025

2025

[59] [59]

Are eeg-to-text models working?arXiv preprint arXiv:2405.06459, 2024

Hyejeong Jo, Yiqian Yang, Juhyeok Han, Yiqun Duan, Hui Xiong, and Won Hee Lee. Are eeg-to-text models working?arXiv preprint arXiv:2405.06459, 2024

work page arXiv 2024

[60] [60]

Deep speech 2: End-to-end speech recognition in english and mandarin

Dario Amodei, Sundaram Ananthanarayanan, Rishita Anubhai, Jingliang Bai, Eric Battenberg, Carl Case, Jared Casper, Bryan Catanzaro, Qiang Cheng, Guoliang Chen, et al. Deep speech 2: End-to-end speech recognition in english and mandarin. InInternational conference on machine learning, pages 173–182. PMLR, 2016

2016

[61] [61]

Conformer: Convolution-augmented transformer for speech recognition

Anmol Gulati, James Qin, Chung-Cheng Chiu, Niki Parmar, Yu Zhang, Jiahui Yu, Wei Han, Shibo Wang, Zhengdong Zhang, Yonghui Wu, et al. Conformer: Convolution-augmented transformer for speech recognition. InProc. Interspeech 2020, pages 5036–5040, 2020

2020

[62] [62]

wav2vec 2.0: A framework for self-supervised learning of speech representations.Advances in neural information processing systems, 33:12449–12460, 2020

Alexei Baevski, Yuhao Zhou, Abdelrahman Mohamed, and Michael Auli. wav2vec 2.0: A framework for self-supervised learning of speech representations.Advances in neural information processing systems, 33:12449–12460, 2020

2020

[63] [63]

Deep voice: Real-time neural text-to-speech

Sercan Ö Arık, Mike Chrzanowski, Adam Coates, Gregory Diamos, Andrew Gibiansky, Yongguo Kang, Xian Li, John Miller, Andrew Ng, Jonathan Raiman, et al. Deep voice: Real-time neural text-to-speech. In International conference on machine learning, pages 195–204. PMLR, 2017

2017

[64] [64]

Fastspeech: Fast, robust and controllable text to speech.Advances in neural information processing systems, 32, 2019

Yi Ren, Yangjun Ruan, Xu Tan, Tao Qin, Sheng Zhao, Zhou Zhao, and Tie-Yan Liu. Fastspeech: Fast, robust and controllable text to speech.Advances in neural information processing systems, 32, 2019

2019

[65] [65]

Fastspeech 2: Fast and high-quality end-to-end text to speech

Yi Ren, Chenxu Hu, Xu Tan, Tao Qin, Sheng Zhao, Zhou Zhao, and Tie-Yan Liu. Fastspeech 2: Fast and high-quality end-to-end text to speech. InInternational Conference on Learning Representations

[66] [66]

Glow-tts: A generative flow for text-to- speech via monotonic alignment search.Advances in Neural Information Processing Systems, 33:8067– 8077, 2020

Jaehyeon Kim, Sungwon Kim, Jungil Kong, and Sungroh Yoon. Glow-tts: A generative flow for text-to- speech via monotonic alignment search.Advances in Neural Information Processing Systems, 33:8067– 8077, 2020

2020

[67] [67]

Conditional variational autoencoder with adversarial learning for end-to-end text-to-speech

Jaehyeon Kim, Jungil Kong, and Juhee Son. Conditional variational autoencoder with adversarial learning for end-to-end text-to-speech. In International conference on machine learning, pages 5530–5540. PMLR, 2021

2021

[68] [68]

Bigvgan: A universal neural vocoder with large-scale training

Sang-gil Lee, Wei Ping, Boris Ginsburg, Bryan Catanzaro, and Sungroh Yoon. Bigvgan: A universal neural vocoder with large-scale training. In The Eleventh International Conference on Learning Representations, 2023

2023

[69] [69]

Hifi-gan: Generative adversarial networks for efficient and high fidelity speech synthesis

Jungil Kong, Jaehyeon Kim, and Jaekyoung Bae. Hifi-gan: Generative adversarial networks for efficient and high fidelity speech synthesis. Advances in neural information processing systems, 33:17022–17033, 2020

2020

[70] [70]

Mel-cepstral distance measure for objective speech quality assessment

Robert Kubichek. Mel-cepstral distance measure for objective speech quality assessment. In Proceedings of IEEE pacific rim conference on communications computers and signal processing, volume 1, pages 125–128. IEEE, 1993

1993

[71] [71]

Hubert: Self-supervised speech representation learning by masked prediction of hidden units

Wei-Ning Hsu, Benjamin Bolte, Yao-Hung Hubert Tsai, Kushal Lakhotia, Ruslan Salakhutdinov, and Abdelrahman Mohamed. Hubert: Self-supervised speech representation learning by masked prediction of hidden units. IEEE/ACM transactions on audio, speech, and language processing, 29:3451–3460, 2021

2021

[72] [72]

Bertscore: Evaluating text generation with bert

Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q Weinberger, and Yoav Artzi. Bertscore: Evaluating text generation with bert. In International Conference on Learning Representations, 2020

2020

[73] [73]

Wavlm: Large-scale self-supervised pre-training for full stack speech processing

Sanyuan Chen, Chengyi Wang, Zhengyang Chen, Yu Wu, Shujie Liu, Zhuo Chen, Jinyu Li, Naoyuki Kanda, Takuya Yoshioka, Xiong Xiao, et al. Wavlm: Large-scale self-supervised pre-training for full stack speech processing. IEEE Journal of Selected Topics in Signal Processing, 16(6):1505–1518, 2022

2022

[74] [74]

Utmos: Utokyo-sarulab system for voicemos challenge 2022

Takaaki Saeki, Detai Xin, Wataru Nakata, Tomoki Koriyama, Shinnosuke Takamichi, and Hiroshi Saruwatari. Utmos: Utokyo-sarulab system for voicemos challenge 2022. Interspeech, 2022

2022

A Additional Experiment Settings

A.1 Datasets and Preprocessing

Brennan EEG dataset. Brennan EEG dataset is a naturalistic listening EEG dataset introduced by Brennan and Hale...