A cross-species neural foundation model for end-to-end speech decoding

Chaofei Fan; Francis R Willett; Han Yu; Jingyuan Li; Lea Duncker; Liam Paninski; Linyang He; Nima Mesgarani; Scott Linderman; Tingkai Liu

arxiv: 2511.21740 · v5 · pith:PWZJFKS5new · submitted 2025-11-21 · 💻 cs.CL · cs.AI

Yizi Zhang, Linyang He, Chaofei Fan, Tingkai Liu, Han Yu, Trung Le, Jingyuan Li, Scott Linderman, Lea Duncker, Francis R Willett, Nima Mesgarani, Liam Paninski

Pith reviewed 2026-05-17 19:56 UTC · model grok-4.3

keywords: brain-computer interface · speech decoding · end-to-end framework · neural foundation model · cross-species pretraining · contrastive learning · audio language models

The pith

A cross-species pretrained neural encoder enables end-to-end decoding of brain activity into sentences at 10.22 percent word error rate.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces an end-to-end BraIn-to-Text framework that translates neural signals directly into coherent sentences using one differentiable network instead of separate phoneme and language-model stages. A neural encoder pretrained across species and tasks provides representations that transfer to human attempted and imagined speech. When this encoder is aligned with audio large language models through contrastive learning, the word error rate drops from 24.69 percent to 10.22 percent on existing benchmarks. The same alignment also lets embeddings from attempted and imagined speech support generalization between the two tasks. This setup matters because it removes the need for hand-designed intermediate steps and opens the door to joint training of the full pipeline on larger neural datasets.

Core claim

A cross-task, cross-species pretrained neural encoder transfers representations to both attempted and imagined human speech and, when integrated end-to-end with audio large language models and trained with contrastive cross-modal alignment, reduces word error rate from 24.69 percent to 10.22 percent while also aligning embeddings to enable cross-task generalization.
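As a quick arithmetic check on the headline numbers, the drop from 24.69 percent to 10.22 percent can be expressed both in absolute percentage points and as a relative reduction. The helper name `wer_reduction` is an assumption of this sketch; only the two WER values come from the text above.

```python
def wer_reduction(baseline_wer: float, new_wer: float) -> tuple[float, float]:
    """Return (absolute reduction in percentage points, relative reduction in percent)."""
    absolute = baseline_wer - new_wer
    relative = 100.0 * absolute / baseline_wer
    return absolute, relative


# The two values below are the WERs quoted in the core claim.
absolute, relative = wer_reduction(24.69, 10.22)
print(f"absolute: {absolute:.2f} points, relative: {relative:.1f}%")
# prints "absolute: 14.47 points, relative: 58.6%"
```

Reporting both figures avoids ambiguity: "a 14.47-point drop" and "a roughly 59 percent relative reduction" describe the same result.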

What carries the argument

The cross-species pretrained neural encoder, whose learned representations transfer to human attempted and imagined speech recordings and support direct integration with audio language models.

If this is right

All decoding stages can be optimized jointly because the entire pipeline is a single differentiable network.
State-of-the-art results appear on the Brain-to-Text benchmarks even when the pretrained encoder is used only in a cascaded setting with an n-gram language model.
Small-scale audio large language models produce marked gains when paired with the aligned neural encoder.
Attempted and imagined speech embeddings become aligned enough to support generalization from one task to the other.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same pretraining strategy might be applied to neural data from additional recording modalities or animal models to further improve transfer.
Collecting more diverse cross-species datasets could reduce performance gaps across different human users.
If the encoder scales with dataset size, longer and more naturalistic recordings might yield further error-rate reductions.

Load-bearing premise

Representations learned by the cross-species pretrained neural encoder transfer effectively to human attempted and imagined speech recordings without major domain shift.

What would settle it

Running the end-to-end BIT model on the Brain-to-Text '24 or '25 test sets and obtaining a word error rate above 15 percent would show the claimed reduction does not hold.
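The test proposed above depends on how word error rate is computed. A minimal sketch, assuming the standard definition (word-level edit distance divided by the number of reference words) and corpus-level aggregation; the function names are hypothetical, and the benchmark's exact text normalization and aggregation are not specified in this page.

```python
def word_edit_distance(reference: list[str], hypothesis: list[str]) -> int:
    """Levenshtein distance over words (substitutions, insertions, deletions)."""
    # prev[j] holds the distance between the first i-1 reference words
    # and the first j hypothesis words.
    prev = list(range(len(hypothesis) + 1))
    for i, ref_word in enumerate(reference, start=1):
        curr = [i] + [0] * len(hypothesis)
        for j, hyp_word in enumerate(hypothesis, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(prev[j] + 1,         # deletion
                          curr[j - 1] + 1,     # insertion
                          prev[j - 1] + cost)  # match or substitution
        prev = curr
    return prev[-1]


def corpus_wer(references: list[str], hypotheses: list[str]) -> float:
    """Total word edits divided by total reference words (assumed aggregation)."""
    if len(references) != len(hypotheses):
        raise ValueError("references and hypotheses must have the same length")
    edits = 0
    words = 0
    for ref, hyp in zip(references, hypotheses):
        ref_words, hyp_words = ref.split(), hyp.split()
        edits += word_edit_distance(ref_words, hyp_words)
        words += len(ref_words)
    if words == 0:
        raise ValueError("references contain no words")
    return edits / words


# One substitution (fox -> dog) plus one insertion (jumps) over 4 reference words.
print(corpus_wer(["the quick brown fox"], ["the quick brown dog jumps"]))  # → 0.5
```

Under this definition, a threshold such as 15 percent corresponds to `corpus_wer(...) > 0.15` on the held-out sentences.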

Figures

Figures reproduced from arXiv: 2511.21740 by Chaofei Fan, Francis R Willett, Han Yu, Jingyuan Li, Lea Duncker, Liam Paninski, Linyang He, Nima Mesgarani, Scott Linderman, Tingkai Liu, Trung Le, Yizi Zhang.

**Figure 1.** Schematic illustration of BIT. (A) BIT is an end-to-end speech decoding framework that translates neural activity directly into text by combining a cross-task, cross-species pretrained neural encoder with an audio-LLM decoder. The data are obtained and preprocessed separately for each study (Appendix A). (B) The neural encoder is a transformer that embeds 20 ms bins of thresholded spikes and spike-band p…

**Figure 2.** Benchmarking BIT versus baselines in attempted and imagined speech decoding. (A) For attempted speech, the pretrained encoder (BIT-Human, BIT-All) outperforms RNN and BIT-TFS using both cascaded and end-to-end approaches. Bar plots show mean WER across competition holdout sentences. (B) For imagined speech (50-word vocabulary), BIT-All outperforms all other baselines in both cascaded and end-to-end setting…

**Figure 3.** LLM decoder ablation across modality, model size, prompt design, and contrastive learning usage. (A) For audio-LLMs, neural activity can be treated as either a neural or an audio modality. For neural modality, encoder outputs are projected directly into the text embedding space via an MLP projector. For audio modality, neural encoder outputs pass through the MLP projector followed by a multimodal projector…

**Figure 4.** BIT aligns attempted and imagined speech neural embeddings to enable cross-task generalization. (A) Representational similarity analysis (RSA) scores between neural and audio-LLM text embeddings. (B) PCA embeddings of neural features from participant T12 are visualized on the first two PCs. Word-level embeddings are averaged across time and trials and shown as dots. The same words are shown for both tasks…

**Figure 5.** Distribution of neural token lengths across sentences for RSA. We restrict RSA to sentences with token lengths between 45 and 80 (mean length ≈ 63) for participant T12 and between 120 and 200 (mean length ≈ 160) for participant T15, since neural embeddings are converted into fixed-length sentence vectors by dividing each sequence into ten temporal segments and concatenating their averages. Sequences that a…

**Figure 6.** Phoneme-level decoding error matrix. Predicted and ground truth phoneme sequences are aligned, correct matches are removed, and all remaining errors are normalized to percentages. Phonemes are ordered by total error frequency, with true phonemes on the horizontal axis and predicted phonemes on the vertical axis. The visualization highlights dominant substitution patterns and systematic decoding errors. Bas…

**Figure 7.** Word-level decoding error matrix. (a) Word-level confusion matrix computed over all decoding errors. Words are arranged alphabetically from a to z for both axes. Because the vocabulary is large, the full matrix must be viewed at a very large scale to see individual pixels clearly. A clear diagonal structure appears, indicating that most decoding mistakes occur between words with similar spellings or phonol…

**Figure 8.** Impact of progressively increasing the proportion of human versus monkey pretraining data on attempted-speech decoding performance for participant T15. Across both cascaded and end-to-end models, “human” corresponds to the fraction of human pretraining data progressively added, whereas “monkey” indicates the fraction of monkey pretraining data added on top of the human data. …
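The RSA procedure referenced in Figures 4 and 5 can be sketched in a few lines. Per the Figure 5 caption, each variable-length sequence of neural embeddings is split into ten temporal segments whose averages are concatenated; per the paper's extracted methods text, dissimilarity is one minus cosine similarity, and the RSA score is the Pearson correlation between the upper-triangular entries of the neural and LLM dissimilarity matrices. The integer-boundary splitting rule, all function names, and the toy inputs are assumptions of this sketch, not the authors' implementation.

```python
import math


def fixed_length_vector(sequence, n_segments=10):
    """Average a (T x D) token sequence within contiguous segments, then concatenate."""
    t = len(sequence)
    if t < n_segments:
        raise ValueError("sequence is shorter than the number of segments")
    dim = len(sequence[0])
    out = []
    for k in range(n_segments):
        # Assumed splitting rule: integer boundaries at k * T // n_segments.
        chunk = sequence[k * t // n_segments:(k + 1) * t // n_segments]
        out.extend(sum(tok[d] for tok in chunk) / len(chunk) for d in range(dim))
    return out


def cosine_distance(u, v):
    """One minus cosine similarity; undefined for zero-norm vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm


def rdm_upper_triangle(vectors):
    """Upper-triangular entries of the representational dissimilarity matrix."""
    n = len(vectors)
    return [cosine_distance(vectors[i], vectors[j])
            for i in range(n) for j in range(i + 1, n)]


def pearson(x, y):
    """Pearson correlation; undefined if either input is constant."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def rsa_score(neural_vectors, text_vectors):
    """Correlate the geometry of two embedding sets over the same sentences."""
    return pearson(rdm_upper_triangle(neural_vectors),
                   rdm_upper_triangle(text_vectors))


# Toy check: 20 one-dimensional tokens collapse to 10 segment means.
toy_sequence = [[float(i)] for i in range(20)]
print(fixed_length_vector(toy_sequence)[:3])  # → [0.5, 2.5, 4.5]
```

Segment averaging is why the caption restricts token lengths to a band: sequences of very different lengths would produce segments covering very different durations.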

Abstract

Speech brain-computer interfaces (BCIs) aim to restore communication for people with paralysis by translating neural activity into text. Most systems use cascaded frameworks that decode phonemes before assembling sentences with an n-gram language model (LM), preventing joint optimization of all stages simultaneously. Here, we introduce an end-to-end BraIn-to-Text (BIT) framework that translates neural activity into coherent sentences using a single differentiable neural network. Central to our approach is a cross-task, cross-species pretrained neural encoder, whose representations transfer to both attempted and imagined speech. In a cascaded setting with an n-gram LM, the pretrained encoder establishes a new state-of-the-art (SOTA) on the Brain-to-Text '24 and '25 benchmarks. Integrated end-to-end with audio large language models (LLMs) and trained with contrastive learning for cross-modal alignment, BIT reduces the word error rate (WER) of the prior end-to-end method from 24.69% to 10.22%. Notably, we find that small-scale audio LLMs markedly improve end-to-end decoding. Beyond record-setting performance, BIT aligns attempted and imagined speech embeddings to enable cross-task generalization. Altogether, our approach advances the integration of large, diverse neural datasets, paving the way for an end-to-end decoding framework that supports seamless, differentiable optimization.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

This paper cuts end-to-end WER from 24.69% to 10.22% on Brain-to-Text benchmarks by pretraining a neural encoder across species and aligning it contrastively to audio LLMs, but the specific value of the cross-species step still needs clearer isolation.

The headline result is a solid drop in word error rate for end-to-end neural speech decoding. Starting from a prior end-to-end baseline at 24.69% WER, the authors reach 10.22% by combining a cross-species pretrained encoder with contrastive alignment to audio large language models. They also claim a new SOTA in the cascaded setting with an n-gram LM and show that the same embeddings can be aligned across attempted and imagined speech for cross-task generalization.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces BIT, an end-to-end BraIn-to-Text framework for speech brain-computer interfaces that translates neural activity into coherent sentences via a single differentiable network. It centers on a cross-task, cross-species pretrained neural encoder whose representations are claimed to transfer to both attempted and imagined speech; in cascaded settings with an n-gram LM this yields a new SOTA on the Brain-to-Text '24 and '25 benchmarks, while end-to-end integration with audio LLMs via contrastive cross-modal alignment reduces WER from 24.69% to 10.22%.

Significance. If the transfer and attribution claims hold, the work would be significant for BCI research by demonstrating joint optimization across decoding stages and effective use of diverse cross-species neural data. The integration of small-scale audio LLMs, contrastive alignment, and cross-task embedding alignment for attempted/imagined speech generalization are concrete strengths that could support more robust, scalable systems.

major comments (2)

[Abstract] The headline claim that the cross-species pretrained neural encoder supplies representations that transfer effectively to human attempted and imagined speech recordings (and thereby drive the reported WER drop) is load-bearing, yet no ablation (e.g., frozen vs. fine-tuned encoder, single-species vs. cross-species pretraining) or domain-shift metric (e.g., embedding similarity across species or modalities) is supplied to isolate its contribution from the contrastive LLM alignment or end-to-end differentiability.
[Abstract] The specific WER reduction (24.69% to 10.22%) and new SOTA statements are presented without reference to data splits, statistical significance tests, run-to-run variance, or the exact prior end-to-end baseline paper, preventing verification that the gains are reproducible and attributable to the described components.

minor comments (2)

[Abstract] The abstract mentions 'Brain-to-Text '24 and '25 benchmarks' and 'prior end-to-end method' without citing the specific references or dataset papers.
Notation for the overall BIT architecture, contrastive loss, and cross-modal alignment objective would benefit from an explicit equation or high-level diagram for clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and for recognizing the potential significance of the BIT framework, particularly the cross-species pretraining, end-to-end differentiability, and contrastive alignment with audio LLMs. We address each major comment below with specific plans for revision. Our responses focus on clarifying and strengthening the manuscript without overstating current results.

Referee: [Abstract] The headline claim that the cross-species pretrained neural encoder supplies representations that transfer effectively to human attempted and imagined speech recordings (and thereby drive the reported WER drop) is load-bearing, yet no ablation (e.g., frozen vs. fine-tuned encoder, single-species vs. cross-species pretraining) or domain-shift metric (e.g., embedding similarity across species or modalities) is supplied to isolate its contribution from the contrastive LLM alignment or end-to-end differentiability.

Authors: We agree that explicit isolation of the pretrained encoder's contribution strengthens the claims. The full manuscript reports performance gains on attempted and imagined speech tasks when using the cross-species encoder, but does not include the requested ablations or quantitative domain-shift metrics. We will add these in the revised version: (1) frozen vs. fine-tuned encoder comparisons, (2) single-species vs. cross-species pretraining ablations, and (3) embedding similarity metrics (e.g., cosine similarity and domain discrepancy measures) across species and modalities. These will be placed in a new subsection of the experiments to better attribute gains to the encoder versus contrastive alignment or end-to-end training.
revision: yes

Referee: [Abstract] The specific WER reduction (24.69% to 10.22%) and new SOTA statements are presented without reference to data splits, statistical significance tests, run-to-run variance, or the exact prior end-to-end baseline paper, preventing verification that the gains are reproducible and attributable to the described components.

Authors: We acknowledge this omission limits immediate verifiability. The manuscript references the Brain-to-Text '24 and '25 benchmarks (which use fixed public splits), but does not detail them in the abstract or results, nor include significance tests or variance. We will revise to explicitly state the data splits, report run-to-run standard deviation over multiple random seeds, include statistical significance (e.g., paired t-tests), and cite the precise prior end-to-end baseline paper. These details will appear in the results section, with a brief mention added to the abstract.
revision: yes
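The reporting promised in the second response (standard deviation across seeds and a paired t-test) could look like the following minimal sketch. It uses only the Python standard library and returns the t statistic with its degrees of freedom; converting that to a p-value needs a t-distribution table or a statistics package. The function names are hypothetical, and the input numbers are synthetic values chosen for easy arithmetic, not results from the paper.

```python
import math
import statistics


def summarize_runs(scores):
    """Mean and sample standard deviation of a metric across random seeds."""
    return statistics.mean(scores), statistics.stdev(scores)


def paired_t_statistic(scores_a, scores_b):
    """Paired t statistic and degrees of freedom for matched measurements.

    Each position must refer to the same unit (e.g. the same seed or the
    same evaluation sentence) under the two systems being compared.
    """
    if len(scores_a) != len(scores_b) or len(scores_a) < 2:
        raise ValueError("need two equally sized samples with at least 2 pairs")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    spread = statistics.stdev(diffs)
    if spread == 0:
        raise ValueError("differences have zero variance; t is undefined")
    n = len(diffs)
    return statistics.mean(diffs) / (spread / math.sqrt(n)), n - 1


# Synthetic placeholder values, NOT results from the paper.
system_a = [3.0, 5.0, 4.0]
system_b = [1.0, 2.0, 3.0]
print(summarize_runs(system_b))  # → (2.0, 1.0)
t_stat, dof = paired_t_statistic(system_a, system_b)
print(round(t_stat, 4), dof)  # → 3.4641 2
```

A paired test is appropriate here because both systems are evaluated on the same fixed benchmark splits, so per-unit differences remove shared variance.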

Circularity Check

0 steps flagged

Performance reported against external benchmarks and prior baselines; derivation chain contains no self-referential reductions or load-bearing self-citations.

The abstract and reported results compare WER (24.69% to 10.22%) and SOTA status directly to external Brain-to-Text '24/'25 benchmarks and a prior end-to-end method. The cross-species encoder is described as transferring representations, but this is an empirical claim evaluated on held-out human data rather than a quantity defined in terms of itself. No equations, fitted parameters renamed as predictions, or self-citation chains that close the central argument are present in the provided text. The approach is therefore self-contained against external benchmarks, consistent with a low circularity finding.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 0 invented entities

The central performance claims rest on standard supervised and contrastive neural-network training assumptions plus the transferability of representations across species and tasks; no new physical entities or ad-hoc constants are introduced beyond typical deep-learning hyperparameters.

free parameters (2)

contrastive loss temperature and weighting
Hyperparameters controlling alignment strength between neural and audio embeddings, chosen during training.
pretraining dataset mixing ratios across species and tasks
Weights used to combine neural recordings from different animals and experimental paradigms.
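To make the first free parameter concrete, here is a minimal sketch of a temperature-scaled contrastive objective and its weighting against a decoding loss. It assumes a CLIP-style symmetric InfoNCE form over matched (neural, text) embedding pairs in a batch; the exact loss used in the paper is not given on this page, and the names `temperature` and `weight` and their default values are illustrative assumptions.

```python
import math


def l2_normalize(vector):
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def log_softmax(row):
    # Subtract the max before exponentiating for numerical stability.
    peak = max(row)
    log_sum = peak + math.log(sum(math.exp(x - peak) for x in row))
    return [x - log_sum for x in row]


def symmetric_info_nce(neural, text, temperature=0.07):
    """neural[i] and text[i] form a positive pair; other batch items are negatives."""
    n = [l2_normalize(v) for v in neural]
    t = [l2_normalize(v) for v in text]
    size = len(n)
    # Cosine similarities scaled by the temperature (assumed default value).
    logits = [[sum(a * b for a, b in zip(n[i], t[j])) / temperature
               for j in range(size)] for i in range(size)]
    # Neural-to-text direction: row i should select column i.
    loss_n2t = -sum(log_softmax(logits[i])[i] for i in range(size)) / size
    # Text-to-neural direction: column j should select row j.
    columns = [[logits[i][j] for i in range(size)] for j in range(size)]
    loss_t2n = -sum(log_softmax(columns[j])[j] for j in range(size)) / size
    return 0.5 * (loss_n2t + loss_t2n)


def total_loss(decoding_loss, contrastive_loss, weight=1.0):
    """Weighted sum of the two objectives; `weight` is the second hyperparameter."""
    return decoding_loss + weight * contrastive_loss


# Toy batch: correctly paired embeddings score lower than swapped ones.
pairs = [[1.0, 0.0], [0.0, 1.0]]
aligned = symmetric_info_nce(pairs, pairs)
swapped = symmetric_info_nce(pairs, [pairs[1], pairs[0]])
print(aligned < swapped)  # → True
```

A lower temperature sharpens the softmax and penalizes near-miss negatives more strongly, which is why it acts as a tunable knob on alignment strength.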

axioms (2)

domain assumption: Neural activity patterns share transferable statistical structure across species and between attempted versus imagined speech.
Invoked to justify cross-species pretraining and cross-task alignment.
standard math: Standard back-propagation and stochastic gradient descent converge to useful representations for this decoding task.
Background assumption for all neural network training described.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

transformer neural encoder pretrained with self-supervised masked modeling on 367 hours of Utah array recordings... trained with contrastive learning for cross-modal alignment

IndisputableMonolith/Foundation/AlexanderDuality.lean · alexander_duality_circle_linking · unclear

cross-species, cross-task pretrained neural encoder... 8-tick periodic micro-structure absent

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

DANCE: Detect and Classify Events in EEG
cs.LG · 2026-05 · unverdicted · novelty 6.0

DANCE frames EEG event identification as a set-prediction problem to jointly detect and classify events directly from raw, unaligned signals, outperforming existing methods on seizure monitoring and matching onset-inf...

MoDAl: Self-Supervised Neural Modality Discovery via Decorrelation for Speech Neuroprosthesis
q-bio.NC · 2026-04 · unverdicted · novelty 6.0

MoDAl discovers complementary neurolinguistic modalities via contrastive-decorrelation objectives, cutting brain-to-text word error rate from 26.3% to 21.6% by incorporating area 44 signals.

Reference graph

Works this paper leans on

25 extracted references · 25 canonical work pages · cited by 2 Pith papers · 8 internal anchors

[1] Yunfei Chu, Jin Xu, Qian Yang, Haojie Wei, Xipin Wei, Zhifang Guo, Yichong Leng, Yuanjun Lv, Jinzheng He, Junyang Lin, et al. Qwen2-Audio technical report. arXiv preprint arXiv:2407.10759.

[2] Ebrahim Feghhi, Shreyas Kaasyap, Nima Hadidi, and Jonathan C Kao. Time-masked transformers with lightweight test-time adaptation for neural speech decoding. arXiv preprint arXiv:2507.02800.

[3] Sheng Feng, Heyang Liu, Yu Wang, and Yanfeng Wang. Towards an end-to-end framework for invasive brain signal decoding with large language models. arXiv preprint arXiv:2406.11568.

[4] The Curious Case of Neural Text Degeneration. URL https://arxiv.org/abs/1904.09751. Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen, et al. LoRA: Low-rank adaptation of large language models. ICLR, 1(2):3.

[5] doi: 10.1101/2022.04.06.487388. URL https://www.biorxiv.org/content/10.1101/2022.04, 6:v2. Brianna M Karpowicz, Joel Ye, Chaofei Fan, Pablo Tostado-Marcos, Fabio Rizzoglio, Clay Washington, Thiago Scodeler, Diogo de Lucena, Samuel R Nason-Tomaszewski, Matthew J Mender, et al. Few-shot algorithms for consistent neural decoding (FALCON) benchmark. Advanc...

[6] Trung Le, Hao Fang, Jingyuan Li, Tung Nguyen, Lu Mi, Amy Orsborn, Uygar Sümbül, and Eli Shlizerman. SPINT: Spatial permutation-invariant neural transformer for consistent intracortical motor decoding. arXiv preprint arXiv:2507.08402.

[8] BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models. URL https://arxiv.org/abs/2301.12597. Richard Liaw, Eric Liang, Robert Nishihara, Philipp Moritz, Joseph E Gonzalez, and Ion Stoica. Tune: A research platform for distributed model selection and training. In Proceedings of the 2018 Workshop on ML Systems at NeurIPS.

[9] Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. LLaVA: Large language and vision assistant. arXiv preprint arXiv:2304.08485, 2023a. URL https://arxiv.org/abs/2304.08485. Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. Advances in Neural Information Processing Systems, 36:34892–34916, 2023b. Ilya Los...

[10] Konstantin-Klemens Lurz, Mohammad Bashiri, Konstantin Willeke, Akshay K Jagadish, Eric Wang, Edgar Y Walker, Santiago A Cadena, Taliah Muhammad, Erick Cobos, Andreas S Tolias, et al. Generalization in data-driven models of primary visual cortex. bioRxiv, pp. 2020–10.

[11] Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748.

[12] Qwen2.5 Technical Report. URL https://arxiv.org/abs/2412.15115. Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. arXiv preprint arXiv:2103.00020.

[13] Avery Hee-Woon Ryoo, Nanda H Krishna, Ximeng Mao, Mehdi Azabou, Eva L Dyer, Matthew G Perich, and Guillaume Lajoie. Generalizable, real-time neural decoding with hybrid state-space models. arXiv preprint arXiv:2506.05320.

[14] Paul S Scotti, Mihir Tripathy, Cesar Kadir Torrico Villanueva, Reese Kneeland, Tong Chen, Ashutosh Narang, Charan Santhirasegaran, Jonathan Xu, Thomas Naselaris, Kenneth A Norman, et al. MindEye2: Shared-subject models enable fMRI-to-image with 1 hour of data. arXiv preprint arXiv:2403.11207, 2024.

[15] Francis R Willett, Jingyuan Li, Trung Le, Chaofei Fan, Mingfei Chen, Eli Shlizerman, Yue Chen, Xin Zheng, Tatsuo S Okubo, Tyler Benster, et al. Brain-to-text benchmark '24: Lessons learned. arXiv preprint arXiv:2412.17227.

[16] An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388.

[17] Han Yu, Hanrui Lyu, Ethan Yixun Xu, Charlie Windolf, Eric Kenji Lee, Fan Yang, Andrew M Shelton, Shawn Olsen, Sahar Minavi, Olivier Winter, et al. In vivo cell-type and brain region classification via multimodal contrastive learning. bioRxiv, pp. 2024–11.

[19] CoCa: Contrastive Captioners are Image-Text Foundation Models. URL https://arxiv.org/abs/2205.01917. Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen, Christopher Dewan, Mona Diab, Xian Li, Xi Victoria Lin, et al. OPT: Open pre-trained transformer language models. arXiv preprint arXiv:2205.01068.

[20] Yizi Zhang, Yanchen Wang, Mehdi Azabou, Alexandre Andre, Zixuan Wang, Hanrui Lyu, The International Brain Laboratory, Eva Dyer, Liam Paninski, and Cole Hurwitz. Neural encoding and decoding at scale. arXiv preprint arXiv:2504.08201.

[21] A DATASET DETAILS. This section documents the studies and sources of all datasets used for pretraining, as well as for attempted and imagined speech decoding, and provides a brief description of each. A.1 HUMAN DATA. Willett et al. (2021). This dataset contains recordings from a participant with hand paralysis who attempted and imagined handwriti...

[22]

| Neural Encoder | Attempted Speech, T12 | Attempted Speech, T15 | Imagined Speech, T12 | Imagined Speech, T15 |
|---|---|---|---|---|
| RNN | 18.67% | 9.64% | 30.81% | 24.56% |
| BIT-TFS | 17.26% | 8.87% | 25.01% | 21.21% |
| BIT-Human | 15.95% | 7.61% | 19.63% | 18.83% |
| BIT-All | 14.39% | 7.12% | 18.08% | 17.94% |
| BIT-Cross-Task-Only | – | – | 20.46% | 19.58% |

Table 4: Phoneme decoding benchmark. The metrics shown are the validation PER. In the end-to-end model, phonemes serve ...

[23] is a mid-sized LLM trained with an improved pretraining pipeline and instruction tuning. We also include Qwen2.5-7B. To examine the effect of model size and architectural advances, we include two recent models from the Qwen3 series: Qwen3-0.6B (Yang et al., 2025), a compact model optimized for efficiency, and Qwen3-1.7B, a larger variant designed to provide...

[24] UCD-NPL causal RNN + 5gram

extends Qwen2.5-1.5B with an audio front-end, enabling the model to process acoustic representations alongside text. To understand the effect of model scale in the audio domain, we also include Qwen2-Audio 7B (Chu et al., 2024), a larger audio-based LLM capturing richer acoustic and semantic features. Comparing Aero1-Audio 1.5B and Qwen2-Audio 7B with text-...

[25] using one minus the cosine similarity. To compare representational structures, we extract the upper-triangular entries of each RDM and compute the Pearson correlation coefficient between neural and LLM RDMs. The resulting RSA score quantifies how well the geometry of neural embeddings aligns with language structures in LLMs. Framing RSA as an interpreta...

[26] When controlling for data size, we find no substantial performance difference between SL and SSL pretraining for imagined speech decoding.

| Model | T12 (Cascaded) | T12 (End-to-End) |
|---|---|---|
| BIT-Cross-Task-Only | 12.53% | 15.71% |
| BIT-SameParticipant-SSL | 12.67% | 15.64% |

Table 9: Impact of SL versus SSL pretraining using equal amounts of human speech data on imagined speech decoding p...

[27] speech data from participants T12 and T15

to randomly sample 30 optimizer hyperparameter (batch size, weight decay, and learning rate) combinations from the ranges listed in Table 12, using attempted

| Hyperparameter | Value |
|---|---|
| Embedding Dimension | 384 |
| Head Dimension | 512 |
| Number of Heads | 6 |
| Depth | 7 |
| Mask Ratio | 0.5 (T12) and 0 (T15) |
| Max Mask Time Span | 15 |
| Patch Size | 5 |
| Dropout Rate | 0.2 |

Bidirectional...

[1] [1]

Qwen2-Audio Technical Report

Yunfei Chu, Jin Xu, Qian Yang, Haojie Wei, Xipin Wei, Zhifang Guo, Yichong Leng, Yuanjun Lv, Jinzheng He, Junyang Lin, et al. Qwen2-audio technical report.arXiv preprint arXiv:2407.10759,

work page internal anchor Pith review Pith/arXiv arXiv

[2] [2]

Time-masked trans- formers with lightweight test-time adaptation for neural speech decoding.arXiv preprint arXiv:2507.02800,

Ebrahim Feghhi, Shreyas Kaasyap, Nima Hadidi, and Jonathan C Kao. Time-masked trans- formers with lightweight test-time adaptation for neural speech decoding.arXiv preprint arXiv:2507.02800,

work page arXiv

[3] [3]

Towards an end-to-end framework for invasive brain signal decoding with large language models.arXiv preprint arXiv:2406.11568,

Sheng Feng, Heyang Liu, Yu Wang, and Yanfeng Wang. Towards an end-to-end framework for invasive brain signal decoding with large language models.arXiv preprint arXiv:2406.11568,

work page arXiv

[4] [4]

The Curious Case of Neural Text Degeneration

URL https://arxiv.org/abs/1904.09751. Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen, et al. Lora: Low-rank adaptation of large language models.ICLR, 1(2):3,

work page internal anchor Pith review Pith/arXiv arXiv 1904

[5] [5]

06.487388.URL https://www

doi: 10.1101/2022.04. 06.487388.URL https://www. biorxiv. org/content/10.1101/2022.04, 6:v2. Brianna M Karpowicz, Joel Ye, Chaofei Fan, Pablo Tostado-Marcos, Fabio Rizzoglio, Clay Wash- ington, Thiago Scodeler, Diogo de Lucena, Samuel R Nason-Tomaszewski, Matthew J Mender, et al. Few-shot algorithms for consistent neural decoding (falcon) benchmark.Advanc...

work page doi:10.1101/2022.04 2022

[6] [6]

Spint: Spatial permutation-invariant neural transformer for consistent intracortical motor decoding.arXiv preprint arXiv:2507.08402,

Trung Le, Hao Fang, Jingyuan Li, Tung Nguyen, Lu Mi, Amy Orsborn, Uygar S ¨umb¨ul, and Eli Shlizerman. Spint: Spatial permutation-invariant neural transformer for consistent intracortical motor decoding.arXiv preprint arXiv:2507.08402,

work page arXiv

[7] [8]

BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models

URLhttps://arxiv.org/abs/2301.12597. Richard Liaw, Eric Liang, Robert Nishihara, Philipp Moritz, Joseph E Gonzalez, and Ion Stoica. Tune: A research platform for distributed model selection and training. InProceedings of the 2018 Workshop on ML Systems at NeurIPS,

work page internal anchor Pith review Pith/arXiv arXiv 2018

[8] [9]

Visual Instruction Tuning

Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Llava: Large language and vision as- sistant.arXiv preprint arXiv:2304.08485, 2023a. URLhttps://arxiv.org/abs/2304. 08485. Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning.Advances in neural information processing systems, 36:34892–34916, 2023b. 11 Preprint Ilya Los...


[9] [10]

Generalization in data-driven models of primary visual cortex

Konstantin-Klemens Lurz, Mohammad Bashiri, Konstantin Willeke, Akshay K Jagadish, Eric Wang, Edgar Y Walker, Santiago A Cadena, Taliah Muhammad, Erick Cobos, Andreas S Tolias, et al. Generalization in data-driven models of primary visual cortex. BioRxiv, pp. 2020–10,


[10] [11]

Representation Learning with Contrastive Predictive Coding

Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive predictive coding. In arXiv preprint arXiv:1807.03748,


[11] [12]

Qwen2.5 Technical Report

URL https://arxiv.org/abs/2412.15115. Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. arXiv preprint arXiv:2103.00020,


[12] [13]

Generalizable, real-time neural decoding with hybrid state-space models

Avery Hee-Woon Ryoo, Nanda H Krishna, Ximeng Mao, Mehdi Azabou, Eva L Dyer, Matthew G Perich, and Guillaume Lajoie. Generalizable, real-time neural decoding with hybrid state-space models. arXiv preprint arXiv:2506.05320,


[13] [14]

Mindeye2: Shared-subject models enable fmri-to-image with 1 hour of data

Paul S Scotti, Mihir Tripathy, Cesar Kadir Torrico Villanueva, Reese Kneeland, Tong Chen, Ashutosh Narang, Charan Santhirasegaran, Jonathan Xu, Thomas Naselaris, Kenneth A Norman, et al. Mindeye2: Shared-subject models enable fmri-to-image with 1 hour of data. arXiv preprint arXiv:2403.11207, 2024.


[14] [15]

Brain-to-text benchmark’24: Lessons learned

Francis R Willett, Jingyuan Li, Trung Le, Chaofei Fan, Mingfei Chen, Eli Shlizerman, Yue Chen, Xin Zheng, Tatsuo S Okubo, Tyler Benster, et al. Brain-to-text benchmark’24: Lessons learned. arXiv preprint arXiv:2412.17227,


[15] [16]

Qwen3 Technical Report

An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388,


[16] [17]

In vivo cell-type and brain region classification via multimodal contrastive learning

Han Yu, Hanrui Lyu, Ethan Yixun Xu, Charlie Windolf, Eric Kenji Lee, Fan Yang, Andrew M Shelton, Shawn Olsen, Sahar Minavi, Olivier Winter, et al. In vivo cell-type and brain region classification via multimodal contrastive learning. bioRxiv, pp. 2024–11,


[17] [19]

CoCa: Contrastive Captioners are Image-Text Foundation Models

URL https://arxiv.org/abs/2205.01917. Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen, Christopher Dewan, Mona Diab, Xian Li, Xi Victoria Lin, et al. Opt: Open pre-trained transformer language models. arXiv preprint arXiv:2205.01068,


[18] [20]

Neural encoding and decoding at scale

Yizi Zhang, Yanchen Wang, Mehdi Azabou, Alexandre Andre, Zixuan Wang, Hanrui Lyu, The International Brain Laboratory, Eva Dyer, Liam Paninski, and Cole Hurwitz. Neural encoding and decoding at scale. arXiv preprint arXiv:2504.08201,


[19] [21]

A.1 HUMAN DATA Willett et al

A DATASET DETAILS

This section documents the studies and sources of all datasets used for pretraining, as well as for attempted and imagined speech decoding, and provides a brief description of each.

A.1 HUMAN DATA

Willett et al. (2021). This dataset contains recordings from a participant with hand paralysis who attempted and imagined handwriti...


[20] [22]

Neural Encoder         Attempted Speech      Imagined Speech
                       T12       T15         T12       T15
RNN                    18.67%    9.64%       30.81%    24.56%
BIT-TFS                17.26%    8.87%       25.01%    21.21%
BIT-Human              15.95%    7.61%       19.63%    18.83%
BIT-All                14.39%    7.12%       18.08%    17.94%
BIT-Cross-Task-Only    –         –           20.46%    19.58%

Table 4: Phoneme decoding benchmark. The metrics shown are the validation PER. In the end-to-end model, phonemes serve ...


[21] [23]

We also include Qwen2.5-7B

is a mid-sized LLM trained with an improved pretraining pipeline and instruction tuning. We also include Qwen2.5-7B. To examine the effect of model size and architectural advances, we include two recent models from the Qwen3 series: Qwen3-0.6B (Yang et al., 2025), a compact model optimized for efficiency, and Qwen3-1.7B, a larger variant designed to provide...


[22] [24]

UCD-NPL causal RNN + 5gram

extends Qwen2.5-1.5B with an audio front-end, enabling the model to process acoustic representations alongside text. To understand the effect of model scale in the audio domain, we also include Qwen2-Audio 7B (Chu et al., 2024), a larger audio-based LLM capturing richer acoustic and semantic features. Comparing Aero1-Audio 1.5B and Qwen2-Audio 7B with text-...


[23] [25]

To compare representational structures, we extract the upper-triangular entries of each RDM and compute the Pearson correlation coefficient between neural and LLM RDMs

using one minus the cosine similarity. To compare representational structures, we extract the upper-triangular entries of each RDM and compute the Pearson correlation coefficient between neural and LLM RDMs. The resulting RSA score quantifies how well the geometry of neural embeddings aligns with language structures in LLMs. Framing RSA as an interpreta...
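The RSA computation described in this excerpt can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function names `cosine_rdm` and `rsa_score` are hypothetical, the two embedding matrices are assumed to have one row per shared stimulus (for example, per sentence), and the diagonal is assumed to be excluded from the upper-triangular entries because it is zero by construction.

```python
import numpy as np


def cosine_rdm(embeddings):
    """Representational dissimilarity matrix: one minus cosine similarity between rows."""
    x = np.asarray(embeddings, dtype=float)
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    unit = x / np.clip(norms, 1e-12, None)  # guard against all-zero rows
    return 1.0 - unit @ unit.T


def rsa_score(neural_emb, llm_emb):
    """Pearson correlation between the upper-triangular entries of two RDMs."""
    rdm_neural = cosine_rdm(neural_emb)
    rdm_llm = cosine_rdm(llm_emb)
    if rdm_neural.shape != rdm_llm.shape:
        raise ValueError("both embedding sets must cover the same stimuli")
    # Assumption: k=1 drops the diagonal, which is trivially zero in both RDMs.
    upper = np.triu_indices(rdm_neural.shape[0], k=1)
    return float(np.corrcoef(rdm_neural[upper], rdm_llm[upper])[0, 1])


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    neural = rng.normal(size=(20, 384))  # placeholder neural embeddings
    llm = rng.normal(size=(20, 1024))    # placeholder LLM embeddings
    print(rsa_score(neural, llm))
```

The two embedding spaces may have different dimensionality, since only the stimulus-by-stimulus dissimilarity structure is compared.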


[24] [26]

When controlling for data size, we find no substantial performance difference between SL and SSL pretraining for imagined speech decoding.

                           T12 (Cascaded)    T12 (End-to-End)
BIT-Cross-Task-Only        12.53%            15.71%
BIT-SameParticipant-SSL    12.67%            15.64%

Table 9: Impact of SL versus SSL pretraining using equal amounts of human speech data on imagined speech decoding p...


[25] [27]

speech data from participants T12 and T15

to randomly sample 30 optimizer hyperparameter (batch size, weight decay, and learning rate) combinations from the ranges listed in Table 12, using attempted

Hyperparameter         Value
Embedding Dimension    384
Head Dimension         512
Number of Heads        6
Depth                  7
Mask Ratio             0.5 (T12) and 0 (T15)
Max Mask Time Span     15
Patch Size             5
Dropout Rate           0.2
Bidirectional...
