MEG-XL: Data-Efficient Brain-to-Text via Long-Context Pre-Training
Pith reviewed 2026-05-16 08:09 UTC · model grok-4.3
The pith
MEG models pre-trained on 2.5-minute contexts match supervised word decoding with one hour of labeled data.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
MEG-XL pre-trains on 2.5 minutes of MEG context per sample (191k tokens), 5-300 times longer than earlier work. When fine-tuned for word decoding from brain signals, it matches supervised performance with far less data and outperforms brain foundation models. Models pre-trained with longer contexts learn representations that transfer better to the decoding task, showing that extended neural context contains useful information that shorter-context methods discard.
What carries the argument
Long-context pre-training on MEG sequences spanning 2.5 minutes per sample, which captures extended neural dynamics for learning transferable priors.
If this is right
- Brain-to-text fine-tuning can reach supervised accuracy with roughly 50 times less task-specific data.
- Representations learned from longer pre-training contexts transfer more effectively to word decoding than those from short contexts.
- The resulting models outperform existing brain foundation models on data-efficiency metrics.
- Extended neural context over minutes carries information that prior short-window methods leave unused.
Where Pith is reading between the lines
- Clinical brain interfaces could require far less recording time per patient, lowering the barrier to deployment.
- The same long-context principle may improve data efficiency in related sequential signals such as EEG or ECoG.
- Further gains could come from testing contexts longer than 2.5 minutes or combining the pre-trained representations with other modalities.
Load-bearing premise
The extra information in long MEG contexts consists of useful generalizable statistical priors rather than noise or subject-specific artifacts.
What would settle it
An experiment that keeps all other factors fixed but shortens pre-training context length to a few seconds and finds no loss (or even a gain) in downstream word-decoding accuracy would falsify the central claim.
Figures
read the original abstract
Clinical brain-to-text interfaces are designed for paralysed patients who cannot provide extensive training recordings. Pre-training improves data-efficient generalisation by learning statistical priors across subjects, but these priors critically depend on context. While natural speech might unfold gradually over minutes, most methods pre-train with only a few seconds of context. Thus, we propose MEG-XL, a model pre-trained with 2.5 minutes of MEG context per sample, 5-300x longer than prior work, and equivalent to 191k tokens, capturing extended neural context. Fine-tuning on the task of word decoding from brain data, MEG-XL matches supervised performance with a fraction of the data (e.g. 1hr vs 50hrs) and outperforms brain foundation models. We find that models pre-trained with longer contexts learn representations that transfer better to word decoding. Our results indicate that long-context pre-training helps exploit extended neural context that other methods unnecessarily discard. Code, model weights, and instructions are available at https://github.com/neural-processing-lab/MEG-XL .
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces MEG-XL, a model for brain-to-text decoding pre-trained on 2.5-minute MEG contexts (191k tokens), 5-300x longer than prior work. It claims this long-context pre-training yields representations that transfer better to word decoding, matching fully supervised performance using only a fraction of the data (e.g., 1 h vs. 50 h) while outperforming existing brain foundation models.
Significance. If the empirical claims hold after addressing controls, the work would advance data-efficient clinical brain-computer interfaces by showing that extended temporal structure in MEG can supply transferable neural-language priors, substantially lowering the subject-specific recording burden. The public release of code and weights is a clear positive for the field.
major comments (3)
- [Abstract] Abstract: The central claim that longer contexts produce representations that 'transfer better' and match supervised performance with 1 h vs. 50 h data is load-bearing yet unsupported by any described ablation that isolates context length from subject identity, total data volume, or slow-drift artifacts in the 2.5-min windows.
- [Methods] Methods (pre-training setup): No cross-subject pre-training control or artifact-rejection protocol specific to the extended windows is reported; without these, it is impossible to determine whether the observed transfer gains reflect generalizable statistical priors or subject-specific non-stationarities.
- [Results] Results (word-decoding experiments): The headline data-efficiency comparison lacks reported statistical tests, subject-wise variance, exact data-split details, and baseline training protocols, preventing verification that the 1 h vs. 50 h equivalence is free of leakage or post-hoc selection.
minor comments (2)
- [Abstract] The equivalence of 2.5 min MEG to 191k tokens should be accompanied by the explicit sampling rate and tokenization scheme used.
- [Conclusion] The GitHub link is welcome, but the main text should include a short reproducibility checklist (hyperparameters, random seeds, exact preprocessing steps) rather than deferring entirely to the repository.
Simulated Author's Rebuttal
We thank the referee for their constructive feedback, which highlights important aspects of our experimental design and reporting. We address each major comment below and will revise the manuscript accordingly to improve clarity and reproducibility.
read point-by-point responses
-
Referee: [Abstract] The central claim that longer contexts produce representations that 'transfer better' and match supervised performance with 1 h vs. 50 h data is load-bearing yet unsupported by any described ablation that isolates context length from subject identity, total data volume, or slow-drift artifacts in the 2.5-min windows.
Authors: We acknowledge the need for clearer isolation of factors. The manuscript reports ablations (Section 4.3) that vary context length while holding total pre-training tokens fixed across conditions and using the same multi-subject pool. Slow-drift artifacts are mitigated via 0.1 Hz high-pass filtering and per-window detrending (Methods, Pre-processing). To strengthen this, we will add an explicit control table in the revision comparing matched-volume short vs. long contexts and include a dedicated paragraph on artifact handling. revision: yes
-
Referee: [Methods] No cross-subject pre-training control or artifact-rejection protocol specific to the extended windows is reported; without these, it is impossible to determine whether the observed transfer gains reflect generalizable statistical priors or subject-specific non-stationarities.
Authors: Pre-training pools data across 10 subjects with held-out subjects for fine-tuning (Methods, Dataset). Artifact rejection applies ICA uniformly to full 2.5-min segments as per the source dataset protocol. We will revise the Methods to explicitly label the setup as cross-subject, add a supplementary table of subject/session counts per split, and include a brief analysis of non-stationarity metrics across window lengths. revision: yes
-
Referee: [Results] The headline data-efficiency comparison lacks reported statistical tests, subject-wise variance, exact data-split details, and baseline training protocols, preventing verification that the 1 h vs. 50 h equivalence is free of leakage or post-hoc selection.
Authors: We agree these reporting elements are essential. The revision will add paired t-tests across subjects with p-values in Table 2, subject-wise variance as error bars in Figure 3, exact per-subject splits (70/15/15 train/val/test with no temporal overlap), and complete baseline hyperparameters in Appendix B. These changes will allow direct verification of the data-efficiency results. revision: yes
Circularity Check
No significant circularity; empirical results stand independently
full rationale
The paper's claims rest entirely on empirical pre-training and fine-tuning experiments comparing MEG-XL to supervised baselines and other foundation models. No mathematical derivation, fitted parameter renamed as prediction, or self-citation chain is invoked to justify the core result that longer contexts improve transfer. The abstract and described methodology contain no equations or uniqueness theorems that reduce the reported performance gains to the inputs by construction. This is the expected non-finding for a data-driven ML paper whose validity hinges on experimental controls rather than deductive steps.
Axiom & Free-Parameter Ledger
free parameters (1)
- context length
axioms (1)
- domain assumption Longer MEG context windows contain useful statistical priors that transfer to word decoding
Reference graph
Works this paper leans on
-
[1]
Avramidis, K., Feng, T., Jeong, W., Lee, J., Cui, W., Leahy, R. M., and Narayanan, S. Neural codecs as biosignal tokenizers.arXiv preprint arXiv:2510.09095,
-
[2]
wav2vec 2.0: A framework for self-supervised learning of speech representations
Baevski, A., Zhou, Y ., Mohamed, A., and Auli, M. wav2vec 2.0: A framework for self-supervised learning of speech representations. InAdvances in Neural Information Pro- cessing Systems 33: Annual Conference on Neural Infor- mation Processing Systems 2020, NeurIPS 2020, Decem- ber 6-12, 2020, virtual,
work page 2020
-
[3]
Longformer: The Long-Document Transformer
Beltagy, I., Peters, M. E., and Cohan, A. Long- former: The long-document transformer.arXiv preprint arXiv:2004.05150,
work page internal anchor Pith review Pith/arXiv arXiv 2004
-
[4]
AudioLM: A language modeling approach to audio generation.IEEE/ACM Trans
Borsos, Z., Sharifi, M., Vincent, D., Kharitonov, E., Zeghi- dour, N., and Tagliasacchi, M. SoundStorm: Efficient par- allel audio generation.arXiv preprint arXiv:2305.09636,
-
[5]
Brown, T. B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-V oss, A., Krueger, G., Henighan, T., Child, R., Ramesh, A., Ziegler, D. M., Wu, J., Winter, C., Hesse, C., Chen, M., Sigler, E., Litwin, M., Gray, S., Chess, B., Clark, J., Berner, C., McCandlish, S., Radford,...
work page 2020
-
[6]
Dai, Z., Yang, Z., Yang, Y ., Carbonell, J. G., Le, Q. V ., and Salakhutdinov, R. Transformer-XL: Attentive language models beyond a fixed-length context. InProceedings of the 57th Conference of the Association for Computa- tional Linguistics, ACL 2019, Florence, Italy, July 28- August 2, 2019, Volume 1: Long Papers, pp. 2978–2988. Association for Computa...
work page 2019
-
[7]
High fidelity neural audio compression.Trans
9 MEG-XL: Data-Efficient Brain-to-Text via Long-Context Pre-Training D´efossez, A., Copet, J., Synnaeve, G., and Adi, Y . High fidelity neural audio compression.Trans. Mach. Learn. Res., 2023,
work page 2023
-
[8]
BERT: pre-training of deep bidirectional transformers for lan- guage understanding
Devlin, J., Chang, M., Lee, K., and Toutanova, K. BERT: pre-training of deep bidirectional transformers for lan- guage understanding. InProceedings of the 2019 Con- ference of the North American Chapter of the Associa- tion for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2-7, 2019, Volume 1 (Long and ...
work page 2019
-
[9]
Elementary, My Dear Watson: Non-invasive neural keyword spotting in the LibriBrain dataset
Elvers, G., Landau, G., and Parker Jones, O. Elementary, My Dear Watson: Non-invasive neural keyword spotting in the LibriBrain dataset. InData on the Brain & Mind Workshop at NeurIPS 2025,
work page 2025
-
[10]
He, K., Chen, X., Xie, S., Li, Y ., Doll´ar, P., and Girshick, R. B. Masked autoencoders are scalable vision learners. InIEEE/CVF Conference on Computer Vision and Pat- tern Recognition, CVPR 2022, New Orleans, LA, USA, June 18-24, 2022, pp. 15979–15988. IEEE,
work page 2022
-
[11]
Large brain model for learning generic representations with tremendous EEG data in BCI
Jiang, W., Zhao, L., and Lu, B. Large brain model for learning generic representations with tremendous EEG data in BCI. InThe Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11,
work page 2024
-
[12]
Self-normalizing neural networks
Klambauer, G., Unterthiner, T., Mayr, A., and Hochreiter, S. Self-normalizing neural networks. In Guyon, I., von Luxburg, U., Bengio, S., Wallach, H. M., Fergus, R., Vishwanathan, S. V . N., and Garnett, R. (eds.),Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4-9, 2017, Long...
work page 2017
-
[13]
The 2025 PNPL competition: Speech detection and phoneme classification in the Lib- riBrain dataset
Landau, G., ¨Ozdogan, M., Elvers, G., Mantegna, F., So- maiya, P., Jayalath, D., Kurth, L., Kwon, T., Shillingford, B., Farquhar, G., Jiang, M., Jerbi, K., Abdelhedi, H., Mantilla Ramos, Y ., Gulcehre, C., Woolrich, M., V oets, N., and Parker Jones, O. The 2025 PNPL competition: Speech detection and phoneme classification in the Lib- riBrain dataset. InNe...
work page 2025
-
[14]
Lee, J., Feng, T., Kommineni, A., Kadiri, S. R., and Narayanan, S. Enhancing listened speech decoding from EEG via parallel phoneme sequence prediction. In2025 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2025, Hyderabad, In- dia, April 6-11, 2025, pp. 1–5. IEEE, 2025a. Lee, N., Barmpas, K., Panagakis, Y ., Adamos, D., ...
work page 2025
-
[15]
Press, O., Smith, N. A., and Lewis, M. Train Short, Test Long: Attention with linear biases enables input length extrapolation. InThe Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29, 2022,
work page 2022
-
[16]
P., Mildenhall, B., Fridovich-Keil, S., Raghavan, N., Singhal, U., Ramamoorthi, R., Bar- ron, J
Tancik, M., Srinivasan, P. P., Mildenhall, B., Fridovich-Keil, S., Raghavan, N., Singhal, U., Ramamoorthi, R., Bar- ron, J. T., and Ng, R. Fourier features let networks learn high frequency functions in low dimensional domains. InAdvances in Neural Information Processing Systems 33: Annual Conference on Neural Information Process- ing Systems 2020, NeurIP...
work page 2020
-
[17]
van den Oord, A., Dieleman, S., Zen, H., Simonyan, K., Vinyals, O., Graves, A., Kalchbrenner, N., Senior, A. W., and Kavukcuoglu, K. Wavenet: A generative model for raw audio. InThe 9th ISCA Speech Synthesis Workshop, SSW 2016, Sunnyvale, CA, USA, September 13-15, 2016, pp
work page 2016
-
[18]
EEGPT: pretrained transformer for universal and reliable representation of EEG signals
Wang, G., Liu, W., He, Y ., Xu, C., Ma, L., and Li, H. EEGPT: pretrained transformer for universal and reliable representation of EEG signals. InAdvances in Neural In- formation Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, NeurIPS 2024, Vancouver, BC, Canada, December 10 - 15, 2024,
work page 2024
-
[19]
Tempo- rally consistent transformers for video generation
Yan, W., Hafner, D., James, S., and Abbeel, P. Tempo- rally consistent transformers for video generation. In International Conference on Machine Learning, ICML 2023, 23-29 July 2023, Honolulu, Hawaii, USA, volume 202 ofProceedings of Machine Learning Research, pp. 39062–39098. PMLR,
work page 2023
-
[20]
Yang, C., Westover, M. B., and Sun, J. BIOT: biosignal transformer for cross-data learning in the wild. InAd- vances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10 - 16, 2023,
work page 2023
-
[21]
Sig- moid loss for language image pre-training
Zhai, X., Mustafa, B., Kolesnikov, A., and Beyer, L. Sig- moid loss for language image pre-training. InIEEE/CVF International Conference on Computer Vision, ICCV 2023, Paris, France, October 1-6, 2023, pp. 11941– 11952. IEEE,
work page 2023
-
[22]
Zhang, B. and Sennrich, R. Root mean square layer normal- ization. In Wallach, H. M., Larochelle, H., Beygelzimer, A., d’Alch´e-Buc, F., Fox, E. B., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, December 8-14, 2019, Van- couver, BC, Canada, pp. 1...
work page 2019
-
[23]
12 MEG-XL: Data-Efficient Brain-to-Text via Long-Context Pre-Training A. Details on Experimental Setup A.1. Preprocessing We follow a minimal preprocessing pipeline similar to D´efossez et al. (2022) and d’Ascoli et al. (2025). We preprocess all recordings with a 0.1Hz high-pass and 40Hz low-pass filter and then resample the recording to 50Hz. Although te...
work page 2022
-
[24]
that predicts the target embedding. The output embeddings from model backbones are sliced according to their alignment with word stimuli and pooled in the time dimension before being flattened and concatenated. These new embeddings, each corresponding to a word, are given independently to the MLP head. We use a higher learning rate for the MLP, which is t...
work page 2025
-
[25]
developed a model pre-trained on MEG data and designed for simple speech decoding tasks (speech detection, phonetic feature classification). BrainOmni.Xiao et al. (2025) trained BrainOmni with a mix of both MEG and EEG data. The model’s tokenizer leverages sensor position, orientation, and type. As our approach also uses this information, we provide it di...
work page 2025
-
[26]
is a large-scale EEG foundation model trained with a masked patch prediction objective. As LaBraM’s learned time embeddings limit its context length, we had to reduce the neural context to 15 seconds. A.4. Supervised Word Decoding Baseline To collect experimental results for d’Ascoli et al. (2025), we ran the code released as part of the supplementary mat...
work page 2025
-
[27]
Nyquist-Compliant Resampling To match the preprocessing in d’Ascoli et al
C. Nyquist-Compliant Resampling To match the preprocessing in d’Ascoli et al. (2025), we applied a 40Hz low-pass filter before resampling the brain data to 50Hz. This made it possible to do a like-for-like comparison with d’Ascoli et al. (2025). However, this technically violates a tenet of signal processing—specifically, the Nyquist criterion requires th...
work page 2025
-
[28]
2048 Training steps 50 epochs (max
Loss Cross-entropy on masked tokens Fine-Tuning MLP head hidden dim. 2048 Training steps 50 epochs (max. with early stopping) Early stopping patience 10 epochs Early stopping metric Top-10 balanced accuracy on val. Batch size 50 words Learning rate (transformer) 1e-5 Learning rate (MLP head) 1e-3 Weight decay 1e-4 Gradient clipping 1.0 Optimizer AdamW (Lo...
work page 2048
-
[29]
Loss D-SigLIP (Zhai et al., 2023; d’Ascoli et al.,
work page 2023
-
[30]
because of BioCodec’s ability to reconstruct our MEG data, even at 50Hz, with lower reconstruction error. This is likely due to the fact that BioCodec does not compress the channel dimension, entailing a trade off as it results in more tokens than BrainTokenizer. Nevertheless, as the field has not yet advanced to a stage where we know precisely which part...
work page 2021
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.