Same Brain, Different Prediction: How Preprocessing Choices Undermine EEG Decoding Reliability
Pith reviewed 2026-05-11 02:33 UTC · model grok-4.3
The pith
EEG predictions flip for up to 42% of trials when only preprocessing changes
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Preprocessing choices constitute a counterfactual intervention space whose variation produces large trial-level instability in EEG decoding. Up to 42% of predictions flip across six datasets when only these choices are altered. A Walsh-Hadamard decomposition of the 2^7 binary pipeline space shows that sensitivity is near-additive under the chosen intervention design. Preprocessing Uncertainty is defined as a per-trial diagnostic that captures a dimension of instability complementary to model-based confidence. Normalized Adaptive PGI is presented as a graph-structured regularizer that exploits the compositional structure of the interventions.
What carries the argument
The counterfactual intervention space over seven binary preprocessing choices, decomposed by Walsh-Hadamard transform to expose near-additive sensitivity to prediction flips.
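The review does not reproduce the paper's decomposition, but the mechanics can be sketched under stated assumptions: index each of the 2^7 pipelines by a sign vector in {-1, +1}^7, attach a synthetic flip-sensitivity score (fabricated here to be near-additive, since the paper's actual scores are not available), and read off how much of the Walsh-Hadamard spectrum is carried by the seven first-order coefficients.

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(0)

# Hypothetical flip-sensitivity f(x) over the 2^7 binary pipeline space,
# constructed to be near-additive: main effects plus a small interaction.
configs = np.array(list(product([-1, 1], repeat=7)))  # shape (128, 7)
main_effects = rng.normal(0, 1, 7)
f = configs @ main_effects + 0.05 * configs[:, 0] * configs[:, 1]

def walsh_coeff(f, configs, subset):
    """Walsh-Hadamard coefficient w_S = mean over x of f(x) * prod_{i in S} x_i."""
    chi = np.prod(configs[:, subset], axis=1) if subset else np.ones(len(f))
    return float(np.mean(f * chi))

first_order = [walsh_coeff(f, configs, [i]) for i in range(7)]

# Parseval for Walsh functions: squared coefficients over all 128 subsets
# sum to mean(f^2), so this ratio measures how additive the surface is.
energy_total = np.mean(f ** 2)
energy_first = sum(w ** 2 for w in first_order)
additivity = energy_first / energy_total
print(f"additive fraction of spectral energy: {additivity:.3f}")
```

A near-additive surface puts almost all spectral energy in the first-order terms, which is what licenses optimizing one preprocessing step at a time instead of enumerating all 128 combinations.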
If this is right
- Prediction reliability assessments in EEG decoding must include variation over preprocessing pipelines rather than conditioning on a single fixed one.
- The near-additive property permits efficient sequential optimization of preprocessing steps without enumerating all 128 combinations.
- Preprocessing Uncertainty can be computed alongside existing model confidence scores to identify trials that are unstable for reasons outside the model itself.
- Normalized Adaptive PGI offers one regularization approach whose effectiveness is bounded by clear scope conditions on the intervention graph.
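The paper's exact PU formula is not stated in this summary. One natural per-trial instantiation, offered purely as an illustrative stand-in rather than the authors' definition, is the disagreement rate of a trial's predictions across pipeline variants:

```python
import numpy as np

def preprocessing_uncertainty(preds):
    """Per-trial disagreement across preprocessing pipelines.

    preds: (n_pipelines, n_trials) integer class predictions, one row per
    preprocessing variant. Returns, for each trial, the fraction of
    pipelines that disagree with that trial's majority-vote label.
    This is a plausible stand-in for the paper's PU, not its definition.
    """
    preds = np.asarray(preds)
    n_pipelines, n_trials = preds.shape
    pu = np.empty(n_trials)
    for t in range(n_trials):
        _, counts = np.unique(preds[:, t], return_counts=True)
        pu[t] = 1.0 - counts.max() / n_pipelines
    return pu

# Toy example: 4 pipeline variants, 3 trials.
preds = np.array([[0, 1, 1],
                  [0, 1, 0],
                  [0, 0, 1],
                  [0, 1, 1]])
print(preprocessing_uncertainty(preds))  # trial 0 is stable; trials 1-2 are not
```

Because this quantity is computed over pipeline variants rather than over model posteriors, it can flag trials that look confidently classified under any single fixed pipeline.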
Where Pith is reading between the lines
- Literature claims of high EEG decoding accuracy may require re-examination when preprocessing pipelines are not ablated.
- Extending the binary intervention design to continuous parameter ranges or additional datasets would test whether near-additivity holds more broadly.
- In applied brain-computer interfaces, unaccounted preprocessing instability could produce inconsistent control signals for the same user across sessions.
Load-bearing premise
That the seven binary preprocessing choices and the six chosen datasets adequately represent the variability encountered in typical EEG practice, and that the observed prediction flips are not artifacts of the particular model architectures or random seeds used.
What would settle it
An EEG dataset and model pair in which changing the preprocessing pipeline produces prediction flips on fewer than 5% of trials would show that the reported instability is not general.
Original abstract
Electroencephalography (EEG) is a cornerstone of brain-computer interfaces and clinical neuroscience, yet deep learning models are typically trained and evaluated under a single, unreported preprocessing pipeline. We formalize preprocessing choices as a counterfactual intervention space and show that EEG predictions are surprisingly unstable under this space: across six datasets spanning four paradigms, up to 42% of trial-level predictions flip when only the preprocessing changes, a variability that standard uncertainty methods do not explicitly quantify because they condition on a fixed preprocessing pipeline. We provide three tools to make this instability measurable, decomposable, and reducible. First, a Walsh-Hadamard decomposition of the 2^7 pipeline space reveals that sensitivity is near-additive in practice under the binary intervention design, enabling efficient step-by-step optimization. Second, we introduce Preprocessing Uncertainty (PU), a per-trial diagnostic that captures a dimension of instability complementary to model-based confidence. Third, we study Normalized Adaptive PGI (NA-PGI), a graph-structured regularizer that exploits the compositional structure of preprocessing interventions as one mitigation strategy with clear scope conditions.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript claims that EEG decoding models exhibit substantial instability to preprocessing pipeline choices: across six datasets spanning four paradigms, up to 42% of trial-level predictions flip when only the preprocessing pipeline is altered. It formalizes preprocessing decisions as a 2^7 counterfactual intervention space, demonstrates that sensitivity is near-additive via Walsh-Hadamard decomposition, and introduces Preprocessing Uncertainty (PU) as a per-trial diagnostic complementary to model confidence, together with Normalized Adaptive PGI (NA-PGI), a graph-structured regularizer intended to reduce the instability.
Significance. If the central empirical result holds under broader controls, the work is significant because it identifies a previously unquantified source of variability that standard uncertainty quantification methods (which fix the pipeline) systematically omit. The near-additive decomposition and the two new metrics provide concrete, decomposable tools that could improve reliability in BCI and clinical EEG applications; the paper also supplies a clear scope for when the mitigation strategy applies.
major comments (2)
- [Results] Results section (the 42% headline figure): the central claim that prediction flips reflect inherent preprocessing instability requires evidence that the rate is not an artifact of the specific deep architectures and single random seeds used. No ablations over alternative models or multiple initializations are reported, leaving open the possibility that more robust architectures would exhibit substantially lower flip rates and thereby weaken the assertion that standard uncertainty methods miss a complementary dimension.
- [Methods] Methods / Experimental Setup: the abstract and main text provide concrete numbers across six datasets yet omit exact model architectures, the precise statistical tests used to establish significance of the flip rates, and the trial exclusion criteria. These omissions make the 42% claim difficult to evaluate or reproduce and constitute a load-bearing gap for the reliability conclusion.
minor comments (1)
- [Methods] The notation for the seven binary preprocessing choices and the precise definition of the intervention space should be stated explicitly in a dedicated table or subsection to aid reproducibility.
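To illustrate what such a table or subsection might pin down, a minimal encoding of the intervention space follows; the seven step names are placeholders, since the review does not list the paper's actual choices:

```python
from itertools import product

# Hypothetical names for the seven binary choices; the paper's actual
# steps are not enumerated in this review, so these are placeholders.
STEPS = ["bandpass", "notch", "rereference", "artifact_reject",
         "baseline_correct", "resample", "normalize"]

def pipeline_id(bits):
    """Encode a 7-tuple of 0/1 flags as an integer in [0, 127]."""
    assert len(bits) == len(STEPS)
    return sum(b << i for i, b in enumerate(bits))

# The full intervention space is the set of all 2^7 = 128 flag combinations.
space = list(product([0, 1], repeat=len(STEPS)))
assert len(space) == 2 ** 7 == 128
print(pipeline_id((1, 0, 1, 0, 0, 0, 0)))  # -> 5
```

Fixing such an encoding once makes every downstream quantity (flip rates, Walsh-Hadamard coefficients, PU) unambiguous and reproducible from the pipeline index alone.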
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed review. The comments highlight important aspects of reproducibility and robustness that we address below. We have revised the manuscript to incorporate additional details and experiments where feasible.
Point-by-point responses
Referee: [Results] Results section (the 42% headline figure): the central claim that prediction flips reflect inherent preprocessing instability requires evidence that the rate is not an artifact of the specific deep architectures and single random seeds used. No ablations over alternative models or multiple initializations are reported, leaving open the possibility that more robust architectures would exhibit substantially lower flip rates and thereby weaken the assertion that standard uncertainty methods miss a complementary dimension.
Authors: We agree that ablations across architectures and seeds would further strengthen the central claim. Our experiments employed standard EEG decoding models (e.g., variants of EEGNet) across six datasets and four paradigms, with the observed flip rates reaching 42% even under these conditions. This suggests the instability arises primarily from the preprocessing intervention space rather than model idiosyncrasies. To address the concern directly, the revised manuscript will include results from multiple random seeds and one additional architecture, reporting the distribution of flip rates to demonstrate that the phenomenon persists and remains complementary to standard uncertainty quantification. revision: yes
Referee: [Methods] Methods / Experimental Setup: the abstract and main text provide concrete numbers across six datasets yet omit exact model architectures, the precise statistical tests used to establish significance of the flip rates, and the trial exclusion criteria. These omissions make the 42% claim difficult to evaluate or reproduce and constitute a load-bearing gap for the reliability conclusion.
Authors: We apologize for these omissions in the main text. The exact model architectures, hyperparameters, statistical tests (binomial tests for significance of flip rates against a null of no change), and trial exclusion criteria (amplitude-based artifact rejection thresholds) were provided in the supplementary materials. In the revised version, we will expand the Methods section with a dedicated experimental setup subsection containing these details, along with pseudocode for the pipeline enumeration and statistical analysis, to ensure full reproducibility. revision: yes
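As a sketch of the binomial test the authors describe, with hypothetical counts (the true trial numbers and null rate are not given here), the one-sided tail can be computed exactly in rational arithmetic:

```python
from fractions import Fraction
from math import comb

# Hypothetical counts: 420 of 1000 trials flip under a pipeline change,
# tested one-sided against an assumed baseline flip probability of 5%.
# Rational arithmetic avoids float underflow in the extreme tail.
k, n, p0 = 420, 1000, Fraction(1, 20)
p_value = sum(comb(n, j) * p0**j * (1 - p0)**(n - j) for j in range(k, n + 1))
print("flip rate =", k / n, "| p-value below 1e-100:", p_value < Fraction(1, 10**100))
```

With a flip rate this far above the assumed baseline, the tail probability is astronomically small; the substantive question the referee raises is therefore not significance but whether the null and the counts themselves are the right ones.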
Circularity Check
No significant circularity detected
full rationale
The paper's core results consist of empirical measurements of trial-level prediction flips (up to 42%) under enumerated preprocessing interventions across six datasets, together with the introduction of a Walsh-Hadamard decomposition of the 2^7 space, the per-trial Preprocessing Uncertainty (PU) diagnostic, and the Normalized Adaptive PGI regularizer. These quantities are defined directly from the intervention design and observed model outputs, without fitted parameters renamed as predictions, self-definitional loops, or load-bearing self-citations. The decomposition is a standard linear-algebraic tool applied to the binary pipeline space, and the new metrics are constructed to be complementary to existing uncertainty measures rather than tautological with the flip statistics themselves. The derivation chain therefore remains self-contained and checkable against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption Preprocessing choices can be represented as a 2^7 binary intervention space
- domain assumption Trial-level prediction flips are a meaningful measure of model instability
invented entities (2)
- Preprocessing Uncertainty (PU): no independent evidence
- Normalized Adaptive PGI (NA-PGI): no independent evidence
Reference graph
Works this paper leans on
[1] Yannick Roy, Hubert Banville, Isabela Albuquerque, Alexandre Gramfort, Tiago H Falk, and Jocelyn Faubert. Deep learning-based electroencephalography analysis: a systematic review. Journal of Neural Engineering, 16(5):051001, 2019
[2] Alexander Craik, Yongtian He, and Jose L Contreras-Vidal. Deep learning for electroencephalogram (EEG) classification tasks: a review. Journal of Neural Engineering, 16(3):031001, 2019
[3] Demetres Kostas, Stéphane Aroca-Ouellette, and Frank Rudzicz. BENDR: Using transformers and a contrastive self-supervised learning task to learn from massive amounts of EEG data. Frontiers in Human Neuroscience, 15:653659, 2021
[4] Wei-Bang Jiang, Li-Ming Zhao, and Bao-Liang Lu. Large brain model for learning generic representations with tremendous EEG data in BCI. In International Conference on Learning Representations, 2024
[5] Chaoqi Yang, M Brandon Westover, and Jimeng Sun. BIOT: Biosignal transformer for cross-data learning in the wild. Advances in Neural Information Processing Systems, 36, 2023
[6] Yuchen Zhou et al. CSBrain: A cross-scale spatiotemporal brain foundation model for EEG decoding. In Advances in Neural Information Processing Systems (NeurIPS), 2025
[7] Zitao Fang et al. NeurIPT: Foundation model for neural interfaces. In Advances in Neural Information Processing Systems (NeurIPS), 2025
[8] Yassine El Ouahidi et al. REVE: A foundation model for EEG — adapting to any setup with large-scale pretraining on 25,000 subjects. In Advances in Neural Information Processing Systems (NeurIPS), 2025
[9] Yarin Gal and Zoubin Ghahramani. Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. In International Conference on Machine Learning (ICML), pages 1050–1059, 2016
[10] Balaji Lakshminarayanan, Alexander Pritzel, and Charles Blundell. Simple and scalable predictive uncertainty estimation using deep ensembles. In NeurIPS, 2017
[11] Nima Bigdely-Shamlo, Tim Mullen, Christian Kothe, Kyung-Min Su, and Kay A Robbins. The PREP pipeline: standardized preprocessing for large-scale EEG analysis. Frontiers in Neuroinformatics, 9:16, 2015
[12] Cristina Gil Ávila et al. DISCOVER-EEG: an open, fully automated EEG pipeline for biomarker discovery in clinical neuroscience. Scientific Data, 10:613, 2023
[13] Neil W Bailey et al. Introducing RELAX: an automated pre-processing pipeline for cleaning EEG data. Clinical Neurophysiology, 149:178–201, 2023
[14] Adriana Böttcher et al. Standardizing EEG preprocessing for cross-site integration—the CLEAN pipeline. NeuroImage, 328:121812, 2026
[15] Federico Del Pup, Andrea Zanola, Louis Fabrice Tshimanga, Alessandra Bertoldo, and Manfredo Atzori. The more, the better? Evaluating the role of EEG preprocessing for deep learning applications. IEEE Transactions on Neural Systems and Rehabilitation Engineering, 33:1061–1070, 2025
[16] Roman Kessler et al. How EEG preprocessing shapes decoding performance. Communications Biology, 8:1039, 2025
[17] Robert Geirhos, Jörn-Henrik Jacobsen, Claudio Michaelis, Richard Zemel, Wieland Brendel, Matthias Bethge, and Felix A Wichmann. Shortcut learning in deep neural networks. Nature Machine Intelligence, 2(11):665–673, 2020
[18] Alexander D'Amour, Katherine Heller, Dan Moldovan, Ben Adlam, Babak Alipanahi, Alex Beutel, Christina Chen, Jonathan Deaton, Jacob Eisenstein, Matthew D Hoffman, et al. Underspecification presents challenges for credibility in modern machine learning. Journal of Machine Learning Research, 23(226):1–61, 2022
[19] Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning (still) requires rethinking generalization. Communications of the ACM, 64(3):107–115, 2021
[20] Sara Steegen, Francis Tuerlinckx, Andrew Gelman, and Wolf Vanpaemel. Increasing transparency through a multiverse analysis. Perspectives on Psychological Science, 11(5):702–712, 2016
[21] Rotem Botvinik-Nezer et al. Variability in the analysis of a single neuroimaging dataset by many teams. Nature, 582:84–88, 2020
[22] Cassie Ann Short et al. Lost in a large EEG multiverse? Comparing sampling approaches for representative pipeline selection. Journal of Neuroscience Methods, 424:110564, 2025
[23] Xinhui Li et al. Pipeline-invariant representation learning for neuroimaging. arXiv preprint arXiv:2208.12909, 2022
[24] Cheng Wang, Yu Jiang, Zhihao Peng, Chenxin Li, Chang-bae Bang, Lin Zhao, Wanyi Fu, Jinglei Lv, Jorge Sepulcre, Carl Yang, Lifang He, Tianming Liu, Xue-Jun Kong, Quanzheng Li, Daniel S Barron, Anqi Qiu, Randy Hirschtick, Byung-Hoon Kim, Hongbin Han, Xiang Li, and Yixuan Yuan. Towards a general-purpose foundation model for functional MRI analysis. Nature Bi...
[25] Martin Arjovsky, Léon Bottou, Ishaan Gulrajani, and David Lopez-Paz. Invariant risk minimization. arXiv preprint arXiv:1907.02893, 2019
[26] Shiori Sagawa, Pang Wei Koh, Tatsunori B Hashimoto, and Percy Liang. Distributionally robust neural networks for group shifts: On the importance of regularization for worst-case generalization. In ICLR, 2020
[27] Yaroslav Ganin, Evgeniya Ustinova, Hana Ajakan, Pascal Germain, Hugo Larochelle, François Laviolette, Mario Marchand, and Victor Lempitsky. Domain-adversarial training of neural networks. Journal of Machine Learning Research, 17(59):1–35, 2016
[28] Kaiyang Zhou, Ziwei Liu, Yu Qiao, Tianyue Xiang, and Chen Change Loy. Domain generalization: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(4):4396–4415, 2022
[29] Chun-Yen Chang, Sheng-Hsiou Hsu, Luca Pion-Tonachini, and Tzyy-Ping Jung. Evaluation of artifact subspace reconstruction for automatic artifact components removal in multi-channel EEG recordings. IEEE Transactions on Biomedical Engineering, 67(4):1114–1121, 2020
[30] Mainak Jas, Denis A Engemann, Yousra Bekhti, Francesca Raimondo, and Alexandre Gramfort. Autoreject: Automated artifact rejection for MEG and EEG data. NeuroImage, 159:417–429, 2017
[31] Vernon J Lawhern, Amelia J Solon, Nicholas R Waytowich, Stephen M Gordon, Chou P Hung, and Brent J Lance. EEGNet: a compact convolutional neural network for EEG-based brain-computer interfaces. Journal of Neural Engineering, 15(5):056013, 2018
[32] Clemens Brunner, Robert Leeb, Gernot Müller-Putz, Alois Schlögl, and Gert Pfurtscheller. BCI competition 2008 – Graz data set A. Institute for Knowledge Discovery, Graz University of Technology, 2008
[33] Vinay Jayaram and Alexandre Barachant. MOABB: trustworthy algorithm benchmarking for BCIs. Journal of Neural Engineering, 15(6):066011, 2018
[34] Gerwin Schalk, Dennis J McFarland, Thilo Hinterberger, Niels Birbaumer, and Jonathan R Wolpaw. BCI2000: a general-purpose brain-computer interface (BCI) system. IEEE Transactions on Biomedical Engineering, 51(6):1034–1043, 2004
[35] Bob Kemp, Aeilko H Zwinderman, Bert Tuk, Hilbert A C Kamphuisen, and Josefien J L Oberye. Analysis of a sleep-dependent neuronal feedback loop: the slow-wave microcontinuity of the EEG. IEEE Transactions on Biomedical Engineering, 47(9):1185–1194, 2000
[36] Ary L Goldberger et al. PhysioBank, PhysioToolkit, and PhysioNet: components of a new research resource for complex physiologic signals. Circulation, 101(23):e215–e220, 2000
[37] Wei-Long Zheng, Wei Liu, Yifei Lu, Bao-Liang Lu, and Andrzej Cichocki. EmotionMeter: A multimodal framework for recognizing human emotions. IEEE Transactions on Cybernetics, 49(3):1110–1122, 2019
[38] Alexandre Gramfort, Martin Luessi, Eric Larson, Denis A Engemann, Daniel Strohmeier, Christian Brodbeck, Lauri Parkkonen, and Matti S Hämäläinen. MEG and EEG data analysis with MNE-Python. Frontiers in Neuroscience, 7:267, 2013
[39] K. G. Beauchamp. Walsh Functions and Their Applications. Academic Press, 1975
[40] Moloud Abdar, Farhad Pourpanah, Sadiq Hussain, Dana Rezazadegan, Li Liu, et al. A review of uncertainty quantification in deep learning: Techniques, applications and challenges. Information Fusion, 76:243–297, 2021
[41] Ishaan Gulrajani and David Lopez-Paz. In search of lost domain generalization. In ICLR, 2021
[42] Baochen Sun and Kate Saenko. Deep CORAL: Correlation alignment for deep domain adaptation. In European Conference on Computer Vision (ECCV) Workshops, pages 443–450, 2016
[43] Arnaud Delorme. EEG is better left alone. Scientific Reports, 13:2372, 2023
[44] Robin Tibor Schirrmeister, Jost Tobias Springenberg, Lukas Dominique Josef Fiederer, Martin Glasstetter, Katharina Eggensperger, Michael Tangermann, Frank Hutter, Wolfram Burgard, and Tonio Ball. Deep learning with convolutional neural networks for EEG decoding and visualization. Human Brain Mapping, 38(11):5391–5420, 2017
[45] Benjamin Blankertz, K-R Müller, Dean J Krusienski, Gerwin Schalk, Jonathan R Wolpaw, et al. The BCI competition III: validating alternative approaches to actual BCI problems. IEEE Transactions on Neural Systems and Rehabilitation Engineering, 14(2):153–159, 2006
discussion (0)