Quantifying the Generalization Gap in Seizure Detection: A Large-Scale Empirical Benchmark via the SzCORE Challenge

Amirhossein Shahbazinia; Christodoulos Kechris; David Atienza; Jonathan Dan

arxiv: 2505.18191 · v2 · pith:W5WLHW2Rnew · submitted 2025-05-19 · 📡 eess.SP · cs.AI · cs.LG · cs.PF

Pith reviewed 2026-05-22 13:55 UTC · model grok-4.3

classification 📡 eess.SP · cs.AI · cs.LG · cs.PF

keywords: seizure detection, EEG, generalization gap, benchmark, machine learning, epilepsy, event-based evaluation

The pith

A benchmark of 28 seizure detection algorithms finds the best F1 score reaches only 32 percent on a large held-out patient dataset.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests a wide range of seizure detection methods, from classical signal processing to modern neural networks, on continuous EEG recordings that were never seen during model development. A private collection of 4360 hours from 65 subjects, annotated by experts, serves as the test bed. The highest F1 score among all entries is 32 percent, with sensitivity at 37 percent and precision at 29 percent. Algorithms that lead in overall scores often fail to maintain steady rankings when performance is broken down by individual subject. The study therefore documents a sizable gap between results reported on internal test sets and results obtained under strict external evaluation.
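As a quick sanity check on the figures quoted above, F1 is the harmonic mean of precision and sensitivity. The sketch below recomputes it from the rounded percentages; the function name `f1_from` is illustrative and not taken from the paper. Because the inputs are rounded, and the challenge may aggregate scores across subjects in a way this summary does not describe, the result only needs to land close to the reported 32 percent.

```python
def f1_from(sensitivity: float, precision: float) -> float:
    """Harmonic mean of sensitivity (recall) and precision."""
    if sensitivity + precision == 0:
        return 0.0
    return 2 * sensitivity * precision / (sensitivity + precision)


# Rounded values reported for the top entry: sensitivity 37%, precision 29%.
top_f1 = f1_from(0.37, 0.29)
print(f"{top_f1:.3f}")  # prints 0.325
```

The recomputed value sits within rounding distance of the reported top F1, so the three headline numbers are mutually consistent.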

Core claim

When 28 state-of-the-art seizure detection algorithms are evaluated on a strictly held-out dataset of 4360 hours of continuous EEG from 65 subjects, the highest event-based F1 score is 32 percent. The same algorithms exhibit substantial variation in ranking across individual subjects, so that those with the strongest aggregate scores do not deliver the most stable performance from patient to patient.

What carries the argument

The SzCORE event-based scoring framework applied to a private, expert-annotated EEG dataset that is withheld from all participants until after model submission.
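To make "event-based scoring" concrete, here is a minimal sketch that scores predicted seizure events against annotated ones using a simple any-overlap rule. This is an assumption made for illustration: the actual SzCORE framework defines its own matching rules, which this summary does not spell out, and the names `Event`, `overlaps`, and `score_events` as well as the toy intervals are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    start: float  # seconds from the start of the recording
    end: float    # seconds from the start of the recording


def overlaps(a: Event, b: Event) -> bool:
    """True when the two intervals share any time."""
    return a.start < b.end and b.start < a.end


def score_events(reference, hypothesis, duration_hours):
    """Event-level sensitivity, precision, F1, and false positives per day."""
    # A reference seizure counts as detected if any predicted event overlaps it.
    detected = sum(any(overlaps(r, h) for h in hypothesis) for r in reference)
    # A predicted event is a true positive if it overlaps any reference seizure.
    true_pos = sum(any(overlaps(h, r) for r in reference) for h in hypothesis)
    false_pos = len(hypothesis) - true_pos

    sensitivity = detected / len(reference) if reference else 0.0
    precision = true_pos / len(hypothesis) if hypothesis else 0.0
    denom = sensitivity + precision
    f1 = 2 * sensitivity * precision / denom if denom else 0.0
    fp_per_day = false_pos / (duration_hours / 24.0) if duration_hours > 0 else 0.0
    return {
        "sensitivity": sensitivity,
        "precision": precision,
        "f1": f1,
        "fp_per_day": fp_per_day,
    }


# Toy 24-hour recording: two annotated seizures, three predicted events.
reference = [Event(100, 160), Event(1000, 1050)]
hypothesis = [Event(110, 130), Event(500, 520), Event(3000, 3010)]
result = score_events(reference, hypothesis, duration_hours=24)
print(result)
```

In this toy case one of two seizures is found and two of three predictions are false alarms, giving sensitivity 0.5, precision of about 0.33, and two false positives per day. Counting whole events rather than individual samples is what distinguishes this style of metric from sample-wise accuracy.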

If this is right

Reported performance figures in the literature must be treated as upper bounds until confirmed on independent data.
Peak aggregate scores alone are insufficient; models should also demonstrate stable ranking across subjects.
Future algorithm development can use the same held-out infrastructure to measure genuine progress.
Clinical translation of automatic seizure detection will require evidence of performance on data drawn from varied hospitals and populations.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the observed performance ceiling holds on additional external sets, widespread clinical deployment of fully automatic detectors may remain impractical for years.
The benchmark could be extended by adding recordings from pediatric or ICU populations to test whether the current gap is even larger in those groups.
Teams could retrain the top entries on the public training data while explicitly optimizing for cross-subject consistency rather than aggregate F1.

Load-bearing premise

The 4360 hours of recordings from 65 subjects capture the full diversity of patients, clinical environments, and recording conditions that would be met in routine medical use.

What would settle it

Re-run the same 28 algorithms on a new multi-center dataset collected with different EEG hardware and patient demographics. A top F1 score above 50 percent there would suggest that the low ceiling reported here is specific to this cohort rather than a general limit.

Original abstract

Reliable automatic seizure detection from long-term electroencephalography (EEG) remains an unsolved challenge, as current models often fail to generalize across patients or clinical settings. Manual EEG review still is the standard of care, highlighting the need for robust models and standardized evaluation. The current literature often reports high efficacy, yet these models frequently fail when deployed to unseen patient populations. To rigorously assess this generalization gap, we conducted a large-scale empirical study evaluating 28 state-of-the-art algorithmic architectures, ranging from classical feature engineering to modern Deep Learning. These algorithms were collected by organizing a competition. A strictly held-out private dataset of continuous EEG recordings from 65 subjects, totaling 4,360 hours of data, was utilized to evaluate algorithm performance. Expert neurophysiologists annotated these recordings, establishing the ground truth for seizure events. Algorithms were evaluated using event-based metrics from the SzCORE framework, including sensitivity, precision, F1-score, and false positive rate per day. Results revealed significant performance variability among state-of-the-art approaches, with the top F1 score of 32% (sensitivity 37%, precision 29%), highlighting the persistent difficulty of this task. Analysis uncovered a discordance between peak performance and population-level stability. The algorithms achieving the highest aggregate F1-scores did not achieve the most consistent ranking across subjects. This independent evaluation exposed a notable gap between self-reported efficacies and hold-out performance, underscoring the critical need for standardized, rigorous benchmarking. The evaluation infrastructure transitions into a continuously open benchmarking platform, fostering reproducible research and accelerating robust seizure detection algorithm development.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

This benchmark puts a number on the generalization gap with 32% top F1 on 4360 hours of held-out data, but the value hinges on whether the 65-subject set reflects real clinical variety.

The main thing to take from this paper is that even the best of 28 algorithms only hit 32% F1 (37% sensitivity, 29% precision) on a large private held-out EEG set, and the ones with the highest overall scores were not the most stable across individual subjects. That discordance between aggregate and per-patient performance is the clearest new observation here.

They ran a competition to gather the algorithms, applied SzCORE event-based metrics, and used expert annotations on 4360 hours from 65 subjects. That scale is bigger than most single studies in the literature, and the shift to a continuously open platform is a practical step toward better reproducibility. The work is straightforward empirical benchmarking with no fitted parameters or circular claims, so the numbers stand on their own as measured performance against external labels. The private data and standardized metrics give it more weight than the usual self-reported results on small public sets.

One limitation is the lack of detail on the test cohort. The abstract does not describe multi-center sampling, hardware variation, or how the 65 subjects were chosen, so it is hard to know how far the observed gap travels beyond this particular collection. If the recordings are mostly from one site or similar conditions, the low scores could partly reflect dataset difficulty rather than a universal problem. Algorithm selection criteria also get little space, which leaves room for questions about whether the 28 entries are truly representative of current approaches.

This paper is aimed at researchers and engineers working on EEG seizure detection who need realistic numbers for comparison. It is worth a serious referee process because the scale and the per-subject variability findings are concrete enough to discuss and refine. I would send it for review with requests for more on data provenance and selection process.

Referee Report

2 major / 1 minor

Summary. The paper reports results from the SzCORE Challenge, in which 28 state-of-the-art seizure detection algorithms (spanning classical feature engineering to deep learning) were evaluated on a strictly held-out private dataset of 4360 hours of continuous EEG from 65 subjects with expert neurophysiologist annotations. Using event-based SzCORE metrics, the top algorithm achieves an F1-score of 32% (sensitivity 37%, precision 29%), and algorithms with the highest aggregate scores do not achieve the most consistent per-subject rankings. The work positions this as evidence of a generalization gap relative to self-reported results and transitions the evaluation infrastructure into an open, continuously available benchmarking platform.

Significance. If the empirical results hold, the study is significant for providing a large-scale, independent benchmark that quantifies the performance drop on unseen data and introduces per-subject consistency as an additional evaluation axis. The creation of an open platform directly addresses reproducibility challenges in the field and supplies falsifiable performance numbers against which future algorithms can be tested.

major comments (2)

[Abstract] Abstract and dataset description: the private held-out dataset is described only as coming from 65 subjects with expert annotations and totaling 4360 hours, with no details on multi-center sampling, demographic stratification, epilepsy subtype distribution, recording hardware/montage variations, or explicit exclusion rules. This information is load-bearing for the central claim that the observed top F1 of 32% demonstrates a generalizable gap rather than a property of this particular cohort.
[Abstract] Abstract and methods: limited information is supplied on the criteria used to select or solicit the 28 algorithms and on safeguards against competition-induced biases (e.g., self-selection of easier test cases or post-hoc tuning). These details are required to support the interpretation that the ranking discordance and performance gap reflect real-world generalization rather than artifacts of the challenge design.

minor comments (1)

[Results] Results section: the claim that highest-aggregate algorithms lack consistent per-subject ranking would be strengthened by reporting a quantitative measure (e.g., rank variance or Kendall tau across subjects) rather than a qualitative statement.
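A minimal sketch of the kind of quantitative measure the referee asks for: rank the algorithms within each subject, then summarize how much each algorithm's rank moves across subjects, and compare two subjects' orderings with Kendall's tau. The per-subject F1 values and algorithm names are invented for illustration and are not results from the paper; the tau below is the plain tau-a variant, which ignores ties.

```python
from itertools import combinations
from statistics import mean, pstdev

# Hypothetical per-subject F1 scores: scores[algorithm][i] belongs to subject i.
scores = {
    "algo_A": [0.62, 0.05, 0.55, 0.10],
    "algo_B": [0.35, 0.30, 0.33, 0.32],
    "algo_C": [0.20, 0.25, 0.10, 0.40],
}


def ranks_per_subject(scores):
    """Rank algorithms within each subject (1 = best F1); ties broken by name."""
    names = sorted(scores)
    n_subjects = len(next(iter(scores.values())))
    ranks = {name: [] for name in names}
    for i in range(n_subjects):
        ordered = sorted(names, key=lambda n: -scores[n][i])
        for position, name in enumerate(ordered, start=1):
            ranks[name].append(position)
    return ranks


def kendall_tau(x, y):
    """Kendall tau-a between two equal-length rank lists (no tie correction)."""
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        product = (x[i] - x[j]) * (y[i] - y[j])
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) / 2
    return (concordant - discordant) / n_pairs


ranks = ranks_per_subject(scores)
mean_f1 = {name: mean(vals) for name, vals in scores.items()}
stability = {name: pstdev(r) for name, r in ranks.items()}  # lower = steadier

names = sorted(scores)
subject0 = [ranks[n][0] for n in names]
subject1 = [ranks[n][1] for n in names]
tau = kendall_tau(subject0, subject1)

for name in names:
    print(name, round(mean_f1[name], 3), ranks[name], round(stability[name], 3))
print("tau between subject 0 and subject 1:", round(tau, 3))
```

In this toy data `algo_A` has the highest mean F1 but swings between first and last place, while `algo_B` holds a steadier rank, which is the pattern the paper describes qualitatively. Note that the mean of per-subject F1 used here is a stand-in and is not necessarily the aggregate score the challenge reports.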

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments and the recommendation for minor revision. We address each major comment below and have revised the manuscript accordingly to strengthen the presentation of the dataset and challenge design.

Referee: [Abstract] Abstract and dataset description: the private held-out dataset is described only as coming from 65 subjects with expert annotations and totaling 4360 hours, with no details on multi-center sampling, demographic stratification, epilepsy subtype distribution, recording hardware/montage variations, or explicit exclusion rules. This information is load-bearing for the central claim that the observed top F1 of 32% demonstrates a generalizable gap rather than a property of this particular cohort.

Authors: We agree that expanding the dataset description would better support the interpretation of a generalizable performance gap. While the full manuscript provides additional context in the Methods section on data acquisition, we have revised the abstract and added a concise summary paragraph in the revised manuscript detailing multi-center aspects, available demographic stratification, epilepsy subtype distribution, hardware/montage information, and exclusion rules to the extent permitted by data-sharing agreements. This makes the supporting information more accessible without altering the core results.
Revision: yes

Referee: [Abstract] Abstract and methods: limited information is supplied on the criteria used to select or solicit the 28 algorithms and on safeguards against competition-induced biases (e.g., self-selection of easier test cases or post-hoc tuning). These details are required to support the interpretation that the ranking discordance and performance gap reflect real-world generalization rather than artifacts of the challenge design.

Authors: The algorithms were solicited via an open call for the SzCORE Challenge, with participation criteria and submission requirements described in the Methods. To prevent post-hoc tuning or selection biases, all entries were evaluated on a strictly private held-out dataset inaccessible to participants, and only fixed code implementations were accepted and executed by the organizers. We have added an explicit subsection in the revised Methods detailing the solicitation process, eligibility rules, and bias-mitigation safeguards to address this point directly.
Revision: yes

Circularity Check

0 steps flagged

No circularity: pure empirical benchmark with direct measurement on held-out data

The paper conducts a competition-based evaluation of 28 algorithms on a strictly held-out private EEG dataset of 4360 hours from 65 subjects, with performance (top F1 32%) computed directly from expert annotations using event-based metrics. No mathematical derivations, fitted parameters, predictions, or first-principles claims exist that could reduce to inputs by construction. The central result is an observed performance gap measured against external ground truth, with no self-referential loops or load-bearing self-citations in the evaluation logic. The study is self-contained as an empirical benchmark.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim depends on the held-out dataset being representative and on expert annotations serving as accurate ground truth; no free parameters or invented entities are introduced in the benchmark itself.

axioms (1)

Domain assumption: Expert neurophysiologists provide accurate ground truth annotations for seizure events in the EEG recordings.
All performance metrics are computed against these annotations as the reference standard.


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

DANCE: Detect and Classify Events in EEG
cs.LG 2026-05 unverdicted novelty 6.0

DANCE frames EEG event identification as a set-prediction problem to jointly detect and classify events directly from raw, unaligned signals, outperforming existing methods on seizure monitoring and matching onset-inf...