pith. machine review for the scientific record.

arxiv: 2604.10009 · v1 · submitted 2026-04-11 · 💻 cs.LG · cs.CV · cs.RO

Recognition: unknown

Towards Multi-Source Domain Generalization for Sleep Staging with Noisy Labels

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 16:45 UTC · model grok-4.3

classification 💻 cs.LG · cs.CV · cs.RO
keywords sleep staging · domain generalization · label noise · multi-source · EEG · multimodal · early learning regularization · robust learning

The pith

FF-TRUST achieves robust multi-source sleep staging by combining time-frequency early learning regularization with confidence-diversity terms to handle noisy labels.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tackles the joint problem of domain shifts across institutions and devices together with noisy annotations in multimodal sleep staging from signals such as EEG and EOG. Existing noisy-label methods lose performance when these two issues appear at the same time, prompting the creation of a dedicated benchmark called NL-DGSS. FF-TRUST counters this by enforcing consistency in both the time and frequency domains early in training while adding regularization that balances model and prediction diversity. A reader would care because accurate automated sleep staging could support clinical use even when training data come from varied sources and carry imperfect labels.

Core claim

FF-TRUST is a domain-invariant multimodal framework that applies Joint Time-Frequency Early Learning Regularization (JTF-ELR) together with confidence-diversity regularization. By exploiting temporal and spectral consistency, the method improves robustness to noisy supervision while preserving generalization across multiple data sources. Experiments on five public datasets confirm consistent state-of-the-art results under both symmetric and asymmetric label noise.

What carries the argument

Joint Time-Frequency Early Learning Regularization (JTF-ELR) inside the FF-TRUST framework, which enforces consistency across temporal and spectral views of the signal to separate reliable patterns from label noise.
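The mechanism can be made concrete. Early learning regularization (Liu et al., NeurIPS 2020) keeps an exponential moving average (EMA) of the model's own predictions as targets; because clean labels dominate early training, the EMA encodes reliable patterns before noise is memorized. Below is a minimal illustrative sketch of that idea extended to two views of the same epoch, in the spirit of JTF-ELR; the averaged two-view fusion, EMA momentum, and weighting here are assumptions for illustration, not the authors' exact formulation.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

class JointELR:
    """Illustrative two-view early-learning regularizer.

    Keeps an EMA of the model's own predictions as targets; clean
    labels dominate early training, so the EMA captures reliable
    structure before noisy labels are fit.
    """

    def __init__(self, num_samples, num_classes, momentum=0.7, lam=3.0):
        self.t = np.full((num_samples, num_classes), 1.0 / num_classes)
        self.momentum = momentum
        self.lam = lam

    def loss(self, idx, logits_time, logits_freq, labels):
        # Average predictions from the temporal and spectral views
        # (assumed fusion; the paper may combine the views differently).
        p = 0.5 * (softmax(logits_time) + softmax(logits_freq))
        # EMA target update: early-learned structure persists here.
        self.t[idx] = self.momentum * self.t[idx] + (1 - self.momentum) * p
        # Cross-entropy on the (possibly noisy) observed labels.
        ce = -np.log(p[np.arange(len(labels)), labels] + 1e-12).mean()
        # ELR penalty: minimizing log(1 - <p, t>) pulls predictions
        # toward the EMA targets, resisting memorization of flipped labels.
        inner = np.clip((self.t[idx] * p).sum(axis=1), 1e-6, 1 - 1e-6)
        return ce + self.lam * np.log(1.0 - inner).mean()
```

Minimizing the second term drives the inner product between current predictions and the EMA targets toward one, so late-stage gradient updates that would fit flipped labels are penalized.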

If this is right

  • Existing noisy-label learning methods degrade when domain shifts and label noise coexist.
  • The NL-DGSS benchmark exposes these limitations across multiple public sleep datasets.
  • FF-TRUST delivers consistent state-of-the-art performance under diverse symmetric and asymmetric noise settings.
  • Making the benchmark and code public enables direct comparison of future methods on the same noisy multi-source sleep staging task.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same regularization approach could be tested on other noisy multimodal physiological signals such as ECG or EMG to check transferability.
  • Performance on subgroups defined by age, pathology, or recording hardware not represented in the five datasets would reveal hidden biases.
  • Combining JTF-ELR with explicit domain-adversarial losses might further strengthen invariance when distribution shifts are extreme.

Load-bearing premise

The joint time-frequency regularization and confidence-diversity terms will continue to separate signal from noise without creating new biases when domain shifts and label noise interact in combinations beyond the tested symmetric and asymmetric cases.

What would settle it

A controlled test introducing an unseen combination of domain shift and noise type where FF-TRUST accuracy falls below a simpler baseline or where the model begins to fit the injected noise patterns.

Figures

Figures reproduced from arXiv: 2604.10009 by Di Wen, Jiale Wei, Junwei Zheng, Kailun Yang, Kening Wang, Kunyu Peng, Rainer Stiefelhagen, Ruiping Liu, Yufan Chen.

Figure 1. Overview of the task setting.
Figure 2. Overview of the FF-TRUST architecture, comprising domain-invariant sleep staging feature learning and the proposed JTF-ELR.
Figure 3. Comparison of hypnograms generated by the baseline model.
Figure 4. t-SNE visualizations of feature distributions.
Figure 5. Comparison of training accuracy (a) and MF1 score (b).
Figure 6. Comparison of hypnograms (true vs. predicted sleep stages over time): (a) DISC, (b) FF-TRUST.
Figure 8. Comparison of hypnograms generated by the baseline model.
Figure 9. Comparison of hypnograms generated by the baseline model.
Original abstract

Automatic sleep staging is a multimodal learning problem involving heterogeneous physiological signals such as EEG and EOG, which often suffer from domain shifts across institutions, devices, and populations. In practice, these data are also affected by noisy annotations, yet label-noise-robust multi-source domain generalization remains underexplored. We present the first benchmark for Noisy Labels in Multi-Source Domain-Generalized Sleep Staging (NL-DGSS) and show that existing noisy-label learning methods degrade substantially when domain shifts and label noise coexist. To address this challenge, we propose FF-TRUST, a domain-invariant multimodal sleep staging framework with Joint Time-Frequency Early Learning Regularization (JTF-ELR). By jointly exploiting temporal and spectral consistency together with confidence-diversity regularization, FF-TRUST improves robustness under noisy supervision. Experiments on five public datasets demonstrate consistent state-of-the-art performance under diverse symmetric and asymmetric noise settings. The benchmark and code will be made publicly available at https://github.com/KNWang970918/FF-TRUST.git.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it: the pith above is the substance; this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces the first benchmark for Noisy Labels in Multi-Source Domain-Generalized Sleep Staging (NL-DGSS) and proposes FF-TRUST, a multimodal framework using Joint Time-Frequency Early Learning Regularization (JTF-ELR) together with confidence-diversity regularization to achieve domain-invariant sleep staging under label noise. Experiments on five public datasets are reported to yield consistent state-of-the-art performance under symmetric and asymmetric noise.

Significance. If the central claims hold after addressing validation gaps, the work would be significant for an important practical domain: it provides the first dedicated benchmark and method at the intersection of multi-source domain generalization and label-noise robustness for sleep staging, where both domain shifts and annotation noise are prevalent in EEG/EOG data across institutions.

major comments (2)
  1. Abstract: the claim of 'consistent state-of-the-art performance' and 'improves robustness' is presented without any quantitative tables, ablation results, or statistical tests, preventing verification of effect sizes or sensitivity to hyper-parameters and splits.
  2. Experiments section: evaluation is restricted to synthetic symmetric and asymmetric random-flip noise models. This does not test whether JTF-ELR plus confidence-diversity regularization separates signal from structured, non-random annotation biases (e.g., scorer-specific over-scoring or device artifacts) that commonly co-vary with the domain shifts the method targets.

minor comments (1)
  1. The abstract states that benchmark and code will be released at the given GitHub link; the manuscript should explicitly confirm that the released materials include the exact noise-generation scripts and cross-domain splits used in the reported experiments.
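The symmetric and asymmetric random-flip settings the referee refers to can be sketched in a few lines. The asymmetric confusion targets below (e.g. N2↔N3, N1→REM) are illustrative assumptions based on stages human scorers commonly confuse; the benchmark's actual noise-generation scripts are not reproduced in the abstract.

```python
import numpy as np

STAGES = ["Wake", "N1", "N2", "N3", "REM"]
# Assumed asymmetric confusion targets (Wake->N1, N1->REM, N2<->N3,
# REM->N1); NL-DGSS's exact pairs are not given in the abstract.
ASYM_TARGET = np.array([1, 4, 3, 2, 1])

def inject_noise(labels, rate, mode="symmetric", seed=0):
    """Flip each label with probability `rate` under the chosen model."""
    rng = np.random.default_rng(seed)
    noisy = np.asarray(labels).copy()
    flip = rng.random(len(noisy)) < rate
    if mode == "symmetric":
        # Replace each flipped label with a uniformly random *other* stage.
        offsets = rng.integers(1, len(STAGES), size=len(noisy))
        noisy[flip] = (noisy[flip] + offsets[flip]) % len(STAGES)
    else:
        # Asymmetric: flip only to a fixed, plausible confusion target.
        noisy[flip] = ASYM_TARGET[noisy[flip]]
    return noisy
```

Structured, non-random annotation biases of the kind the referee raises (scorer-specific over-scoring, device artifacts) would correlate the flip probability with the recording domain rather than drawing it i.i.d. as above, which is exactly what this synthetic scheme does not capture.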

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and have made revisions to strengthen the presentation and discussion of our results.

Point-by-point responses
  1. Referee: Abstract: the claim of 'consistent state-of-the-art performance' and 'improves robustness' is presented without any quantitative tables, ablation results, or statistical tests, preventing verification of effect sizes or sensitivity to hyper-parameters and splits.

    Authors: The abstract provides a high-level summary of the contributions and findings. All quantitative support, including performance tables under multiple noise ratios, ablation studies isolating JTF-ELR and confidence-diversity regularization, and statistical significance tests, is contained in the Experiments section. To improve verifiability from the abstract alone, we have added a sentence directing readers to the specific tables and figures that substantiate the claims. revision: yes

  2. Referee: Experiments section: evaluation is restricted to synthetic symmetric and asymmetric random-flip noise models. This does not test whether JTF-ELR plus confidence-diversity regularization separates signal from structured, non-random annotation biases (e.g., scorer-specific over-scoring or device artifacts) that commonly co-vary with the domain shifts the method targets.

    Authors: We agree that structured, non-random biases are common in sleep staging annotations. Our benchmark deliberately employs controlled synthetic noise to isolate the combined effects of domain shift and label noise, which is the novel aspect of NL-DGSS. In the revised manuscript we have expanded the Discussion and Limitations sections to explicitly acknowledge this scope, provide qualitative analysis of how the joint time-frequency consistency and confidence-diversity terms may help against correlated biases, and identify real-world structured noise as an important direction for future work. revision: partial

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper introduces FF-TRUST, a framework using Joint Time-Frequency Early Learning Regularization (JTF-ELR) and confidence-diversity regularization for noisy-label multi-source domain generalization in sleep staging. It establishes a new benchmark (NL-DGSS) and reports empirical results on five public datasets under synthetic symmetric/asymmetric noise. No derivation chain, equations, or first-principles results are present that reduce any claimed performance gain to a quantity defined by the method's own inputs or fitted parameters. The approach extends prior early-learning ideas without self-definitional loops, fitted-input predictions, or load-bearing self-citations that collapse the central empirical claim. The result is self-contained against external benchmarks and does not exhibit any of the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no explicit free parameters, axioms, or invented entities are described in the provided text.

pith-pipeline@v0.9.0 · 5507 in / 1082 out tokens · 32033 ms · 2026-05-10T16:45:46.874200+00:00 · methodology


Reference graph

Works this paper leans on

47 extracted references · 5 canonical work pages
