Recognition: unknown
Towards Multi-Source Domain Generalization for Sleep Staging with Noisy Labels
Pith reviewed 2026-05-10 16:45 UTC · model grok-4.3
The pith
FF-TRUST achieves robust multi-source sleep staging by combining time-frequency early learning regularization with confidence-diversity terms to handle noisy labels.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
FF-TRUST is a domain-invariant multimodal framework that applies Joint Time-Frequency Early Learning Regularization (JTF-ELR) together with confidence-diversity regularization. By exploiting temporal and spectral consistency, the method improves robustness to noisy supervision while preserving generalization across multiple data sources. Experiments on five public datasets confirm consistent state-of-the-art results under both symmetric and asymmetric label noise.
What carries the argument
Joint Time-Frequency Early Learning Regularization (JTF-ELR) inside the FF-TRUST framework, which enforces consistency across temporal and spectral views of the signal to separate reliable patterns from label noise.
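The abstract does not spell out the JTF-ELR objective, but the early-learning regularizer it extends (Liu et al., NeurIPS 2020, reference [20]) is well documented: keep a momentum average of each sample's softmax prediction and penalize the model for drifting away from those early, pre-memorization targets. A minimal NumPy sketch, assuming the joint time-frequency term simply averages the temporal and spectral softmax outputs (that fusion rule is an illustrative assumption, not taken from the paper):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

class JointELR:
    """ELR penalty over two views of the same epoch (temporal + spectral)."""

    def __init__(self, n_samples, n_classes, beta=0.7, lam=3.0):
        # one running target row per training sample
        self.t = np.zeros((n_samples, n_classes))
        self.beta, self.lam = beta, lam

    def penalty(self, idx, logits_time, logits_freq):
        # fuse the two views into one prediction (illustrative choice)
        p = 0.5 * (softmax(logits_time) + softmax(logits_freq))
        # momentum update of the early-learning targets
        self.t[idx] = self.beta * self.t[idx] + (1 - self.beta) * p
        # ELR term: lam * mean_i log(1 - <p_i, t_i>); minimizing it pulls
        # predictions toward the early targets instead of the noisy labels
        inner = np.sum(p * self.t[idx], axis=1)
        return self.lam * np.mean(np.log(1.0 - inner + 1e-12))
```

The penalty is added to the usual cross-entropy loss each step; because `log(1 - <p, t>)` decreases as the prediction aligns with the stored target, gradient descent resists memorizing flipped labels.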
If this is right
- Existing noisy-label learning methods degrade when domain shifts and label noise coexist.
- The NL-DGSS benchmark exposes these limitations across multiple public sleep datasets.
- FF-TRUST delivers consistent state-of-the-art performance under diverse symmetric and asymmetric noise settings.
- Making the benchmark and code public enables direct comparison of future methods on the same noisy multi-source sleep staging task.
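The review does not list the evaluation metrics, but sleep staging results are conventionally reported with overall accuracy, macro-F1, and Cohen's kappa (chance-corrected agreement with the human scorer). A self-contained sketch of the kappa computation, assuming the five AASM stages Wake/N1/N2/N3/REM:

```python
import numpy as np

def cohen_kappa(y_true, y_pred, n_classes=5):
    """Chance-corrected agreement between scorer and model."""
    cm = np.zeros((n_classes, n_classes))
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    n = cm.sum()
    p_obs = np.trace(cm) / n                                  # observed agreement
    p_exp = (cm.sum(axis=0) * cm.sum(axis=1)).sum() / n ** 2  # chance agreement
    return (p_obs - p_exp) / (1 - p_exp)
```

Perfect agreement yields 1.0, chance-level agreement yields 0.0, which is why kappa is preferred over raw accuracy when stage distributions are imbalanced (N2 dominates most nights).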
Where Pith is reading between the lines
- The same regularization approach could be tested on other noisy multimodal physiological signals such as ECG or EMG to check transferability.
- Performance on subgroups defined by age, pathology, or recording hardware not represented in the five datasets would reveal hidden biases.
- Combining JTF-ELR with explicit domain-adversarial losses might further strengthen invariance when distribution shifts are extreme.
Load-bearing premise
The joint time-frequency regularization and confidence-diversity terms will continue to separate signal from noise without creating new biases when domain shifts and label noise interact in combinations beyond the tested symmetric and asymmetric cases.
What would settle it
A controlled test introducing an unseen combination of domain shift and noise type where FF-TRUST accuracy falls below a simpler baseline or where the model begins to fit the injected noise patterns.
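The symmetric and asymmetric noise models such a test would extend have standard constructions: symmetric noise flips a label to a uniformly random other class with probability eta, while asymmetric noise flips it to a fixed confusable class. A sketch under those conventions (the stage-confusion map below is an illustrative assumption, not the paper's):

```python
import numpy as np

# illustrative AASM stage order: 0=Wake, 1=N1, 2=N2, 3=N3, 4=REM
# hypothetical asymmetric confusion map: each stage -> one confusable stage
ASYM_MAP = np.array([1, 4, 3, 2, 1])  # Wake->N1, N1->REM, N2->N3, N3->N2, REM->N1

def inject_noise(labels, eta, mode="symmetric", n_classes=5, seed=0):
    rng = np.random.default_rng(seed)
    noisy = labels.copy()
    flip = rng.random(labels.shape) < eta
    if mode == "symmetric":
        # shift by a random offset in 1..n_classes-1: always a *different* class
        offsets = rng.integers(1, n_classes, size=labels.shape)
        noisy[flip] = (noisy[flip] + offsets[flip]) % n_classes
    else:
        noisy[flip] = ASYM_MAP[noisy[flip]]
    return noisy
```

The settling experiment would then pair one noise mode with a held-out source domain the model never saw during training and check whether FF-TRUST's accuracy still beats a plain cross-entropy baseline.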
Original abstract
Automatic sleep staging is a multimodal learning problem involving heterogeneous physiological signals such as EEG and EOG, which often suffer from domain shifts across institutions, devices, and populations. In practice, these data are also affected by noisy annotations, yet label-noise-robust multi-source domain generalization remains underexplored. We present the first benchmark for Noisy Labels in Multi-Source Domain-Generalized Sleep Staging (NL-DGSS) and show that existing noisy-label learning methods degrade substantially when domain shifts and label noise coexist. To address this challenge, we propose FF-TRUST, a domain-invariant multimodal sleep staging framework with Joint Time-Frequency Early Learning Regularization (JTF-ELR). By jointly exploiting temporal and spectral consistency together with confidence-diversity regularization, FF-TRUST improves robustness under noisy supervision. Experiments on five public datasets demonstrate consistent state-of-the-art performance under diverse symmetric and asymmetric noise settings. The benchmark and code will be made publicly available at https://github.com/KNWang970918/FF-TRUST.git.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces the first benchmark for Noisy Labels in Multi-Source Domain-Generalized Sleep Staging (NL-DGSS) and proposes FF-TRUST, a multimodal framework using Joint Time-Frequency Early Learning Regularization (JTF-ELR) together with confidence-diversity regularization to achieve domain-invariant sleep staging under label noise. Experiments on five public datasets are reported to yield consistent state-of-the-art performance under symmetric and asymmetric noise.
Significance. If the central claims hold after addressing validation gaps, the work would be significant for an important practical domain: it provides the first dedicated benchmark and method at the intersection of multi-source domain generalization and label-noise robustness for sleep staging, where both domain shifts and annotation noise are prevalent in EEG/EOG data across institutions.
major comments (2)
- Abstract: the claim of 'consistent state-of-the-art performance' and 'improves robustness' is presented without any quantitative tables, ablation results, or statistical tests, preventing verification of effect sizes or sensitivity to hyper-parameters and splits.
- Experiments section: evaluation is restricted to synthetic symmetric and asymmetric random-flip noise models. This does not test whether JTF-ELR plus confidence-diversity regularization separates signal from structured, non-random annotation biases (e.g., scorer-specific over-scoring or device artifacts) that commonly co-vary with the domain shifts the method targets.
minor comments (1)
- The abstract states that benchmark and code will be released at the given GitHub link; the manuscript should explicitly confirm that the released materials include the exact noise-generation scripts and cross-domain splits used in the reported experiments.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address each major comment below and have made revisions to strengthen the presentation and discussion of our results.
Point-by-point responses
- Referee: Abstract: the claim of 'consistent state-of-the-art performance' and 'improves robustness' is presented without any quantitative tables, ablation results, or statistical tests, preventing verification of effect sizes or sensitivity to hyper-parameters and splits.
  Authors: The abstract provides a high-level summary of the contributions and findings. All quantitative support—including performance tables under multiple noise ratios, ablation studies isolating JTF-ELR and confidence-diversity regularization, and statistical significance tests—is contained in the Experiments section. To improve verifiability from the abstract alone, we have added a sentence directing readers to the specific tables and figures that substantiate the claims. revision: yes
- Referee: Experiments section: evaluation is restricted to synthetic symmetric and asymmetric random-flip noise models. This does not test whether JTF-ELR plus confidence-diversity regularization separates signal from structured, non-random annotation biases (e.g., scorer-specific over-scoring or device artifacts) that commonly co-vary with the domain shifts the method targets.
  Authors: We agree that structured, non-random biases are common in sleep staging annotations. Our benchmark deliberately employs controlled synthetic noise to isolate the combined effects of domain shift and label noise, which is the novel aspect of NL-DGSS. In the revised manuscript we have expanded the Discussion and Limitations sections to explicitly acknowledge this scope, provide qualitative analysis of how the joint time-frequency consistency and confidence-diversity terms may help against correlated biases, and identify real-world structured noise as an important direction for future work. revision: partial
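Structured noise of the kind the referee describes can be simulated by making the flip probability depend on the source domain, for example one "scorer" who systematically over-scores one stage as another. A hypothetical sketch (the stage pair and flip rate are illustrative assumptions, not from the paper):

```python
import numpy as np

def scorer_bias_noise(labels, domains, biased_domain, src=2, dst=1, p=0.5, seed=0):
    """Flip stage `src` to `dst` with probability p, but only for epochs
    from one scorer/domain — noise correlated with the domain shift."""
    rng = np.random.default_rng(seed)
    noisy = labels.copy()
    mask = (domains == biased_domain) & (labels == src) & (rng.random(labels.shape) < p)
    noisy[mask] = dst
    return noisy
```

Unlike random-flip noise, this corruption is predictable from the domain identity, so a method that learns domain-invariant features could in principle absorb the bias into its domain representation rather than reject it as noise — exactly the failure mode the referee asks about.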
Circularity Check
No significant circularity detected
Full rationale
The paper introduces FF-TRUST, a framework using Joint Time-Frequency Early Learning Regularization (JTF-ELR) and confidence-diversity regularization for noisy-label multi-source domain generalization in sleep staging. It establishes a new benchmark (NL-DGSS) and reports empirical results on five public datasets under synthetic symmetric/asymmetric noise. No derivation chain, equations, or first-principles results are present that reduce any claimed performance gain to a quantity defined by the method's own inputs or fitted parameters. The approach extends prior early-learning ideas without self-definitional loops, fitted-input predictions, or load-bearing self-citations that collapse the central empirical claim. The result is self-contained against external benchmarks and does not exhibit any of the enumerated circularity patterns.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
- [1] Aboalayon, K. A. I., Faezipour, M., Almuhammadi, W. S., and Moslehpour, S. Sleep stage classification using EEG signal analysis: A comprehensive survey and new investigation. Entropy (2016)
- [2] Alvarez-Estevez, D., and Rijsman, R. M. Inter-database validation of a deep learning approach for automatic sleep scoring. PLoS ONE (2021)
- [3] Baglioni, C., Spiegelhalder, K., Nissen, C., and Riemann, D. Clinical implications of the causal relationship between insomnia and depression: how individually tailored treatment of sleeping difficulties could prevent the onset of depression. EPMA Journal (2011)
- [4] Berry, R. B., Brooks, R., Gamaldo, C. E., Harding, S. M., Marcus, C., and Vaughn, B. V. The AASM manual for the scoring of sleep and associated events. Rules, Terminology and Technical Specifications. American Academy of Sleep Medicine, Darien, Illinois (2012)
- [5] Czeisler, C. A. Duration, timing and quality of sleep are each vital for health, performance and safety. Sleep Health: Journal of the National Sleep Foundation (2015)
- [6] Ghassemi, M. M., Moody, B. E., Lehman, L.-W. H., Song, C., Li, Q., Sun, H., Mark, R. G., Westover, M. B., and Clifford, G. D. You snooze, you win: the PhysioNet/Computing in Cardiology challenge 2018. In CinC (2018)
- [7] Goldberger, A. L., Amaral, L. A. N., Glass, L., Hausdorff, J. M., Ivanov, P. C., Mark, R. G., Mietus, J. E., Moody, G. B., Peng, C.-K., and Stanley, H. E. PhysioBank, PhysioToolkit, and PhysioNet: components of a new research resource for complex physiologic signals. Circulation (2000)
- [8] Gulrajani, I., and Lopez-Paz, D. In search of lost domain generalization. In ICLR (2021)
- [9] Han, B., Yao, Q., Yu, X., Niu, G., Xu, M., Hu, W., Tsang, I., and Sugiyama, M. Co-teaching: Robust training of deep neural networks with extremely noisy labels. In NeurIPS (2018)
- [10] Hjelm, R. D., Fedorov, A., Lavoie-Marchildon, S., Grewal, K., Bachman, P., Trischler, A., and Bengio, Y. Learning deep representations by mutual information estimation and maximization. arXiv preprint arXiv:1808.06670 (2018)
- [11] Hu, S., Liao, Z., Zhang, J., and Xia, Y. Domain and content adaptive convolution based multi-source domain generalization for medical image segmentation. TMI (2022)
- [12] Jia, Z., Lin, Y., Wang, J., Ning, X., He, Y., Zhou, R., Zhou, Y., and Li-wei, H. L. Multi-view spatial-temporal graph convolutional networks with domain generalization for sleep stage classification. TNSRE (2021)
- [13]
- [14] Kemp, B., Zwinderman, A. H., Tuk, B., Kamphuisen, H. A. C., and Oberye, J. J. L. Analysis of a sleep-dependent neuronal feedback loop: the slow-wave microcontinuity of the EEG. TBME (2000)
- [15] Khalighi, S., Sousa, T., Santos, J. M., and Nunes, U. ISRUC-Sleep: A comprehensive public dataset for sleep researchers. Computer Methods and Programs in Biomedicine (2016)
- [16] Lee, Y. J., Lee, J. Y., Cho, J. H., and Choi, J. H. Interrater reliability of sleep stage scoring: a meta-analysis. Journal of Clinical Sleep Medicine (2022)
- [17] Li, J., Socher, R., and Hoi, S. C. H. DivideMix: Learning with noisy labels as semi-supervised learning. In ICLR (2020)
- [18] Li, Y., Han, H., Shan, S., and Chen, X. DISC: Learning from noisy labels via dynamic instance-specific selection and correction. In CVPR (2023)
- [19] Lin, Z., Jiang, X., Zhang, K., Fan, C., and Liu, Y. FedDSHAR: A dual-strategy federated learning approach for human activity recognition amid noise label user. Future Generation Computer Systems (2025)
- [20] Liu, S., Niles-Weed, J., Razavian, N., and Fernandez-Granda, C. Early-learning regularization prevents memorization of noisy labels. In NeurIPS (2020)
- [21] Lv, F., Liang, J., Li, S., Zang, B., Liu, C. H., Wang, Z., and Liu, D. Causality inspired representation learning for domain generalization. In CVPR (2022)
- [22] Ma, S., Zhang, Y., Chen, Y., Xie, T., Song, S., and Jia, Z. Exploring structure incentive domain adversarial learning for generalizable sleep stage classification. TIST (2024)
- Ma, S., Zhang, Y., Zhang, Q., Chen, Y., Wang, H., and Jia, Z. SleepMG: Multi-modal generalizable sleep staging with inter-modal balance of classification and domain discriminati...
- [23] Ma, X., Huang, H., Wang, Y., Romano, S., Erfani, S., and Bailey, J. Normalized loss functions for deep learning with noisy labels. In ICML (2020)
- [24] Nagaraj, S., Gerych, W., Tonekaboni, S., Goldenberg, A., Ustun, B., and Hartvigsen, T. Learning under temporal label noise. In ICLR (2025)
- [25] Nasiri, S., Ganglberger, W., Sun, H., Thomas, R. J., and Westover, M. B. Exploiting labels from multiple experts in automated sleep scoring (2023)
- [26] Neckelmann, D., Mykletun, A., and Dahl, A. A. Chronic insomnia as a risk factor for developing anxiety and depression. Sleep (2007)
- [27] Patrini, G., Rozza, A., Krishna Menon, A., Nock, R., and Qu, L. Making deep neural networks robust to label noise: A loss correction approach. In CVPR (2017)
- [28]
- [29] Peng, K., Wen, D., Yang, K., Fu, J., Chen, Y., Liu, R., Wu, J., Zheng, J., Sarfraz, M. S., Gool, L. V., Paudel, D. P., and Stiefelhagen, R. EReLiFM: Evidential reliability-aware residual flow meta-learning for open-set domain generalization under noisy labels. arXiv preprint arXiv:2510.12687 (2025)
- [30] Peng, K., Wen, D., Yang, K., Luo, A., Chen, Y., Fu, J., Sarfraz, M. S., Roitberg, A., and Stiefelhagen, R. Advancing open-set domain generalization using evidential bi-level hardest domain scheduler. In NeurIPS (2024)
- [31] Perslev, M., Darkner, S., Kempfner, L., Nikolic, M., Jennum, P. J., and Igel, C. U-Sleep: resilient high-frequency sleep staging. NPJ Digital Medicine (2021)
- [32] Perslev, M., Jensen, M. H., Darkner, S., Jennum, P. J., and Igel, C. U-Time: A fully convolutional network for time series segmentation applied to sleep staging. In NeurIPS (2019)
- [33] Quan, S. F., Howard, B. V., Iber, C., Kiley, J. P., Nieto, F. J., O'Connor, G. T., Rapoport, D. M., Redline, S., Robbins, J., Samet, J. M., and Wahl, P. W. The Sleep Heart Health Study: Design, rationale, and methods. Sleep (1997)
- [34] Reed, S., Lee, H., Anguelov, D., Szegedy, C., Erhan, D., and Rabinovich, A. Training deep neural networks on noisy labels with bootstrapping. arXiv preprint arXiv:1412.6596 (2015)
- [35] Rossi, A. D., Metaldi, M., Bechny, M., Filchenko, I., Meer, J. v. d., Schmidt, M. H., Bassetti, C. L., Tzovara, A., Faraci, F. D., and Fiorillo, L. SLEEPYLAND: trust begins with fair evaluation of automatic sleep staging models. npj Digital Medicine (2025)
- [36] Sharma, S., and Kavuru, M. Sleep and metabolism: an overview. International Journal of Endocrinology (2010)
- [37] Sheng, M., Sun, Z., Cai, Z., Chen, T., Zhou, Y., and Yao, Y. Adaptive integration of partial label learning and negative learning for enhanced noisy label learning. In AAAI (2024)
- [38] Supratak, A., Dong, H., Wu, C., and Guo, Y. DeepSleepNet: A model for automatic sleep stage scoring based on raw single-channel EEG. TNSRE (2017)
- [39] Supratak, A., and Guo, Y. TinySleepNet: An efficient deep learning model for sleep stage scoring based on raw single-channel EEG. In EMBC (2020)
- [40] Van Gorp, H., van Gilst, M. M., Fonseca, P., Overeem, S., and van Sloun, R. J. Modeling the impact of inter-rater disagreement on sleep statistics using deep generative learning. IEEE Journal of Biomedical and Health Informatics (2023)
- Wang, J., Lan, C., Liu, C., Ouyang, Y., Qin, T., Lu, W., Chen, Y., Zeng, W., and Yu, P. Generalizing to unseen domain...
- [41] Wang, J., Liu, X., Zhou, X., Hu, G., Zhai, D., Jiang, J., and Ji, X. Joint asymmetric loss for learning with noisy labels. In ICCV (2025)
- [42] Wang, J., Zhao, S., Jiang, H., Li, S., Li, T., and Pan, G. Generalizable sleep staging via multi-level domain alignment. In AAAI (2024)
- [43] Wei, H., Feng, L., Chen, X., and An, B. Combating noisy labels by agreement: A joint training method with co-regularization. In CVPR (2020)
- [44] Wei, Y., and Han, Y. Multi-source collaborative gradient discrepancy minimization for federated domain generalization. In AAAI (2024)
- Wolpert, E. A. A manual of standardized terminology, techniques and scoring system for sleep stages of human subjects. Archives of General Psychiatry (1969)
- [45] Zhang, G.-Q., Cui, L., Mueller, R., Tao, S., Kim, M., Rueschman, M., Mariani, S., Mobley, D., and Redline, S. The National Sleep Research Resource: towards a sleep data commons. Journal of the American Medical Informatics Association (2018)
- [46] Zhang, Z., and Sabuncu, M. Generalized cross entropy loss for training deep neural networks with noisy labels. In NeurIPS (2018)
- [47] Zhou, K., Liu, Z., Qiao, Y., Xiang, T., and Loy, C. C. Domain generalization: A survey. TPAMI (2023)

Figure 6: Comparison of hypnograms (true vs. predicted sleep stages Wake/N1/N2/N3/REM over roughly 7 h) generated by (a) the DISC baseline and (b) FF-TRUST.