Lightweight Cross-Device Sleep Tracking on the WeBe Wearable Platform
Pith reviewed 2026-05-19 18:15 UTC · model grok-4.3
The pith
A simple pipeline on raw accelerometer signals tracks sleep across wearables with a mean Total Sleep Time error of 27 to 42 minutes
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central discovery is that converting raw accelerometer signals into epoch-level activity features, followed by temporal smoothing and normalized scoring, allows accurate sleep versus wake classification via a single globally calibrated threshold. On the MMASH dataset this yields a mean absolute error of 41.6 minutes in Total Sleep Time along with onset and offset errors of 6.3 and 7.4 minutes. On real-world data collected with the WeBe platform from three participants over five sessions the corresponding errors are 27.4, 13.9 and 8.0 minutes, outperforming a commercial ActiGraph pipeline relative to ground truth.
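The three error figures quoted above can be made concrete with a short sketch. It assumes 30-second epochs, binary per-epoch labels (1 = sleep), onset defined as the start of the first sleep epoch, and offset as the end of the last sleep epoch. These definitions and the function names `sleep_summary` and `mean_abs_errors` are illustrative assumptions, not the paper's stated procedure.

```python
from statistics import mean

EPOCH_MIN = 0.5  # assumed epoch length: 30 seconds


def sleep_summary(labels, epoch_min=EPOCH_MIN):
    """Return (TST, onset, offset) in minutes for a 0/1 label list (1 = sleep)."""
    sleep_idx = [i for i, s in enumerate(labels) if s == 1]
    if not sleep_idx:
        return 0.0, None, None
    tst = len(sleep_idx) * epoch_min             # total minutes labeled as sleep
    onset = sleep_idx[0] * epoch_min             # start of the first sleep epoch
    offset = (sleep_idx[-1] + 1) * epoch_min     # end of the last sleep epoch
    return tst, onset, offset


def mean_abs_errors(predicted_sessions, reference_sessions):
    """Mean absolute error of TST, onset, and offset across sessions."""
    errs = {"tst": [], "onset": [], "offset": []}
    for pred, ref in zip(predicted_sessions, reference_sessions):
        p, r = sleep_summary(pred), sleep_summary(ref)
        for key, pv, rv in zip(errs, p, r):
            # Skip onset/offset when either side contains no sleep at all.
            if pv is not None and rv is not None:
                errs[key].append(abs(pv - rv))
    return {k: mean(v) for k, v in errs.items() if v}


# Toy session: the prediction is shifted one epoch earlier than the reference.
reference = [0, 0, 1, 1, 1, 1, 0, 0]
predicted = [0, 1, 1, 1, 1, 0, 0, 0]
print(mean_abs_errors([predicted], [reference]))
```

In the toy session the two label sequences contain the same amount of sleep, so the TST error is zero even though onset and offset are each off by one epoch. This is why the three metrics are reported separately.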
What carries the argument
Epoch-level activity features derived from raw accelerometer signals, processed by temporal smoothing, normalized scoring, and classification with a globally calibrated threshold.
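A minimal sketch of the last three stages (smoothing, normalized scoring, thresholding) may help fix ideas. It starts from per-epoch activity values that are already computed. The centered moving average, its window length, the threshold value of 0.0, and the rule that scores below the threshold count as sleep are all assumptions made for illustration. The per-session z-score follows the normalization described in the author rebuttal further down.

```python
from statistics import mean, pstdev


def smooth(values, window=5):
    """Centered moving average; the window shrinks at the sequence edges."""
    half = window // 2
    out = []
    for i in range(len(values)):
        lo, hi = max(0, i - half), min(len(values), i + half + 1)
        out.append(mean(values[lo:hi]))
    return out


def normalize(values):
    """Z-score the values within a single recording session."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]  # flat signal: no information to scale
    return [(v - mu) / sigma for v in values]


def classify(activity, threshold=0.0, window=5):
    """Label each epoch 1 (sleep) if its normalized score is below threshold."""
    scores = normalize(smooth(activity, window))
    return [1 if s < threshold else 0 for s in scores]


# Toy activity trace: active, then still, then active again.
activity = [9, 8, 9, 0, 0, 0, 0, 0, 0, 9, 8, 9]
print(classify(activity, threshold=0.0, window=3))
```

Because the scores are normalized per session, the threshold is expressed in standard deviations rather than in device-specific units, which is the property the single-threshold argument relies on.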
If this is right
- Open-source sleep tracking becomes feasible without relying on closed commercial algorithms.
- Consistent performance across different wearable hardware reduces the need for per-device model retraining.
- Low computational demands support deployment on battery-constrained devices for continuous monitoring.
- Baseline errors provide a reference point for improving or comparing future sleep analysis methods.
Where Pith is reading between the lines
- Extending the same scoring approach to other physiological signals could broaden its use in health wearables.
- Validating on larger cohorts would strengthen claims of generalizability to diverse populations.
- Real-time implementation on the device itself could enable immediate feedback on sleep quality.
Load-bearing premise
Normalizing the activity scores allows a single threshold to work reliably for sleep and wake detection no matter the specific wearable device or the user's daily routine.
What would settle it
If a new experiment with a different wearable sensor type and a larger group of participants produced a mean absolute error in total sleep time above 60 minutes, that would indicate that the global threshold does not generalize as claimed.
Original abstract
Wearable devices are widely used for continuous health monitoring, yet reliable sleep tracking on emerging platforms remains underexplored due to reliance on proprietary algorithms and device-specific activity representations. We present a lightweight and reproducible sleep tracking pipeline that operates directly on raw accelerometer signals. The method converts data into epoch-level activity features, applies temporal smoothing and normalized scoring, and performs sleep/wake classification using a globally calibrated threshold. We calibrate the model on the Multilevel Monitoring of Activity and Sleep in Healthy People (MMASH) dataset and evaluate it in a cross-device study using the WeBe wearable platform and a commercial ActiGraph device. On MMASH, the method achieves a mean absolute error of 41.6 minutes in Total Sleep Time (TST), with onset and offset errors of 6.3 and 7.4 minutes. On real-world WeBe data from three participants across five sessions, it achieves a mean TST error of 27.4 minutes and onset and offset errors of 13.9 and 8.0 minutes. In contrast, a commercial ActiGraph pipeline shows larger discrepancies relative to ground truth. These results demonstrate accurate and generalizable sleep tracking using a simple and reproducible pipeline.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents a lightweight, reproducible pipeline for sleep/wake classification from raw accelerometer signals on wearable devices. Raw data are converted to epoch-level activity features, followed by temporal smoothing, normalized scoring, and binary classification via a single globally calibrated threshold. The threshold is fit on the MMASH dataset (yielding 41.6 min TST MAE, 6.3 min onset error, 7.4 min offset error) and evaluated on WeBe platform recordings from three participants across five sessions (27.4 min TST MAE, 13.9 min onset, 8.0 min offset), with comparisons showing smaller discrepancies than a commercial ActiGraph pipeline.
Significance. If the central results hold under larger-scale validation, the work offers a simple, device-agnostic alternative to proprietary sleep algorithms that could facilitate sleep monitoring on emerging or low-cost wearables. The concrete error metrics, explicit cross-device comparison, and emphasis on reproducibility constitute clear strengths that would support broader adoption if the generalizability concerns are addressed.
major comments (2)
- [Real-world WeBe evaluation] The headline claim of accurate and generalizable cross-device sleep tracking with a single globally calibrated threshold depends on the WeBe results (27.4 min TST MAE). However, these rest on data from only three participants across five sessions. Such limited N cannot establish robust transfer across hardware, populations, or real-world conditions without retraining; participant-specific movement or sleep patterns could inflate apparent performance. MMASH calibration does not compensate for the tiny real-world sample when asserting cross-device robustness.
- [Methods / Abstract] The abstract and methods provide no details on exact feature definitions, the normalization procedure, or validation against potential confounds such as varying sensor placement or participant demographics. Without these, it is difficult to assess whether the normalized scoring step embeds fitted parameters whose independence from the final performance numbers is guaranteed.
minor comments (2)
- [Abstract] The abstract could explicitly state the number of participants and sessions in the WeBe evaluation to allow readers to immediately contextualize the generalizability claims.
- [Discussion] Consider adding an explicit limitations paragraph that directly addresses the small real-world sample size and its implications for the cross-device claims.
Simulated Author's Rebuttal
We thank the referee for the constructive comments, which have helped us improve the clarity and balance of the manuscript. We address each major comment below.
- Referee: [Real-world WeBe evaluation] The headline claim of accurate and generalizable cross-device sleep tracking with a single globally calibrated threshold depends on the WeBe results (27.4 min TST MAE). However, these rest on data from only three participants across five sessions. Such limited N cannot establish robust transfer across hardware, populations, or real-world conditions without retraining; participant-specific movement or sleep patterns could inflate apparent performance. MMASH calibration does not compensate for the tiny real-world sample when asserting cross-device robustness.
  Authors: We agree that the small sample size (three participants, five sessions) in the WeBe evaluation is a genuine limitation for strong claims of generalizability. The WeBe recordings represent an initial real-world cross-device test rather than a large-scale validation study. In the revised manuscript we have added an explicit Limitations section that qualifies the cross-device results, notes the preliminary nature of the transfer demonstration, and states that larger cohorts will be needed to confirm robustness across populations and hardware variations. We retain the observation that the globally calibrated threshold (fit only on MMASH) was applied without retraining, but we no longer frame the WeBe numbers as definitive proof of broad generalizability.
  Revision: yes
- Referee: [Methods / Abstract] The abstract and methods provide no details on exact feature definitions, the normalization procedure, or validation against potential confounds such as varying sensor placement or participant demographics. Without these, it is difficult to assess whether the normalized scoring step embeds fitted parameters whose independence from the final performance numbers is guaranteed.
  Authors: We have expanded the Methods section with precise definitions of the epoch-level features (vector magnitude per 30-second epoch, activity counts, and zero-crossing rate), the exact normalization formula (per-session z-score of the activity feature), and a new paragraph discussing potential confounds including wrist placement variability and the demographic characteristics of the MMASH and WeBe cohorts. The revised text explicitly states that no WeBe data were used in threshold calibration or normalization parameter fitting, confirming that the reported performance reflects transfer of a fixed, globally determined threshold.
  Revision: yes
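The epoch-level features named in this response can be sketched from raw tri-axial samples as follows. This is an illustration under stated assumptions, not the authors' definitions: samples are taken to be in units of g, the "activity" value is a simple stand-in (mean absolute deviation of the vector magnitude within the epoch) rather than any vendor's activity-count algorithm, and the zero-crossing rate is computed on the mean-centered vector magnitude. The function name `epoch_features` is hypothetical.

```python
import math


def epoch_features(samples, fs_hz, epoch_s=30):
    """Compute per-epoch features from a list of (x, y, z) samples in g.

    Only complete epochs are used; trailing samples are dropped.
    """
    n = int(fs_hz * epoch_s)  # samples per epoch (must be at least 2)
    feats = []
    for start in range(0, len(samples) - n + 1, n):
        window = samples[start:start + n]
        vm = [math.sqrt(x * x + y * y + z * z) for x, y, z in window]
        mean_vm = sum(vm) / n
        centered = [v - mean_vm for v in vm]
        # Assumed activity proxy: mean absolute deviation of vector magnitude.
        activity = sum(abs(c) for c in centered) / n
        # Fraction of adjacent sample pairs whose centered values change sign.
        crossings = sum(1 for a, b in zip(centered, centered[1:]) if a * b < 0)
        feats.append({
            "mean_vm": mean_vm,
            "activity": activity,
            "zcr": crossings / (n - 1),
        })
    return feats


# Toy input at 2 Hz with 2-second epochs: one still epoch, one moving epoch.
still = [(0, 0, 1)] * 4
moving = [(0, 0, 1), (0, 0, 3), (0, 0, 1), (0, 0, 3)]
for f in epoch_features(still + moving, fs_hz=2, epoch_s=2):
    print(f)
```

The still epoch yields zero activity and zero crossings, while the oscillating epoch yields non-zero values for both, which is the contrast a sleep/wake threshold would act on.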
Circularity Check
No circularity: calibration on MMASH and independent evaluation on WeBe data
The pipeline converts raw accelerometer signals to epoch-level features, applies temporal smoothing and normalized scoring, then classifies sleep/wake via a single globally calibrated threshold. The threshold is fitted on the MMASH dataset and the resulting model is evaluated on separate real-world WeBe sessions from three participants. No equations or steps reduce the reported TST MAE, onset/offset errors, or cross-device comparison to the calibration inputs by construction. The WeBe results constitute an out-of-sample test rather than a self-referential prediction, and no self-citation chain or ansatz smuggling is present in the derivation.
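The separation between calibration and evaluation described here can be illustrated with a small sketch. It assumes the threshold is chosen by a grid search that minimizes TST mean absolute error on the calibration sessions. The source does not state the calibration objective, so this choice, the candidate grid, and the function names are hypothetical. The held-out scores are synthetic toy values.

```python
def tst_minutes(scores, threshold, epoch_min=0.5):
    """Total sleep time: epochs scoring below threshold, in minutes."""
    return sum(1 for s in scores if s < threshold) * epoch_min


def calibrate_threshold(sessions, candidates, epoch_min=0.5):
    """Pick the candidate threshold with the lowest TST MAE.

    sessions: list of (normalized_scores, reference_tst_minutes) pairs.
    Returns (best_threshold, mae_on_calibration_sessions).
    """
    best = None
    for th in candidates:
        mae = sum(
            abs(tst_minutes(scores, th, epoch_min) - ref)
            for scores, ref in sessions
        ) / len(sessions)
        if best is None or mae < best[1]:
            best = (th, mae)
    return best


# Calibration set (stands in for the calibration dataset; toy values).
calibration = [
    ([-1.0, -0.5, 0.2, 1.0], 1.0),
    ([-2.0, -0.1, 0.4, 0.9], 1.0),
]
threshold, calib_mae = calibrate_threshold(calibration, [-1.5, -0.75, 0.0, 0.5])

# Held-out session (stands in for a different device): the threshold is
# applied as-is, with no refitting on this data.
held_out_scores = [-0.3, 0.1, -0.2, 0.8]
print(threshold, calib_mae, tst_minutes(held_out_scores, threshold))
```

The point of the structure is that `calibrate_threshold` never sees the held-out session, so any error measured there is an out-of-sample estimate rather than a fit.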
Axiom & Free-Parameter Ledger
free parameters (1)
- globally calibrated sleep/wake threshold
axioms (1)
- Domain assumption: Raw accelerometer signals can be reliably converted into epoch-level activity features that distinguish sleep from wake.