arxiv: 2605.18837 · v1 · pith:AAWXV46Onew · submitted 2026-05-13 · 💻 cs.LG · cs.AI · eess.SP

VCR: Learning Valid Contextual Representation for Incomplete Wearable Signals

Pith reviewed 2026-05-20 22:08 UTC · model grok-4.3

keywords wearable signals · modality missingness · self-supervised learning · orthogonal disentanglement · health monitoring · mixture of experts · sensor incompleteness

The pith

VCR learns valid contextual representations for incomplete wearable signals by disentangling shared and modality-specific features.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces VCR, a self-supervised framework designed to extract useful representations from multimodal wearable signals even when sensors are missing or incomplete. Existing approaches often try to reconstruct full missing signals, which risks creating details that cannot be inferred from the available data and harms overall robustness. VCR instead relies on an orthogonal tokenizer that rectifies latent manifolds and applies geometric projection to split each modality into shared semantic information and modality-specific residuals while keeping all original information intact. These tokens feed into a missing-aware mixture-of-experts backbone that adjusts to the current set of available sensors, and the training goal is restricted to reconstructing only the shared components for any missing modalities. This setup produces consistent gains in performance and robustness across health monitoring tasks under full, single-missing, and multiple-missing conditions compared with prior supervised and self-supervised methods.

Core claim

VCR employs an orthogonal tokenizer to enforce strict orthogonal disentanglement by rectifying latent manifolds and applying a geometric projection, separating each modality into shared semantics and modality-specific residuals. This design preserves complete information integrity while serving as a structural foundation for robust learning under modality missingness. The resulting tokens are processed by a missing-aware mixture-of-experts backbone that adapts to varying patterns of modality availability. By constraining the objective to reconstruct only the shared components of missing modalities, VCR effectively mitigates hallucinations of non-inferable modality-specific details and yields

What carries the argument

The orthogonal tokenizer, which rectifies latent manifolds and applies a geometric projection to separate each modality into shared semantics and modality-specific residuals.
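
The split into shared and specific parts can be pictured with an ordinary linear orthogonal projector. The following is a minimal, dependency-free sketch under stated assumptions: the shared subspace is taken to be the span of a few hand-picked vectors, orthonormalized with Gram-Schmidt, and the function names `orthonormal_basis` and `split` are hypothetical. It does not reproduce the paper's manifold rectification or its learned projection.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def orthonormal_basis(vectors):
    """Gram-Schmidt: turn linearly independent vectors into an orthonormal basis."""
    basis = []
    for v in vectors:
        w = list(v)
        for b in basis:
            c = dot(w, b)
            w = [wi - c * bi for wi, bi in zip(w, b)]
        norm = math.sqrt(dot(w, w))
        if norm < 1e-12:
            raise ValueError("vectors are linearly dependent")
        basis.append([wi / norm for wi in w])
    return basis

def split(h, shared_basis):
    """Project h onto the shared subspace; the residual is the specific part."""
    s = [0.0] * len(h)
    for b in shared_basis:
        c = dot(h, b)
        s = [si + c * bi for si, bi in zip(s, b)]
    p = [hi - si for hi, si in zip(h, s)]
    return s, p

# Assumed 4-d embedding with a 2-d shared subspace (illustrative values only).
shared_basis = orthonormal_basis([[1.0, 1.0, 0.0, 0.0], [0.0, 1.0, 1.0, 0.0]])
h = [0.5, -1.0, 2.0, 3.0]
s, p = split(h, shared_basis)
```

In this linear setting `s` and `p` are orthogonal and sum back to `h`, which is the sense in which a projection can separate two components without discarding anything. Whether the same holds for the paper's nonlinear pipeline is exactly what the load-bearing premise below asks.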

If this is right

  • VCR maintains strong performance when all sensor modalities are available.
  • The method increases robustness when one or more modalities are missing.
  • By reconstructing only shared components, the approach avoids generating unsupported details in absent signals.
  • The missing-aware mixture-of-experts adapts training and inference to any observed pattern of sensor availability.
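
One way to picture a router that adapts to sensor availability is a gate whose logits depend on a binary availability mask as well as on the token. This is a minimal sketch with assumed details: a linear gate followed by a softmax, toy experts, and hand-picked weights. The names `gate` and `moe` are hypothetical, and the paper's actual routing rule is not described in this summary.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def gate(token, mask, w_token, w_mask):
    """Expert weights from a token and the availability mask (1 = sensor present)."""
    logits = []
    for wt, wm in zip(w_token, w_mask):
        logits.append(sum(a * b for a, b in zip(wt, token))
                      + sum(a * b for a, b in zip(wm, mask)))
    return softmax(logits)

def moe(token, mask, experts, w_token, w_mask):
    """Weighted sum of expert outputs, with weights chosen by the mask-aware gate."""
    weights = gate(token, mask, w_token, w_mask)
    outputs = [expert(token) for expert in experts]
    dim = len(outputs[0])
    return [sum(w * out[i] for w, out in zip(weights, outputs)) for i in range(dim)]

# Two toy experts and hand-picked gate weights (illustrative values only).
experts = [lambda x: [2.0 * v for v in x], lambda x: [-v for v in x]]
w_token = [[0.0, 0.0], [0.0, 0.0]]
w_mask = [[3.0, 3.0, 3.0], [-3.0, -3.0, -3.0]]  # expert 0 is favoured when sensors are present

token = [1.0, 0.5]
full = gate(token, [1, 1, 1], w_token, w_mask)   # all three sensors available
none = gate(token, [0, 0, 0], w_token, w_mask)   # no sensors available
mixed = moe(token, [0, 0, 0], experts, w_token, w_mask)
```

With these weights the gate sends almost all mass to the first expert when every sensor is present and splits evenly when none is, so the same token is processed differently depending on the availability pattern.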

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same separation of inferable shared information from non-inferable specifics could apply to other multimodal sensor problems where inputs intermittently drop out.
  • If the disentanglement holds, the shared representations might transfer more reliably across different wearable devices or health monitoring tasks.
  • Testing the geometric projection step on datasets with varying missing rates would clarify how much the orthogonality contributes to the observed robustness gains.

Load-bearing premise

An orthogonal tokenizer can enforce strict separation of shared semantics from modality-specific residuals through manifold rectification and geometric projection without any loss of essential information.

What would settle it

Measure whether accuracy on health outcome prediction drops sharply when multiple modalities are removed if the reconstruction objective is changed to include full modality-specific details instead of only shared components.
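
The two objectives contrasted in that test can be written side by side. This is a minimal sketch, assuming a mean-squared-error penalty and a dictionary layout with `"shared"` and `"specific"` entries per modality; both are illustrative choices, and `reconstruction_loss` is a hypothetical name rather than the paper's loss.

```python
def mse(pred, target):
    return sum((a - b) ** 2 for a, b in zip(pred, target)) / len(target)

def reconstruction_loss(pred, target, missing, include_specific=False):
    """Average loss over the missing modalities.

    pred and target map a modality name to {"shared": [...], "specific": [...]}.
    With include_specific=False only the shared component is penalised.
    """
    if not missing:
        return 0.0
    total = 0.0
    for m in missing:
        total += mse(pred[m]["shared"], target[m]["shared"])
        if include_specific:
            total += mse(pred[m]["specific"], target[m]["specific"])
    return total / len(missing)

# Illustrative values: the shared part is predicted exactly, the specific part is not.
target = {"eda": {"shared": [1.0, 0.0], "specific": [0.3, -0.7]}}
pred = {"eda": {"shared": [1.0, 0.0], "specific": [0.0, 0.0]}}

shared_only = reconstruction_loss(pred, target, missing=["eda"])
full = reconstruction_loss(pred, target, missing=["eda"], include_specific=True)
```

Under the shared-only objective this prediction incurs no penalty, while the full objective pushes the model to guess the specific residual of a sensor it never observed. The proposed test asks whether that extra pressure hurts downstream accuracy.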

Figures

Figures reproduced from arXiv: 2605.18837 by Qijia Shao, Wenhan Luo, Yuxuan Weng.

Figure 1: (a) Performance under full and single-modal missingness (see Experiments for multiple missingness). We report normalized Macro-F1 (value/max) and MAE (min/value) within each dataset to compare methods across tasks and datasets. Left: comparison to supervised baselines trained on labeled data only. Right: comparison to pretrained/self-supervised baselines. Higher is better. (b) Modality-specific information…
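
The normalization in the Figure 1 caption puts a higher-is-better score and a lower-is-better error on the same 0-to-1 scale. A small sketch of that arithmetic follows; the method names and numbers are made up for illustration and are not results from the paper.

```python
def normalize_f1(scores):
    """Higher-is-better metric: divide each value by the best (maximum) value."""
    best = max(scores.values())
    return {name: value / best for name, value in scores.items()}

def normalize_mae(errors):
    """Lower-is-better metric: divide the best (minimum) error by each value."""
    best = min(errors.values())
    return {name: best / value for name, value in errors.items()}

# Made-up numbers for illustration only.
f1 = {"method_a": 0.80, "method_b": 0.60}
mae = {"method_a": 4.0, "method_b": 5.0}
f1_norm = normalize_f1(f1)
mae_norm = normalize_mae(mae)
```

After normalization the best method on each dataset scores 1 under both metrics, which is why the caption can say "higher is better" for MAE as well.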
Figure 2: Overview of VCR. (a) Orthogonal Tokenizer: During training, encoders map raw signals to embeddings h˜, which are rectified and projected by P into orthogonal shared s and specific p subspaces. Optimization combines cross-modal contrastive learning and intra-modal reconstruction in a full modality setting. During inference, our tokenizer uses a zero-padded placeholder with a "missingness flag" for missing m…
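
The inference-time placeholder mentioned in the Figure 2 caption can be sketched as follows. This assumes each modality is already encoded to a fixed-length embedding and that the flag is a single integer attached to the token; `tokenize` and the modality names are hypothetical, and the caption is truncated, so the paper's exact layout may differ.

```python
def tokenize(embeddings, modalities, dim):
    """Return one (embedding, flag) pair per modality; flag 1 marks a missing sensor."""
    tokens = []
    for name in modalities:
        x = embeddings.get(name)
        if x is None:
            # Missing sensor: zero-padded placeholder plus a missingness flag.
            tokens.append(([0.0] * dim, 1))
        else:
            if len(x) != dim:
                raise ValueError(f"{name}: expected {dim} values, got {len(x)}")
            tokens.append((list(x), 0))
    return tokens

# Illustrative call: ACC is observed, EDA is missing.
tokens = tokenize({"acc": [0.2, 0.4]}, ["acc", "eda"], dim=2)
```

Keeping the flag separate from the zeros lets a downstream model tell a missing sensor apart from a sensor that genuinely produced an all-zero embedding.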
Figure 3: (a) t-SNE visualization of the embeddings of the original samples and those with reconstructed ACC signals. (b) Performance Drop vs. Representation Shift in LSM-2.
3.3 Analysis. Quantitative Assessment of Disentanglement. To quantify the independence between the modal-shared component s and the modal-specific component p, we compute the Hilbert-Schmidt Independence Criterion (HSIC) using a Gaussian kernel…
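
For readers unfamiliar with HSIC, the standard biased empirical estimator is tr(KHLH)/(n-1)^2, where K and L are kernel matrices and H is the centering matrix. The sketch below is a dependency-free illustration with a Gaussian kernel. The bandwidth `sigma=1.0`, the sample size, and the toy data are assumptions, and the excerpt does not say which estimator or bandwidth the paper uses.

```python
import math
import random

def gaussian_kernel(xs, sigma):
    """Kernel matrix with entries exp(-||x_i - x_j||^2 / (2 sigma^2))."""
    n = len(xs)
    return [[math.exp(-sum((a - b) ** 2 for a, b in zip(xs[i], xs[j]))
                      / (2.0 * sigma ** 2))
             for j in range(n)] for i in range(n)]

def center(k):
    """Double-center a kernel matrix, i.e. compute H K H with H = I - 11^T / n."""
    n = len(k)
    row = [sum(r) / n for r in k]
    col = [sum(k[i][j] for i in range(n)) / n for j in range(n)]
    grand = sum(row) / n
    return [[k[i][j] - row[i] - col[j] + grand for j in range(n)] for i in range(n)]

def hsic(xs, ys, sigma=1.0):
    """Biased empirical HSIC: trace(K H L H) / (n - 1)^2."""
    n = len(xs)
    kc = center(gaussian_kernel(xs, sigma))
    l = gaussian_kernel(ys, sigma)
    # trace(HKH L) equals the elementwise sum because L is symmetric.
    return sum(kc[i][j] * l[i][j] for i in range(n) for j in range(n)) / (n - 1) ** 2

# Toy data: y_dep is a deterministic function of x, y_ind is drawn independently.
random.seed(0)
x = [[random.gauss(0.0, 1.0)] for _ in range(60)]
y_dep = [[2.0 * v[0] + 1.0] for v in x]
y_ind = [[random.gauss(0.0, 1.0)] for _ in range(60)]

dependent = hsic(x, y_dep)
independent = hsic(x, y_ind)
```

A value near zero is consistent with independence, so a low HSIC between s and p is evidence for, not proof of, the intended disentanglement. The estimate is sensitive to the kernel bandwidth.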
Figure 4: Analysis of the MoE Mechanism. Experts 1–3 capture different dependencies between EDA and the other three modalities, while Expert 7 primarily handles cases where other modalities are missing.
Figure 5: Macro-F1 drop of different ablations compared to VCR.
Figure 6: Distributions of Three Downstream Tasks (WESAD, AAUWSS, and DaLiA).
Figure 7: Rank evolution during training.
H Hyperparameter Settings. For the loss weights tuning, we fix λ_align = 1 and λ_rand = 1, so the main coefficients are λ_recon, λ_white, and λ_struct. For the tokenizer, λ_recon / λ_align controls the trade-off between reconstructability and cross-modal shared semantics. If λ_recon is too large, the model can offload part of the shared information into the specific branch p, because re…
Original abstract

Wearable devices enable continuous health monitoring from multimodal signals, but real-world deployment is hindered by limited labeled data and pervasive sensor incompleteness. While large-scale self-supervised pretraining reduces label dependence, most existing methods assume full modality availability. Current approaches for handling modality missingness often reconstruct entire absent signals, which can encourage hallucinating modality-specific details that are not inferable from the observed sensor signals and degrade robustness. We propose VCR, a self-supervised framework that learns to extract valid representations robust to modality missingness. VCR employs an orthogonal tokenizer to enforce strict orthogonal disentanglement by rectifying latent manifolds and applying a geometric projection, separating each modality into shared semantics and modality-specific residuals. This design preserves complete information integrity while serving as a structural foundation for robust learning under modality missingness. The resulting tokens are processed by a missing-aware mixture-of-experts backbone that adapts to varying patterns of modality availability. By constraining the objective to reconstruct only the shared components of missing modalities, VCR effectively mitigates hallucinations of non-inferable modality-specific details. Across multiple health monitoring tasks, VCR consistently improves performance and robustness under full, single-missing, and multiple-missing modality settings compared with strong supervised and self-supervised baselines.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes VCR, a self-supervised framework for learning valid contextual representations from incomplete multimodal wearable signals in health monitoring. It introduces an orthogonal tokenizer that rectifies latent manifolds and applies a geometric projection to disentangle each modality into shared semantics and modality-specific residuals while preserving information integrity. These tokens feed a missing-aware mixture-of-experts backbone that adapts to availability patterns and reconstructs only shared components of missing modalities to avoid hallucinating non-inferable details. The central claim is that VCR yields consistent gains in performance and robustness versus strong supervised and self-supervised baselines under full, single-missing, and multiple-missing modality settings.

Significance. If the empirical claims hold, the work addresses a practical deployment barrier for continuous wearable health monitoring by providing a structural solution to modality missingness that avoids reconstruction hallucinations. The orthogonal disentanglement via geometric projection offers a principled alternative to standard imputation or masking strategies in multimodal self-supervised learning.

major comments (2)
  1. [Method section, orthogonal tokenizer] The claim that latent manifold rectification plus geometric projection produces strictly orthogonal shared semantics while preserving all inferable shared information is load-bearing for the missing-aware reconstruction objective and the reported robustness gains. No information-theoretic bound or ablation isolating information loss under nonlinear, time-varying correlations (typical of wearable signals) is provided, leaving open the possibility that recoverable shared content is discarded or non-inferable specifics leak into the shared tokens.
  2. [Results section] The abstract asserts consistent improvements across missingness settings, yet the provided text supplies no quantitative metrics, error bars, statistical tests, or ablations that isolate the orthogonal tokenizer's contribution from the mixture-of-experts backbone. This absence prevents verification that the claimed robustness stems from the proposed disentanglement rather than other factors.
minor comments (1)
  1. [Introduction] Clarify the precise definition of 'valid contextual representation' and how the geometric projection differs from standard orthogonalization techniques (e.g., Gram-Schmidt) already used in multimodal learning.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on the orthogonal tokenizer and empirical validation. We address each major comment point by point below.

  1. Referee: Method section, orthogonal tokenizer: the claim that latent manifold rectification plus geometric projection produces strictly orthogonal shared semantics while preserving all inferable shared information is load-bearing for the missing-aware reconstruction objective and the reported robustness gains. No information-theoretic bound or ablation isolating information loss under nonlinear, time-varying correlations (typical of wearable signals) is provided, leaving open the possibility that recoverable shared content is discarded or non-inferable specifics leak into the shared tokens.

    Authors: We agree that the orthogonality claim is central. The geometric projection is designed to enforce orthogonality by construction after manifold rectification, mapping components to orthogonal subspaces. Preservation of inferable shared information is encouraged by the reconstruction objective targeting only shared components. We acknowledge the absence of a formal information-theoretic bound for nonlinear cases. In revision, we will add an ablation using synthetic signals with controlled nonlinear correlations to quantify retention and leakage in shared tokens. revision: yes

  2. Referee: Results section: the abstract asserts consistent improvements across missingness settings, yet the provided text supplies no quantitative metrics, error bars, statistical tests, or ablations that isolate the orthogonal tokenizer's contribution from the mixture-of-experts backbone. This absence prevents verification that the claimed robustness stems from the proposed disentanglement rather than other factors.

    Authors: The full manuscript's Results section (Section 4) presents quantitative metrics with means and standard deviations as error bars, along with statistical tests and ablations comparing models with and without the orthogonal tokenizer. To better isolate its contribution, we will expand the revision with a dedicated table and analysis breaking down the tokenizer's impact across full and missing-modality settings. revision: yes
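
The synthetic ablation promised in the first response could start from data whose shared and specific factors are known by construction. The sketch below is one hypothetical setup, not the authors' protocol: a shared latent `z` drives two modalities through different nonlinearities (`tanh` and a cube, both assumed), each modality adds its own independent noise, and squared Pearson correlation serves as a crude proxy for retention and leakage.

```python
import math
import random

def pearson_r2(u, v):
    """Squared Pearson correlation between two equal-length sequences."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    var_u = sum((a - mu) ** 2 for a in u)
    var_v = sum((b - mv) ** 2 for b in v)
    return cov * cov / (var_u * var_v)

def make_synthetic(n, noise=0.1, seed=0):
    """Two modalities driven by one shared latent plus independent specific noise."""
    rng = random.Random(seed)
    z = [rng.gauss(0.0, 1.0) for _ in range(n)]   # shared latent
    a = [rng.gauss(0.0, 1.0) for _ in range(n)]   # specific to modality A
    b = [rng.gauss(0.0, 1.0) for _ in range(n)]   # specific to modality B
    x_a = [math.tanh(zi) + noise * ai for zi, ai in zip(z, a)]
    x_b = [zi ** 3 + noise * bi for zi, bi in zip(z, b)]
    return z, a, b, x_a, x_b

z, a, b, x_a, x_b = make_synthetic(2000)
retention = pearson_r2(x_a, z)   # how much of the shared latent modality A carries
leakage = pearson_r2(x_a, b)     # modality A should carry nothing of B's specific factor
```

In an actual ablation, `x_a` would be replaced by the shared token a tokenizer produces for modality A: high `retention` and near-zero `leakage` would support the disentanglement claim. A linear correlation can miss nonlinear leakage, so a kernel measure would be a stricter check.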

Circularity Check

0 steps flagged

No significant circularity: VCR framework introduces independent structural constraints

The paper presents VCR as a self-supervised framework relying on an orthogonal tokenizer (via latent manifold rectification and geometric projection) and a missing-aware mixture-of-experts backbone. These are introduced as novel design choices that enforce disentanglement and constrain reconstruction to shared components only. The performance claims rest on empirical comparisons under full, single-missing, and multi-missing settings rather than any derivation that reduces by construction to fitted parameters, self-citations, or renamed inputs. No load-bearing step equates a claimed prediction or uniqueness result to its own inputs; the derivation chain remains self-contained with independent content.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the domain assumption that wearable signals admit clean separation into shared semantics and modality-specific residuals, plus the modeling choice that constraining reconstruction to shared components avoids hallucination; no free parameters or invented entities are explicitly introduced in the abstract.

axioms (1)
  • domain assumption: Wearable multimodal signals contain separable shared semantics and modality-specific residuals that can be isolated via orthogonal projection without loss of integrity.
    Invoked as the structural foundation for the orthogonal tokenizer and robust learning under missingness.

pith-pipeline@v0.9.0 · 5747 in / 1154 out tokens · 42271 ms · 2026-05-20T22:08:48.296584+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
