VCR: Learning Valid Contextual Representation for Incomplete Wearable Signals

Qijia Shao; Wenhan Luo; Yuxuan Weng

arXiv: 2605.18837 · v1 · pith:AAWXV46Onew · submitted 2026-05-13 · 💻 cs.LG · cs.AI · eess.SP


Pith reviewed 2026-05-20 22:08 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · eess.SP

keywords: wearable signals · modality missingness · self-supervised learning · orthogonal disentanglement · health monitoring · mixture of experts · sensor incompleteness


The pith

VCR learns valid contextual representations for incomplete wearable signals by disentangling shared and modality-specific features.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces VCR, a self-supervised framework designed to extract useful representations from multimodal wearable signals even when sensors are missing or incomplete. Existing approaches often try to reconstruct full missing signals, which risks creating details that cannot be inferred from the available data and harms overall robustness. VCR instead relies on an orthogonal tokenizer that rectifies latent manifolds and applies geometric projection to split each modality into shared semantic information and modality-specific residuals while keeping all original information intact. These tokens feed into a missing-aware mixture-of-experts backbone that adjusts to the current set of available sensors, and the training goal is restricted to reconstructing only the shared components for any missing modalities. This setup produces consistent gains in performance and robustness across health monitoring tasks under full, single-missing, and multiple-missing conditions compared with prior supervised and self-supervised methods.

Core claim

VCR employs an orthogonal tokenizer to enforce strict orthogonal disentanglement by rectifying latent manifolds and applying a geometric projection, separating each modality into shared semantics and modality-specific residuals. This design preserves complete information integrity while serving as a structural foundation for robust learning under modality missingness. The resulting tokens are processed by a missing-aware mixture-of-experts backbone that adapts to varying patterns of modality availability. By constraining the objective to reconstruct only the shared components of missing modalities, VCR effectively mitigates hallucinations of non-inferable modality-specific details.

What carries the argument

The orthogonal tokenizer, which rectifies latent manifolds and applies a geometric projection to separate each modality into shared semantics and modality-specific residuals.

If this is right

VCR maintains strong performance when all sensor modalities are available.
The method increases robustness when one or more modalities are missing.
By reconstructing only shared components, the approach avoids generating unsupported details in absent signals.
The missing-aware mixture-of-experts adapts training and inference to any observed pattern of sensor availability.
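The shared-only reconstruction objective described above can be illustrated with a minimal sketch. It assumes each modality already has a shared token vector, that the loss is a plain mean squared error, and that availability is given as a boolean mask; the function name `shared_only_recon_loss` and all shapes are hypothetical, and the paper's actual loss may differ.

```python
import numpy as np

def shared_only_recon_loss(s_pred, s_target, available):
    """MSE over the shared tokens of missing modalities only (illustrative).

    s_pred, s_target: arrays of shape (num_modalities, dim).
    available: boolean sequence, True where the sensor was observed.
    Modality-specific residuals are never a target in this sketch.
    """
    missing = ~np.asarray(available, dtype=bool)
    if not missing.any():
        return 0.0  # nothing is missing, so nothing has to be reconstructed
    return float(np.mean((s_pred[missing] - s_target[missing]) ** 2))

# Toy data: three modalities, four shared dimensions (hypothetical sizes).
s_target = np.ones((3, 4))
s_pred = np.zeros((3, 4))
loss_one_missing = shared_only_recon_loss(s_pred, s_target, [True, False, True])
loss_none_missing = shared_only_recon_loss(s_pred, s_target, [True, True, True])
```

Because the mask selects only missing modalities, errors on observed modalities and on any specific residual do not contribute, which is the mechanism the page credits with avoiding unsupported details.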

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same separation of inferable shared information from non-inferable specifics could apply to other multimodal sensor problems where inputs intermittently drop out.
If the disentanglement holds, the shared representations might transfer more reliably across different wearable devices or health monitoring tasks.
Testing the geometric projection step on datasets with varying missing rates would clarify how much the orthogonality contributes to the observed robustness gains.

Load-bearing premise

An orthogonal tokenizer can enforce strict separation of shared semantics from modality-specific residuals through manifold rectification and geometric projection without any loss of essential information.
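The linear-algebra part of this premise can be checked from the definitions quoted from the paper further down this page (P = BB⊤ with B taken from the columns of Q in a QR decomposition, s = Ph, p = (I − P)h). The short derivation below assumes only that B has orthonormal columns; it is a worked check, not the paper's own proof.

```latex
\begin{aligned}
P &= BB^{\top},\quad B^{\top}B = I_r
  \;\Longrightarrow\; P^{\top} = P,\quad P^{2} = B\,(B^{\top}B)\,B^{\top} = P,\\
s &= Ph,\qquad p = h - s = (I - P)\,h,\\
s + p &= h,\\
s^{\top}p &= h^{\top}P^{\top}(I - P)\,h = h^{\top}(P - P^{2})\,h = 0 .
\end{aligned}
```

The identity s + p = h is the sense in which no component of h is discarded, and s⊤p = 0 holds for every sample. The statistical claims Cov(s, p) ≈ 0 and I(s; p) ≈ 0 do not follow from this geometry alone: Cov(s, p) = P Cov(h) (I − P), which vanishes when Cov(h) is the identity but not for a general covariance. Whether the rectification step achieves that is what the premise needs the paper to show.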

What would settle it

Measure whether accuracy on health outcome prediction drops sharply when multiple modalities are removed if the reconstruction objective is changed to include full modality-specific details instead of only shared components.

Figures

Figures reproduced from arXiv: 2605.18837 by Qijia Shao, Wenhan Luo, Yuxuan Weng.

**Figure 1.** (a) Performance under full and single-modal missingness (see Experiments for multiple missingness). We report normalized Macro-F1 (value/max) and MAE (min/value) within each dataset to compare methods across tasks and datasets. Left: comparison to supervised baselines trained on labeled data only. Right: comparison to pretrained/self-supervised baselines. Higher is better. (b) Modality-specific information…

**Figure 2.** Overview of VCR. (a) Orthogonal Tokenizer: During training, encoders map raw signals to embeddings h̃, which are rectified and projected by P into orthogonal shared (s) and specific (p) subspaces. Optimization combines cross-modal contrastive learning and intra-modal reconstruction in a full-modality setting. During inference, our tokenizer uses a zero-padded placeholder with a “missingness flag” for missing m…

**Figure 3.** (a) t-SNE visualization of the embeddings of the original samples and those with reconstructed ACC signals. (b) Performance drop vs. representation shift in LSM-2.

3.3 Analysis. Quantitative Assessment of Disentanglement. To quantify the independence between the modal-shared component s and the modal-specific component p, we compute the Hilbert-Schmidt Independence Criterion (HSIC) using a Gaussian kernel…

**Figure 4.** Analysis of the MoE Mechanism. Experts 1–3 capture different dependencies between EDA and the other three modalities, while Expert 7 primarily handles cases where other modalities are missing.

**Figure 5.** Macro-F1 drop of different ablations compared to VCR.

**Figure 6.** Distributions of Three Downstream Tasks (WESAD, AAUWSS, and DaLiA).

**Figure 7.** Rank evolution during training.

H Hyperparameter Settings. For the loss weights tuning, we fix λ_align = 1 and λ_rand = 1, so the main coefficients are: λ_recon, λ_white, and λ_struct. For Tokenizer, λ_recon / λ_align controls the trade-off between reconstructability and cross-modal shared semantics. If λ_recon is too large, the model can offload part of the shared information into the specific branch p, because re…
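The Figure 3 text above mentions measuring independence between s and p with the Hilbert-Schmidt Independence Criterion (HSIC) and a Gaussian kernel. The sketch below shows the standard biased HSIC estimator, trace(KHLH)/(n−1)², under assumptions not stated on this page: fixed kernel bandwidths of 1.0 and synthetic Gaussian data. The function names are illustrative, and the paper's bandwidth choice and estimator variant are unknown.

```python
import numpy as np

def gaussian_gram(X, sigma):
    """Gram matrix of the Gaussian kernel exp(-||x - x'||^2 / (2 sigma^2))."""
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.exp(-np.maximum(d2, 0.0) / (2.0 * sigma ** 2))

def hsic(X, Y, sigma_x=1.0, sigma_y=1.0):
    """Biased HSIC estimate; values near zero suggest weak dependence."""
    n = X.shape[0]
    K = gaussian_gram(X, sigma_x)
    L = gaussian_gram(Y, sigma_y)
    H = np.eye(n) - np.ones((n, n)) / n      # centering matrix
    return float(np.trace(K @ H @ L @ H)) / (n - 1) ** 2

# Synthetic check: a dependent pair should score higher than an independent one.
rng = np.random.default_rng(0)
x = rng.standard_normal((200, 2))
y_independent = rng.standard_normal((200, 2))
y_dependent = x + 0.1 * rng.standard_normal((200, 2))
hsic_independent = hsic(x, y_independent)
hsic_dependent = hsic(x, y_dependent)
```

Unlike a covariance check, HSIC with a characteristic kernel is sensitive to nonlinear dependence, which is why it is a reasonable tool for testing whether specific residuals leak into the shared tokens.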

Original abstract

Wearable devices enable continuous health monitoring from multimodal signals, but real-world deployment is hindered by limited labeled data and pervasive sensor incompleteness. While large-scale self-supervised pretraining reduces label dependence, most existing methods assume full modality availability. Current approaches for handling modality missingness often reconstruct entire absent signals, which can encourage hallucinating modality-specific details that are not inferable from the observed sensor signals and degrade robustness. We propose VCR, a self-supervised framework that learns to extract valid representations robust to modality missingness. VCR employs an orthogonal tokenizer to enforce strict orthogonal disentanglement by rectifying latent manifolds and applying a geometric projection, separating each modality into shared semantics and modality-specific residuals. This design preserves complete information integrity while serving as a structural foundation for robust learning under modality missingness. The resulting tokens are processed by a missing-aware mixture-of-experts backbone that adapts to varying patterns of modality availability. By constraining the objective to reconstruct only the shared components of missing modalities, VCR effectively mitigates hallucinations of non-inferable modality-specific details. Across multiple health monitoring tasks, VCR consistently improves performance and robustness under full, single-missing, and multiple-missing modality settings compared with strong supervised and self-supervised baselines.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

VCR uses an orthogonal tokenizer with geometric projection and shared-only reconstruction in a missing-aware MoE to avoid hallucinating non-inferable details from incomplete wearables, but the abstract supplies no results or ablations to check if any of it works.


The paper's main move is to split each modality into shared semantics and specific residuals via an orthogonal tokenizer that rectifies manifolds and projects geometrically, then feed the tokens into a missing-aware mixture-of-experts backbone while limiting reconstruction to only the shared parts of absent modalities. This is meant to keep representations valid when sensors drop out in health monitoring tasks.
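To make the pipeline in the letter concrete, here is a hedged sketch of one way a gate could be made aware of missingness. It borrows the zero-padded placeholder and missingness flag mentioned in the Figure 2 caption and pairs them with a generic top-k softmax gate. The router input, the weight shapes, the choice of k, and the names `make_tokens` and `route` are all assumptions for illustration, not the paper's architecture.

```python
import numpy as np

def make_tokens(embeddings, available):
    """Zero-pad missing modalities and append a missingness flag column."""
    available = np.asarray(available, dtype=bool)
    tokens = np.where(available[:, None], embeddings, 0.0)   # placeholder = zeros
    flag = (~available).astype(float)[:, None]               # 1.0 marks "missing"
    return np.concatenate([tokens, flag], axis=1)            # shape (M, d + 1)

def route(tokens, available, W_gate, k=2):
    """Top-k softmax gate whose input includes the availability pattern."""
    M = tokens.shape[0]
    pattern = np.tile(np.asarray(available, dtype=float), (M, 1))   # (M, M)
    logits = np.concatenate([tokens, pattern], axis=1) @ W_gate     # (M, E)
    top = np.argsort(logits, axis=1)[:, -k:]                        # k largest per row
    kept = np.full_like(logits, -np.inf)
    np.put_along_axis(kept, top, np.take_along_axis(logits, top, axis=1), axis=1)
    kept -= kept.max(axis=1, keepdims=True)                         # stable softmax
    w = np.exp(kept)
    return w / w.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
M, d, E = 4, 6, 8                        # modalities, token dim, experts (hypothetical)
available = [True, False, True, True]    # second modality is missing
tokens = make_tokens(rng.standard_normal((M, d)), available)
weights = route(tokens, available, rng.standard_normal((d + 1 + M, E)), k=2)
```

Feeding the full availability pattern to the gate lets different experts specialize on different missingness patterns, which is consistent with the Figure 4 caption's description of one expert handling cases where other modalities are missing.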

Referee Report

2 major / 1 minor

Summary. The manuscript proposes VCR, a self-supervised framework for learning valid contextual representations from incomplete multimodal wearable signals in health monitoring. It introduces an orthogonal tokenizer that rectifies latent manifolds and applies a geometric projection to disentangle each modality into shared semantics and modality-specific residuals while preserving information integrity. These tokens feed a missing-aware mixture-of-experts backbone that adapts to availability patterns and reconstructs only shared components of missing modalities to avoid hallucinating non-inferable details. The central claim is that VCR yields consistent gains in performance and robustness versus strong supervised and self-supervised baselines under full, single-missing, and multiple-missing modality settings.

Significance. If the empirical claims hold, the work addresses a practical deployment barrier for continuous wearable health monitoring by providing a structural solution to modality missingness that avoids reconstruction hallucinations. The orthogonal disentanglement via geometric projection offers a principled alternative to standard imputation or masking strategies in multimodal self-supervised learning.

major comments (2)

[Method section, orthogonal tokenizer] The claim that latent manifold rectification plus geometric projection produces strictly orthogonal shared semantics while preserving all inferable shared information is load-bearing for the missing-aware reconstruction objective and the reported robustness gains. No information-theoretic bound or ablation isolating information loss under nonlinear, time-varying correlations (typical of wearable signals) is provided, leaving open the possibility that recoverable shared content is discarded or non-inferable specifics leak into the shared tokens.
[Results section] The abstract asserts consistent improvements across missingness settings, yet the provided text supplies no quantitative metrics, error bars, statistical tests, or ablations that isolate the orthogonal tokenizer's contribution from the mixture-of-experts backbone. This absence prevents verification that the claimed robustness stems from the proposed disentanglement rather than other factors.

minor comments (1)

[Introduction] Clarify the precise definition of 'valid contextual representation' and how the geometric projection differs from standard orthogonalization techniques (e.g., Gram-Schmidt) already used in multimodal learning.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on the orthogonal tokenizer and empirical validation. We address each major comment point by point below.


Referee: Method section, orthogonal tokenizer: the claim that latent manifold rectification plus geometric projection produces strictly orthogonal shared semantics while preserving all inferable shared information is load-bearing for the missing-aware reconstruction objective and the reported robustness gains. No information-theoretic bound or ablation isolating information loss under nonlinear, time-varying correlations (typical of wearable signals) is provided, leaving open the possibility that recoverable shared content is discarded or non-inferable specifics leak into the shared tokens.

Authors: We agree that the orthogonality claim is central. The geometric projection is designed to enforce orthogonality by construction after manifold rectification, mapping components to orthogonal subspaces. Preservation of inferable shared information is encouraged by the reconstruction objective targeting only shared components. We acknowledge the absence of a formal information-theoretic bound for nonlinear cases. In revision, we will add an ablation using synthetic signals with controlled nonlinear correlations to quantify retention and leakage in shared tokens. revision: yes
Referee: Results section: the abstract asserts consistent improvements across missingness settings, yet the provided text supplies no quantitative metrics, error bars, statistical tests, or ablations that isolate the orthogonal tokenizer's contribution from the mixture-of-experts backbone. This absence prevents verification that the claimed robustness stems from the proposed disentanglement rather than other factors.

Authors: The full manuscript's Results section (Section 4) presents quantitative metrics with means and standard deviations as error bars, along with statistical tests and ablations comparing models with and without the orthogonal tokenizer. To better isolate its contribution, we will expand the revision with a dedicated table and analysis breaking down the tokenizer's impact across full and missing-modality settings. revision: yes

Circularity Check

0 steps flagged

No significant circularity: VCR framework introduces independent structural constraints


The paper presents VCR as a self-supervised framework relying on an orthogonal tokenizer (via latent manifold rectification and geometric projection) and a missing-aware mixture-of-experts backbone. These are introduced as novel design choices that enforce disentanglement and constrain reconstruction to shared components only. The performance claims rest on empirical comparisons under full, single-missing, and multi-missing settings rather than any derivation that reduces by construction to fitted parameters, self-citations, or renamed inputs. No load-bearing step equates a claimed prediction or uniqueness result to its own inputs; the derivation chain remains self-contained with independent content.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that wearable signals admit clean separation into shared semantics and modality-specific residuals, plus the modeling choice that constraining reconstruction to shared components avoids hallucination; no free parameters or invented entities are explicitly introduced in the abstract.

axioms (1)

Domain assumption: Wearable multimodal signals contain separable shared semantics and modality-specific residuals that can be isolated via orthogonal projection without loss of integrity.
Invoked as the structural foundation for the orthogonal tokenizer and robust learning under missingness.



Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · relation: unclear

Paper passage: We employ P to project h onto the modal-shared subspace, yielding the shared representation: s = Ph. ... p = h − s = (I − P)h ... Cov(s, p) ≈ 0 ... I(s; p) ≈ 0

IndisputableMonolith/Foundation/AlexanderDuality.lean · alexander_duality_circle_linking · relation: unclear

Paper passage: QR decomposition on W ... extract first r columns of Q to form basis B ... P = BB⊤ ... orthogonal projection operator
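The two quoted passages describe building an orthogonal projector from a QR decomposition and using it to split an embedding. The NumPy sketch below follows those steps under stated assumptions: W is a random stand-in for whatever matrix the paper decomposes, the dimensions d and r are hypothetical, and `shared_projector` is an illustrative name rather than the authors' code.

```python
import numpy as np

def shared_projector(W, r):
    """Build P = B B^T from the first r columns of Q in W = QR."""
    Q, _ = np.linalg.qr(W)      # reduced QR; Q has orthonormal columns
    B = Q[:, :r]                # basis of the r-dimensional shared subspace
    return B @ B.T              # symmetric, idempotent projector

rng = np.random.default_rng(0)
d, r = 8, 3                          # hypothetical embedding and shared dimensions
W = rng.standard_normal((d, d))      # stand-in for a learned matrix
P = shared_projector(W, r)

h = rng.standard_normal(d)           # stand-in for a rectified embedding
s = P @ h                            # shared component, s = P h
p = h - s                            # specific residual, p = (I - P) h
```

In this construction s and p are orthogonal for every sample and sum back to h exactly. That is a geometric property; the statistical statements Cov(s, p) ≈ 0 and I(s; p) ≈ 0 in the quoted passage depend additionally on the distribution of h.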

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

52 extracted references · 52 canonical work pages · 1 internal anchor

[1]

Introducing wesad, a multimodal dataset for wearable stress and affect detection

Philip Schmidt, Attila Reiss, Robert Duerichen, Claus Marberger, and Kristof Van Laerhoven. Introducing wesad, a multimodal dataset for wearable stress and affect detection. InProceedings of the 20th ACM international conference on multimodal interaction, pages 400–408, 2018

work page 2018
[2]

Rummana Bari, Md Mahbubur Rahman, Nazir Saleheen, Megan Battles Parsons, Eugene H Buder, and Santosh Kumar. Automated detection of stressful conversations using wearable physiological and inertial sensors.Proceedings of the ACM on interactive, mobile, wearable and ubiquitous technologies, 4(4):1–23, 2020

work page 2020
[3]

Bing Zhai, Ignacio Perez-Pozuelo, Emma AD Clifton, Joao Palotti, and Yu Guan. Making sense of sleep: Multimodal sleep stage classification in a large, diverse population using movement and cardiac sensing.Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies, 4(2):1–33, 2020

work page 2020
[4]

Real-Time Sleep Staging using Deep Learning on a Smartphone for a Wearable EEG

Abhay Koushik, Judith Amores, and Pattie Maes. Real-time sleep staging using deep learning on a smartphone for a wearable eeg.arXiv preprint arXiv:1811.10111, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018
[5]

Taoran Sheng and Manfred Huber. Weakly supervised multi-task representation learning for human activity analysis using wearables.Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies, 4(2):1–18, 2020

work page 2020
[6]

Shengzhong Liu, Shuochao Yao, Jinyang Li, Dongxin Liu, Tianshi Wang, Huajie Shao, and Tarek Abdelzaher. Giobalfusion: A global attentional deep learning framework for multisensor information fusion.Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies, 4(1):1–27, 2020

work page 2020
[7]

Beyond sensor data: Foundation models of behavioral data from wearables improve health predictions

Eray Erturk, Fahad Kamran, Salar Abbaspourazad, Sean Jewell, Harsh Sharma, Yujie Li, Sinead Williamson, Nicholas J Foti, and Joseph Futoma. Beyond sensor data: Foundation models of behavioral data from wearables improve health predictions. InProceedings of the 42nd International Conference on Machine Learning (ICML), pages 15516–15541. PMLR, 2025

work page 2025
[8]

Tailor, Jacob Sunshine, Yun Liu, Tim Althoff, Shrikanth Narayanan, Pushmeet Kohli, Jiening Zhan, Mark Malhotra, Shwetak Patel, Samy Abdel-Ghaffar, and Daniel McDuff

Girish Narayanswamy, Xin Liu, Kumar Ayush, Yuzhe Yang, Xuhai Xu, Shun Liao, Jake Garrison, Shyam A. Tailor, Jacob Sunshine, Yun Liu, Tim Althoff, Shrikanth Narayanan, Pushmeet Kohli, Jiening Zhan, Mark Malhotra, Shwetak Patel, Samy Abdel-Ghaffar, and Daniel McDuff. Scaling wearable foundation models. InThe Thirteenth International Conference on Learning R...

work page 2025
[9]

Relcon: Relative contrastive learning for a motion foundation model for wearable data

Maxwell A Xu, Jaya Narain, Gregory Darnell, Haraldur T Hallgrimsson, Hyewon Jeong, Darren Forde, Richard Andres Fineman, Karthik Jayaraman Raghuram, James Matthew Rehg, and Shirley You Ren. Relcon: Relative contrastive learning for a motion foundation model for wearable data. InThe Thirteenth International Conference on Learning Representations (ICLR), 2024

work page 2024
[10]

Papagei: Open foundation models for optical physiological signals

Arvind Pillai, Dimitris Spathis, Fahim Kawsar, and Mohammad Malekzadeh. Papagei: Open foundation models for optical physiological signals. InThe Thirteenth International Conference on Learning Representations (ICLR), 2024

work page 2024
[11]

Sensorlm: Learning the language of wearable sensors.Advances in neural information processing systems (NeurIPS), 2025

Yuwei Zhang, Kumar Ayush, Siyuan Qiao, A Ali Heydari, Girish Narayanswamy, Maxwell A Xu, Ahmed A Metwally, Shawn Xu, Jake Garrison, Xuhai Xu, et al. Sensorlm: Learning the language of wearable sensors.Advances in neural information processing systems (NeurIPS), 2025

work page 2025
[12]

Contrastive multiview coding

Yonglong Tian, Dilip Krishnan, and Phillip Isola. Contrastive multiview coding. InEuropean conference on computer vision (ECCV), pages 776–794. Springer, 2020

work page 2020
[13]

Masked siamese networks for label-efficient learning

Mahmoud Assran, Mathilde Caron, Ishan Misra, Piotr Bojanowski, Florian Bordes, Pascal Vincent, Armand Joulin, Mike Rabbat, and Nicolas Ballas. Masked siamese networks for label-efficient learning. InComputer Vision – ECCV 2022, pages 456–473. Springer, 2022. doi: 10.1007/978-3-031-19821-2_26. 10

work page doi:10.1007/978-3-031-19821-2_26 2022
[14]

Self-supervised contrastive pre-training for time series via time-frequency consistency.Advances in neural information processing systems (NeurIPS), 35:3988–4003, 2022

Xiang Zhang, Ziyuan Zhao, Theodoros Tsiligkaridis, and Marinka Zitnik. Self-supervised contrastive pre-training for time series via time-frequency consistency.Advances in neural information processing systems (NeurIPS), 35:3988–4003, 2022

work page 2022
[15]

Xu, Girish Narayanswamy, Kumar Ayush, Dimitris Spathis, Shun Liao, Shyam A

Maxwell A. Xu, Girish Narayanswamy, Kumar Ayush, Dimitris Spathis, Shun Liao, Shyam A. Tailor, Ahmed Metwally, A. Ali Heydari, Yuwei Zhang, Jake Garrison, Samy Abdel-Ghaffar, Xuhai Xu, Ken Gu, Jacob Sunshine, Ming-Zher Poh, Yun Liu, Tim Althoff, Shrikanth Narayanan, Pushmeet Kohli, Mark Malhotra, Shwetak Patel, Yuzhe Yang, James M. Rehg, Xin Liu, and Dani...

work page arXiv 2025
[16]

Flex-moe: Modeling arbitrary modality combination via the flexible mixture-of-experts

Sukwon Yun, Inyoung Choi, Jie Peng, Yangfan Wu, Jingxuan Bao, Qiyiwen Zhang, Jiayi Xin, Qi Long, and Tianlong Chen. Flex-moe: Modeling arbitrary modality combination via the flexible mixture-of-experts. InAdvances in Neural Information Processing Systems (NeurIPS), volume 37, 2024. doi: 10.52202/079017-3135

work page doi:10.52202/079017-3135 2024
[17]

Fusemoe: mixture-of-experts transformers for fleximodal fusion

Xing Han, Huy Nguyen, Carl Harris, Nhat Ho, and Suchi Saria. Fusemoe: mixture-of-experts transformers for fleximodal fusion. InProceedings of the 38th International Conference on Neural Information Processing Systems (NeurIPS), pages 67850–67900, 2024

work page 2024
[18]

Domain separation networks

Konstantinos Bousmalis, George Trigeorgis, Nathan Silberman, Dilip Krishnan, and Dumitru Erhan. Domain separation networks. InAdvances in Neural Information Processing Systems (NeurIPS), volume 29, pages 343–351, 2016

work page 2016
[19]

Ssole: Rethinking orthogonal low-rank em- bedding for self-supervised learning

Lun Huang, Qiang Qiu, and Guillermo Sapiro. Ssole: Rethinking orthogonal low-rank em- bedding for self-supervised learning. InThe Thirteenth International Conference on Learning Representations (ICLR), 2025

work page 2025
[20]

Focal: Contrastive learning for multimodal time-series sensing signals in factorized orthogonal latent space

Shengzhong Liu, Tomoyoshi Kimura, Dongxin Liu, Ruijie Wang, Jinyang Li, Suhas Diggavi, Mani Srivastava, and Tarek Abdelzaher. Focal: Contrastive learning for multimodal time-series sensing signals in factorized orthogonal latent space. InAdvances in Neural Information Processing Systems (NeurIPS), volume 36, 2023

work page 2023
[21]

Barlow twins: Self- supervised learning via redundancy reduction

Jure Zbontar, Li Jing, Ishan Misra, Yann LeCun, and Stéphane Deny. Barlow twins: Self- supervised learning via redundancy reduction. InInternational conference on machine learning (ICML), pages 12310–12320. PMLR, 2021

work page 2021
[22]

A simple framework for contrastive learning of visual representations

Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations. InProceedings of the 37th International Conference on Machine Learning (ICML), volume 119 ofProceedings of Machine Learning Research, pages 1597–1607, 2020

work page 2020
[23]

Outrageously large neural networks: The sparsely-gated mixture-of-experts layer

Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, and Jeff Dean. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. InInternational Conference on Learning Representations (ICLR), 2017

work page 2017
[24]

Data2vec: A general framework for self-supervised learning in speech, vision and language

Alexei Baevski, Wei-Ning Hsu, Qiantong Xu, Arun Babu, Jiatao Gu, and Michael Auli. Data2vec: A general framework for self-supervised learning in speech, vision and language. In International conference on machine learning (ICML), pages 1298–1312. PMLR, 2022

work page 2022
[25]

Masked feature prediction for self-supervised visual pre-training

Chen Wei, Haoqi Fan, Saining Xie, Chao-Yuan Wu, Alan Yuille, and Christoph Feichten- hofer. Masked feature prediction for self-supervised visual pre-training. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition (CVPR), pages 14668–14678, 2022

work page 2022
[26]

Masked autoencoders are scalable vision learners

Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross Girshick. Masked autoencoders are scalable vision learners. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition (CVPR), pages 16000–16009, 2022

work page 2022
[27]

Engineering digital biomarkers of interstitial glucose from noninvasive smartwatches.NPJ Digital Medicine, 4(1):89, 2021

Brinnae Bent, Peter J Cho, Maria Henriquez, April Wittmann, Connie Thacker, Mark Feinglos, Matthew J Crowley, and Jessilyn P Dunn. Engineering digital biomarkers of interstitial glucose from noninvasive smartwatches.NPJ Digital Medicine, 4(1):89, 2021. 11

work page 2021
[28]

Fusion of learned representations for multimodal sensor data classification

Lee B Hinkle, Gentry Atkinson, and Vangelis Metsis. Fusion of learned representations for multimodal sensor data classification. InIFIP International Conference on Artificial Intelligence Applications and Innovations, pages 404–415. Springer, 2023

work page 2023
[29]

Can-stress: A real-world multimodal dataset for understanding cannabis use, stress, and physiological responses.arXiv preprint arXiv:2503.19935, 2025

Reza Rahimi Azghan, Nicholas C Glodosky, Ramesh Kumar Sah, Carrie Cuttler, Ryan McLaugh- lin, Michael J Cleveland, and Hassan Ghasemzadeh. Can-stress: A real-world multimodal dataset for understanding cannabis use, stress, and physiological responses.arXiv preprint arXiv:2503.19935, 2025

work page arXiv 2025
[30]

Deep ppg: Large-scale heart rate estimation with convolutional neural networks.Sensors, 19(14):3079, 2019

Attila Reiss, Ina Indlekofer, Philip Schmidt, and Kristof Van Laerhoven. Deep ppg: Large-scale heart rate estimation with convolutional neural networks.Sensors, 19(14):3079, 2019

work page 2019
[31]

Nielsen, and Anders Bruun

Shagen Djanian, Thomas Dyhre Nielsen, Søren H. Nielsen, and Anders Bruun. Aalborg University Wearable Sleep Study (AAUWSS), August 2025. URL https://doi.org/10. 5281/zenodo.16919071. Version 1.0

work page 2025
[32]

A multidevice and multimodal dataset for human energy expenditure estimation using wearable devices.Scientific Data, 9(1):537, 2022

Shkurta Gashi, Chulhong Min, Alessandro Montanari, Silvia Santini, and Fahim Kawsar. A multidevice and multimodal dataset for human energy expenditure estimation using wearable devices.Scientific Data, 9(1):537, 2022

work page 2022
[33]

URLhttps://ieeexplore.ieee.org/document/7780459

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for im- age recognition. InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778, 2016. doi: 10.1109/CVPR.2016.90

work page doi:10.1109/cvpr.2016.90 2016
[34]

An image is worth 16×16 words: Transformers for image recognition at scale

Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16×16 words: Transformers for image recognition at scale. InInternational Conference on Learning Representations (ICLR), 2021

work page 2021
[35]

Mf-clr: Multi-frequency contrastive learning representation for time series

Jufang Duan, Wei Zheng, Yangzhou Du, Wenfa Wu, Haipeng Jiang, and Hongsheng Qi. Mf-clr: Multi-frequency contrastive learning representation for time series. InForty-first International Conference on Machine Learning (ICML), 2024

work page 2024
[36]

Large-scale training of foundation models for wearable biosignals

Salar Abbaspourazad, Oussama Elachqar, Andrew Miller, Saba Emrani, Udhyakumar Nallasamy, and Ian Shapiro. Large-scale training of foundation models for wearable biosignals. InThe Twelfth International Conference on Learning Representations (ICLR), 2024

work page 2024
[37]

Clocs: Contrastive learning of cardiac signals across space, time, and patients

Dani Kiyasseh, Tingting Zhu, and David A Clifton. Clocs: Contrastive learning of cardiac signals across space, time, and patients. InInternational Conference on Machine Learning (ICML), pages 5606–5615. PMLR, 2021

work page 2021
[38]

Unsupervised representation learning for time series with temporal neighborhood coding

Sana Tonekaboni, Danny Eytan, and Anna Goldenberg. Unsupervised representation learning for time series with temporal neighborhood coding. InInternational Conference on Learning Representations (ICLR), 2021

work page 2021
[39]

Ts2vec: Towards universal representation of time series

Zhihan Yue, Yujing Wang, Juanyong Duan, Tianmeng Yang, Congrui Huang, Yunhai Tong, and Bixiong Xu. Ts2vec: Towards universal representation of time series. InProceedings of the AAAI conference on artificial intelligence, volume 36, pages 8980–8987, 2022

work page 2022
[40]

A time series is worth 64 words: Long-term forecasting with transformers

Yuqi Nie, Nam H Nguyen, Phanwadee Sinthong, and Jayant Kalagnanam. A time series is worth 64 words: Long-term forecasting with transformers. InThe Eleventh International Conference on Learning Representations (ICLR), 2023

work page 2023
[41]

Moment: A family of open time-series foundation models

Mononito Goswami, Konrad Szafer, Arjun Choudhry, Yifu Cai, Shuo Li, and Artur Dubrawski. Moment: A family of open time-series foundation models. InInternational Conference on Machine Learning, pages 16115–16152. PMLR, 2024

[42] Jiaxiang Dong, Haixu Wu, Haoran Zhang, Li Zhang, Jianmin Wang, and Mingsheng Long. Simmtm: A simple pre-training framework for masked time-series modeling. Advances in Neural Information Processing Systems (NeurIPS), 36:29996–30025, 2023.

[43] Emadeldeen Eldele, Mohamed Ragab, Zhenghua Chen, Min Wu, Chee Keong Kwoh, Xiaoli Li, and Cuntai Guan. Time-series representation learning via temporal and contextual contrasting. In Proceedings of the Thirtieth International Joint Conference on Artificial Intelligence (IJCAI), pages 2352–2359. International Joint Conferences on Artificial Intelligence Orga...

[44] Yeongyeon Na, Minje Park, Yunwon Tae, and Sunghoon Joo. Guiding masked representation learning to capture spatio-temporal relationship of electrocardiogram. In The Twelfth International Conference on Learning Representations (ICLR), 2024.

[45] Zhenghong Lin, Yanchao Tan, Yunfei Zhan, Weiming Liu, Fan Wang, Chaochao Chen, Shiping Wang, and Carl Yang. Contrastive intra-and inter-modality generation for enhancing incomplete multimedia recommendation. In Proceedings of the 31st ACM International Conference on Multimedia (MM), pages 6234–6242, 2023.

[46] Qi Shen, Junchang Xin, Bing T Dai, Shudi Zhang, and Zhiqiong Wang. Robust sleep staging over incomplete multimodal physiological signals via contrastive imagination. Advances in Neural Information Processing Systems (NeurIPS), 37:112025–112049, 2024.

[47] Shenghong Dai, Shiqi Jiang, Yifan Yang, Ting Cao, Mo Li, Suman Banerjee, and Lili Qiu. Babel: A scalable pre-trained model for multi-modal sensing via expandable modality alignment. In Proceedings of the 23rd ACM Conference on Embedded Networked Sensor Systems (SenSys), pages 240–253, 2025.

[48] Xiaomin Ouyang, Jason Wu, Tomoyoshi Kimura, Yihan Lin, Gunjan Verma, Tarek Abdelzaher, and Mani Srivastava. Mmbind: Unleashing the potential of distributed and heterogeneous data for multimodal learning in iot. In Proceedings of the 23rd ACM Conference on Embedded Networked Sensor Systems (SenSys), pages 491–503, 2025.

[49] Payal Mohapatra, Yueyuan Sui, Akash Pandey, Stephen Xia, and Qi Zhu. Maestro: Adaptive sparse attention and robust learning for multimodal dynamic time series. In Advances in Neural Information Processing Systems (NeurIPS), 2025.

[50] Zhenbang Wu, Anant Dadu, Nicholas Tustison, Brian Avants, Mike Nalls, Jimeng Sun, and Faraz Faghri. Multimodal patient representation learning with missing modalities and labels. In The Twelfth International Conference on Learning Representations (ICLR), 2024.

[51] Yuanzhi Wang, Yong Li, and Zhen Cui. Incomplete multimodality-diffused emotion recognition. Advances in Neural Information Processing Systems (NeurIPS), 36:17117–17128, 2023.

[52] Paul Pu Liang, Zihao Deng, Martin Q Ma, James Y Zou, Louis-Philippe Morency, and Ruslan Salakhutdinov. Factorized contrastive learning: Going beyond multi-view redundancy. Advances in Neural Information Processing Systems (NeurIPS), 36:32971–32998, 2023.

