Contrastive Learning for Multimodal Human Activity Recognition with Limited Labeled Data

Long Jing; Xinlong Feng; Yajun Zhang; Zhixiong Yang

arxiv: 2604.23281 · v1 · submitted 2026-04-25 · 💻 cs.LG · cs.CV

Pith reviewed 2026-05-08 08:25 UTC · model grok-4.3

keywords: contrastive learning · multimodal human activity recognition · limited labeled data · two-stage training · cross-modal features · CNN-DiffTransformer

The pith

CLMM uses two-stage contrastive learning to raise multimodal human activity recognition accuracy when labels are scarce.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces CLMM as a contrastive learning framework for multimodal human activity recognition that works effectively even when labeled data is limited and sensor inputs are heterogeneous. It proposes a two-stage process that first extracts shared cross-modal features with a CNN-DiffTransformer encoder plus hard-positive weighting, then fuses modality-specific details through a dual-branch setup with quality-guided attention and primary-auxiliary training. A sympathetic reader would care because closing this performance gap could make sensor-based activity systems practical for real-world uses such as health tracking or interactive environments where collecting many labels is costly. Experiments on three public datasets show the approach improves both final accuracy and training convergence over prior baselines.

Core claim

CLMM is a general contrastive learning framework that achieves effective multimodal recognition with limited labeled data. It employs a novel two-stage training strategy. In the first stage, a CNN-DiffTransformer encoder captures cross-modal shared information by extracting local and global features, while a hard-positive samples weighting algorithm enhances gradient propagation. In the second stage, a dual-branch architecture combining quality-guided attention and bidirectional gated units captures modality-specific information, and a primary-auxiliary collaborative training strategy fuses shared and specific information.
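
This page gives no equation for the first-stage objective, so the following is only a minimal sketch of how a cross-modal contrastive loss with hard-positive weighting might look. The weighting rule (low-similarity positives count as "hard"), the `temperature` and `gamma` parameters, and the function name are all assumptions made for illustration; they are not taken from the paper.

```python
import numpy as np

def weighted_cross_modal_infonce(z_a, z_b, temperature=0.1, gamma=1.0):
    """Cross-modal InfoNCE with larger weights on hard positives (illustrative).

    z_a, z_b: (N, D) embeddings of the same N windows from two modalities;
    row i of z_a and row i of z_b form a positive pair.
    gamma: hypothetical knob; gamma=0 recovers the unweighted loss.
    """
    # L2-normalise so that dot products are cosine similarities.
    z_a = z_a / np.linalg.norm(z_a, axis=1, keepdims=True)
    z_b = z_b / np.linalg.norm(z_b, axis=1, keepdims=True)
    sim = z_a @ z_b.T                       # (N, N) cosine similarities
    logits = sim / temperature
    # Row-wise log-softmax; the diagonal holds the positive pairs.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    per_pair_loss = -np.diag(log_prob)
    # Assumed weighting: positives with low similarity are treated as "hard".
    hardness = (1.0 - np.diag(sim)) / 2.0   # lies in [0, 1]
    weights = (1.0 + hardness) ** gamma
    weights = weights / weights.mean()      # keep the loss scale comparable
    return float((weights * per_pair_loss).mean())
```

Normalising the weights to mean 1 is a design choice in this sketch: it shifts gradient mass toward poorly aligned positive pairs without changing the overall loss scale, which keeps the learning rate comparable to the unweighted case.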

What carries the argument

The CLMM two-stage contrastive framework, where the first stage uses a CNN-DiffTransformer encoder with hard-positive weighting to learn shared features and the second stage uses dual-branch quality-guided attention with primary-auxiliary training to integrate modality-specific features.

If this is right

CLMM raises recognition accuracy above current state-of-the-art baselines across three public multimodal datasets.
CLMM reaches higher accuracy faster during training than prior methods.
CLMM handles heterogeneous multi-sensor data effectively even when only limited labels are available.
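
"Reaches higher accuracy faster" can be made concrete in several ways. One simple option, sketched here under the assumption that validation accuracy is logged once per epoch, is the first epoch at which a curve reaches a fixed threshold. The function name and the two curves are hypothetical and are not results from the paper.

```python
def epochs_to_reach(accuracy_per_epoch, threshold):
    """Return the 1-based epoch at which accuracy first reaches threshold, or None."""
    for epoch, acc in enumerate(accuracy_per_epoch, start=1):
        if acc >= threshold:
            return epoch
    return None

# Hypothetical curves for illustration only; not results from the paper.
method_a = [0.40, 0.62, 0.75, 0.81, 0.84]
method_b = [0.35, 0.50, 0.63, 0.72, 0.80]
print(epochs_to_reach(method_a, 0.80), epochs_to_reach(method_b, 0.80))  # prints: 4 5
```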

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same two-stage contrastive pattern could be tested on other sensor-fusion tasks such as gesture recognition or environmental monitoring.
Applying CLMM to streaming real-time data would test whether the collaborative training supports low-latency inference.
Adding mechanisms for automatic modality selection might extend the approach to cases where some sensors are intermittently unavailable.

Load-bearing premise

The CNN-DiffTransformer encoder, hard-positive weighting, quality-guided attention, and primary-auxiliary collaborative training will generalize to new multimodal datasets without overfitting to the limited labels in the three tested cases.

What would settle it

Evaluating CLMM on a fourth multimodal human activity dataset with scarce labels and finding that its accuracy or convergence speed does not exceed existing state-of-the-art baselines would disprove the central claim.
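
A test like this is only informative if run-to-run variation is accounted for. As a hedged sketch, assuming each method is trained several times on the new dataset with matched seeds and splits, a paired bootstrap over the per-run accuracy differences gives a rough interval for the gap. The function name and its parameters are illustrative choices, not part of the paper.

```python
import random
from statistics import mean

def paired_bootstrap_ci(acc_new, acc_baseline, n_resamples=2000, alpha=0.05, seed=0):
    """Mean paired accuracy difference and a percentile bootstrap interval."""
    if len(acc_new) != len(acc_baseline):
        raise ValueError("need one accuracy per run for each method")
    # Pair runs that share a seed and split, then resample the differences.
    diffs = [a - b for a, b in zip(acc_new, acc_baseline)]
    rng = random.Random(seed)
    boot_means = sorted(
        mean(rng.choices(diffs, k=len(diffs))) for _ in range(n_resamples)
    )
    lower = boot_means[int((alpha / 2) * n_resamples)]
    upper = boot_means[int((1 - alpha / 2) * n_resamples) - 1]
    return mean(diffs), lower, upper
```

If the interval excludes zero on the positive side, the new method's advantage is at least not explained by seed noise alone; an interval that straddles or sits below zero would count against the central claim.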

Figures

Figures reproduced from arXiv: 2604.23281 by Long Jing, Xinlong Feng, Yajun Zhang, Zhixiong Yang.

**Figure 1.** A typical application of CLMM in multimodal human activity …

**Figure 2.** Overview of CLMM. … a discriminator distinguishes real from generated samples, effectively combining limited labels with unlabeled data. HMGAN [20] learns both shared and modality-specific features via a generator, and employs a hierarchical discriminator that computes modality-level and global adversarial losses to produce multimodal samples from limited annotations. However, GANs require training both th…

**Figure 4.** Workflow of the hard positive samples weighting algorithm. Low …

**Figure 5.** Dual-branch architecture extracts modality-specific information

**Figure 6.** Performance comparison between proposed Primary-Auxiliary …

**Figure 9.** Accuracy with/without labeled data from target subjects. (Plot axes: Accuracy against Modality Combinations M1, M2, M3.)

**Figure 12.** Convergence Performance of CLMM, CMC and Cosmo on the …

**Figure 13.** Hyperparameter sensitivity analysis of CLMM across three datasets.

Original abstract

Human activity recognition serves as the foundation for various emerging applications. In recent years, researchers have used collaborative sensing of multi-source sensors to capture complex and dynamic human activities. However, multimodal human activity sensing typically encounters highly heterogeneous data across modalities and label scarcity, resulting in an application gap between existing solutions and real-world needs. In this paper, we propose CLMM, a general contrastive learning framework for human activity recognition that achieves effective multimodal recognition with limited labeled data. CLMM employs a novel two-stage training strategy. In the first stage, CLMM employs a CNN-DiffTransformer encoder to capture cross-modal shared information by extracting local and global features. Meanwhile, a hard-positive samples weighting algorithm enhances gradient propagation to reinforce shared learning. In the second stage, a dual-branch architecture combining quality-guided attention and bidirectional gated units captures modality-specific information, while a primary-auxiliary collaborative training strategy fuses both shared and modality-specific information. Experimental results on three public datasets demonstrate that CLMM significantly improves state-of-the-art baselines in both recognition accuracy and convergence performance.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

CLMM describes a plausible two-stage contrastive method for label-scarce multimodal HAR, but lacks the quantitative details needed to back its accuracy claims.

CLMM is a new contrastive learning framework for multimodal human activity recognition with limited labeled data. It splits training into two stages: the first learns shared cross-modal features with a CNN-DiffTransformer and hard-positive weighting, and the second captures modality-specific details using quality-guided attention and bidirectional gated units with primary-auxiliary training. The paper does a solid job describing how these pieces fit together to address heterogeneity in sensor data and label scarcity. The two-stage strategy and specific components such as hard-positive sample weighting seem like practical additions to existing contrastive approaches in this area.

What the work gets right is its focus on a real deployment issue in ambient intelligence and health monitoring, where collecting many labels is expensive. The architecture targets both shared and modality-specific information, which makes sense for multimodal setups.

The soft spots are in the evidence. The abstract claims significant improvements in accuracy and convergence on three public datasets but provides no numbers, no ablation results, and no details on the experimental setup. My main concern is the baselines: if the state-of-the-art numbers come from fully supervised training rather than the same limited-label regime, the gains can't be credited to CLMM. The full paper needs to show that all comparisons used identical label fractions and splits. There are no equations or derivations, so the contribution is mostly in the procedural design rather than new theory. Citation patterns aren't visible here, but the method builds on contrastive learning and transformer ideas without claiming to reinvent them.

This paper would interest people working on sensor fusion for activity recognition, especially those trying to make models work with small labeled sets. A reader looking for engineering solutions in multimodal time-series might pick up ideas from the dual-branch and collaborative training parts. I think it deserves peer review: the problem is relevant, the proposed method has enough novelty in its combination of elements, and a referee could help strengthen the experimental section.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes CLMM, a contrastive learning framework for multimodal human activity recognition under limited labeled data. It introduces a two-stage training strategy: stage one uses a CNN-DiffTransformer encoder to extract local and global cross-modal shared features, augmented by a hard-positive samples weighting algorithm; stage two employs a dual-branch architecture with quality-guided attention and bidirectional gated units, combined via primary-auxiliary collaborative training to fuse shared and modality-specific information. The central claim is that CLMM significantly improves recognition accuracy and convergence performance over state-of-the-art baselines on three public datasets.

Significance. If the reported gains are supported by fair, controlled experiments under identical limited-label regimes, the work could meaningfully advance label-efficient multimodal HAR by demonstrating how contrastive pre-training and staged fusion can mitigate data heterogeneity and scarcity. The specific architectural choices (CNN-DiffTransformer, hard-positive weighting, quality-guided attention) represent targeted innovations that, if validated, would be of interest to the sensing and activity recognition communities.

major comments (2)

[Abstract] The assertion that CLMM 'significantly improves state-of-the-art baselines in both recognition accuracy and convergence performance' supplies no quantitative numbers, baseline re-implementation details, ablation studies, or statistical tests, rendering the central empirical claim unverifiable from the provided text.
[Experimental protocol] (presumably §4 or §5) The manuscript does not explicitly confirm that all cited SOTA baselines were re-trained or re-evaluated using the exact same label scarcity fractions, data splits, and limited-label conditions applied to CLMM. If original full-supervision numbers are instead referenced, the accuracy and convergence deltas cannot be attributed to the proposed contrastive components or two-stage strategy.
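
One way to make the "same label fractions and splits" requirement checkable is to draw the labeled subset once, with a fixed seed, and hand the same indices to every method. The sketch below assumes a per-class (stratified) draw and a minimum of one label per class; both are illustrative choices, and the function name is hypothetical rather than anything described in the manuscript.

```python
import random
from collections import defaultdict

def stratified_label_subset(labels, fraction, seed=0):
    """Pick indices of the samples whose labels are revealed to every method."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    rng = random.Random(seed)
    chosen = []
    # Iterate classes in sorted order so the draw is reproducible.
    for label in sorted(by_class):
        indices = by_class[label]
        k = max(1, round(fraction * len(indices)))  # at least one label per class
        chosen.extend(rng.sample(indices, k))
    return sorted(chosen)
```

Saving the returned index list alongside the results lets a reader confirm that CLMM and each baseline saw exactly the same labeled samples.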

minor comments (1)

[Method description] The hard-positive weighting algorithm and quality-guided attention mechanism are described procedurally but lack accompanying equations, loss formulations, or pseudocode, which would aid reproducibility and allow readers to assess their precise contribution to gradient flow and modality fusion.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We have carefully considered each major comment and provide point-by-point responses below. Where appropriate, we have revised the manuscript to enhance clarity and verifiability of our claims and experimental details.

Referee: [Abstract] The assertion that CLMM 'significantly improves state-of-the-art baselines in both recognition accuracy and convergence performance' supplies no quantitative numbers, baseline re-implementation details, ablation studies, or statistical tests, rendering the central empirical claim unverifiable from the provided text.

Authors: We agree that the abstract would benefit from more specific support for the central claim to allow immediate verification. The detailed quantitative results, including accuracy improvements, convergence performance metrics, ablation studies on components such as the CNN-DiffTransformer and hard-positive weighting, and statistical tests are presented in the experimental section of the manuscript. In the revised manuscript, we have updated the abstract to include key quantitative highlights from our experiments while preserving its brevity. This addresses the verifiability concern directly.
Revision: yes

Referee: [Experimental protocol] (presumably §4 or §5) The manuscript does not explicitly confirm that all cited SOTA baselines were re-trained or re-evaluated using the exact same label scarcity fractions, data splits, and limited-label conditions applied to CLMM. If original full-supervision numbers are instead referenced, the accuracy and convergence deltas cannot be attributed to the proposed contrastive components or two-stage strategy.

Authors: We confirm that all state-of-the-art baselines were re-implemented and re-evaluated under the exact same limited-label conditions, including identical label scarcity fractions, data splits, and evaluation protocols as those used for CLMM. This ensures fair comparison and that the observed improvements can be attributed to our proposed two-stage contrastive framework. To make this explicit and eliminate any ambiguity, we have added a clear statement and a dedicated paragraph in the revised experimental protocol section detailing the re-training procedure and confirming the identical settings for all methods.
Revision: yes

Circularity Check

0 steps flagged

No circularity: purely empirical procedural framework with no derivations or self-referential predictions

full rationale

The paper describes a two-stage contrastive learning architecture (CNN-DiffTransformer encoder, hard-positive weighting, quality-guided attention, primary-auxiliary training) and reports empirical accuracy gains on three public datasets. No equations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the abstract or method description. All claims rest on experimental comparisons rather than any derivation chain that reduces to its own inputs by construction. This is the expected non-finding for an applied ML framework paper.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The abstract provides no explicit free parameters, axioms, or invented entities; the framework is described at a high level without mathematical details.

Reference graph

Works this paper leans on

34 extracted references · 34 canonical work pages

[1] R. M. Pascoal, A. de Almeida, and R. C. Sofia, “Activity recognition in outdoor sports environments: smart data for end-users involving mobile pervasive augmented reality systems,” in Adjunct Proceedings of the 2019 ACM International Joint Conference on Pervasive and Ubiquitous Computing and Proceedings of the 2019 ACM International Symposium on Wearable...
[2] F. Rabbi, T. Park, B. Fang, M. Zhang, and Y. Lee, “When virtual reality meets internet of things in the gym: Enabling immersive interactive machine exercises,” Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies, vol. 2, no. 2, pp. 1–21, 2018.
[3] R. Adaimi, H. Yong, and E. Thomaz, “Ok google, what am i doing? acoustic activity recognition bounded by conversational assistant interactions,” Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies, vol. 5, no. 1, pp. 1–24, 2021.
[4] C. Bi, G. Xing, T. Hao, J. Huh, W. Peng, and M. Ma, “Familylog: A mobile system for monitoring family mealtime activities,” in 2017 IEEE International Conference on Pervasive Computing and Communications (PerCom). IEEE, 2017, pp. 21–30.
[5] T. Chen, Y. Yang, X. Fan, X. Guo, J. Xiong, and L. Shangguan, “Exploring the feasibility of remote cardiac auscultation using earphones,” in Proceedings of the 30th Annual International Conference on Mobile Computing and Networking, 2024, pp. 357–372.
[6] X. Guo, L. Tan, T. Chen, C. Gu, Y. Shu, S. He, Y. He, J. Chen, and L. Shangguan, “Exploring biomagnetism for inclusive vital sign monitoring: Modeling and implementation,” in Proceedings of the 30th Annual International Conference on Mobile Computing and Networking, 2024, pp. 93–107.
[7] S. Dai, S. Jiang, Y. Yang, T. Cao, M. Li, S. Banerjee, and L. Qiu, “Advancing multi-modal sensing through expandable modality alignment,” arXiv preprint arXiv:2407.17777, 2024.
[8] H. Xue, W. Jiang, C. Miao, Y. Yuan, F. Ma, X. Ma, Y. Wang, S. Yao, W. Xu, A. Zhang et al., “Deepfusion: A deep learning framework for the fusion of heterogeneous sensory data,” in Proceedings of the Twentieth ACM International Symposium on Mobile Ad Hoc Networking and Computing, 2019, pp. 151–160.
[9] L. C. Kourtis, O. B. Regele, J. M. Wright, and G. B. Jones, “Digital biomarkers for alzheimer’s disease: the mobile/wearable devices opportunity,” NPJ Digital Medicine, vol. 2, no. 1, p. 9, 2019.
[10] H. Haresamudram, I. Essa, and T. Plötz, “Contrastive predictive coding for human activity recognition,” Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies, vol. 5, no. 2, pp. 1–26, 2021.
[11] Z. Hu, T. Yu, Y. Zhang, and S. Pan, “Fine-grained activities recognition with coarse-grained labeled multi-modal data,” in Adjunct Proceedings of the 2020 ACM International Joint Conference on Pervasive and Ubiquitous Computing and Proceedings of the 2020 ACM International Symposium on Wearable Computers, 2020, pp. 644–649.
[12] X. Shuai, Y. Shen, Y. Tang, S. Shi, L. Ji, and G. Xing, “millieye: A lightweight mmwave radar and camera fusion system for robust object detection,” in Proceedings of the International Conference on Internet-of-Things Design and Implementation, 2021, pp. 145–157.
[13] X. Ouyang, Z. Xie, J. Zhou, J. Huang, and G. Xing, “Clusterfl: a similarity-aware federated learning system for human activity recognition,” in Proceedings of the 19th Annual International Conference on Mobile Systems, Applications, and Services, 2021, pp. 54–66.
[14] X. Liu, D. Liu, J. Zhang, T. Gu, and K. Li, “Rfid and camera fusion for recognition of human-object interactions,” in Proceedings of the 27th Annual International Conference on Mobile Computing and Networking, 2021, pp. 296–308.
[15] K. He, H. Fan, Y. Wu, S. Xie, and R. Girshick, “Momentum contrast for unsupervised visual representation learning,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 9729–9738.
[16] X. Fan, L. Shangguan, S. Rupavatharam, Y. Zhang, J. Xiong, Y. Ma, and R. Howard, “Headfi: bringing intelligence to all headphones,” in Proceedings of the 27th Annual International Conference on Mobile Computing and Networking, 2021, pp. 147–159.
[17] T. Hao, G. Xing, and G. Zhou, “isleep: unobtrusive sleep quality monitoring using smartphones,” in Proceedings of the 11th ACM Conference on Embedded Networked Sensor Systems, 2013, pp. 1–14.
[18] S. Münzner, P. Schmidt, A. Reiss, M. Hanselmann, R. Stiefelhagen, and R. Dürichen, “Cnn-based sensor fusion techniques for multimodal human activity recognition,” in Proceedings of the 2017 ACM International Symposium on Wearable Computers, 2017, pp. 158–165.
[19] S. Yao, Y. Zhao, H. Shao, C. Zhang, A. Zhang, S. Hu, D. Liu, S. Liu, L. Su, and T. Abdelzaher, “Sensegan: Enabling deep learning for internet of things with a semi-supervised framework,” Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies, vol. 2, no. 3, pp. 1–21, 2018.
[20] L. Chen, R. Hu, M. Wu, and X. Zhou, “Hmgan: A hierarchical multi-modal generative adversarial network model for wearable human activity recognition,” Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies, vol. 7, no. 3, pp. 1–27, 2023.
[21] A. Jabbar, X. Li, and B. Omar, “A survey on generative adversarial networks: Variants, applications, and training,” ACM Computing Surveys (CSUR), vol. 54, no. 8, pp. 1–49, 2021.
[22] A. O. Ige and M. H. M. Noor, “A survey on unsupervised learning for wearable sensor-based activity recognition,” Applied Soft Computing, vol. 127, p. 109363, 2022.
[23] S. Deldari, H. Xue, A. Saeed, D. V. Smith, and F. D. Salim, “Cocoa: Cross modality contrastive learning for sensor data,” Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies, vol. 6, no. 3, pp. 1–28, 2022.
[24] X. Ouyang, X. Shuai, J. Zhou, I. W. Shi, Z. Xie, G. Xing, and J. Huang, “Cosmo: contrastive fusion learning with small data for multimodal human activity recognition,” in Proceedings of the 28th Annual International Conference on Mobile Computing And Networking, 2022, pp. 324–337.
[25] L. Xu, C. Gu, R. Tan, S. He, and J. Chen, “Mesen: Exploit multimodal data to design unimodal human activity recognition with few labels,” in Proceedings of the 21st ACM Conference on Embedded Networked Sensor Systems, 2023, pp. 1–14.
[26] S. Dai, S. Jiang, Y. Yang, T. Cao, M. Li, S. Banerjee, and L. Qiu, “Babel: A scalable pre-trained model for multi-modal sensing via expandable modality alignment,” in Proceedings of the 23rd ACM Conference on Embedded Networked Sensor Systems, 2025, pp. 240–253.
[27] T. Ye, L. Dong, Y. Xia, Y. Sun, Y. Zhu, G. Huang, and F. Wei, “Differential transformer,” arXiv preprint arXiv:2410.05258, 2024.
[28] Y. Tian, D. Krishnan, and P. Isola, “Contrastive multiview coding,” in European Conference on Computer Vision. Springer, 2020, pp. 776–794.
[29] S. Deldari, D. Spathis, M. Malekzadeh, F. Kawsar, F. D. Salim, and A. Mathur, “Crossl: Cross-modal self-supervised learning for time-series through latent masking,” in Proceedings of the 17th ACM International Conference on Web Search and Data Mining, 2024, pp. 152–160.
[30] S. Miao, L. Chen, and R. Hu, “Spatial-temporal masked autoencoder for multi-device wearable human activity recognition,” Proc. ACM Interact. Mob. Wearable Ubiquitous Technol., vol. 7, no. 4, Jan. 2024. [Online]. Available: https://doi.org/10.1145/3631415
[31] G. Zhu, D. Zhao, C. Li, M. Zhao, Z. Zhang, H. Quan, and H. Ma, “Master: A multi-modal foundation model for human activity recognition,” Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies, vol. 9, no. 3, pp. 1–26, 2025.
[32] C. Chen, R. Jafari, and N. Kehtarnavaz, “Utd-mhad: A multimodal dataset for human action recognition utilizing a depth camera and a wearable inertial sensor,” in 2015 IEEE International Conference on Image Processing (ICIP). IEEE, 2015, pp. 168–172.
[33] A. Reiss and D. Stricker, “Introducing a new benchmarked dataset for activity monitoring,” in 2012 16th International Symposium on Wearable Computers. IEEE, 2012, pp. 108–109.
[34] M. Shoaib, S. Bosch, O. D. Incel, H. Scholten, and P. J. Havinga, “Complex human activity recognition using smartphone and wrist-worn motion sensors,” Sensors, vol. 16, no. 4, p. 426, 2016.

[1] [1]

Activity recognition in outdoor sports environments: smart data for end-users involving mobile pervasive augmented reality systems,

R. M. Pascoal, A. de Almeida, and R. C. Sofia, “Activity recognition in outdoor sports environments: smart data for end-users involving mobile pervasive augmented reality systems,” inAdjunct Proceedings of the 2019 ACM International Joint Conference on Pervasive and Ubiquitous Computing and Proceedings of the 2019 ACM Interna- tional Symposium on Wearable...

work page 2019

[2] [2]

When virtual reality meets internet of things in the gym: Enabling immersive interactive machine exercises,

F. Rabbi, T. Park, B. Fang, M. Zhang, and Y . Lee, “When virtual reality meets internet of things in the gym: Enabling immersive interactive machine exercises,”Proceedings of the ACM on interactive, mobile, wearable and ubiquitous technologies, vol. 2, no. 2, pp. 1–21, 2018

work page 2018

[3] [3]

Ok google, what am i doing? acoustic activity recognition bounded by conversational assistant in- teractions,

R. Adaimi, H. Yong, and E. Thomaz, “Ok google, what am i doing? acoustic activity recognition bounded by conversational assistant in- teractions,”Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies, vol. 5, no. 1, pp. 1–24, 2021

work page 2021

[4] [4]

Familylog: A mobile system for monitoring family mealtime activities,

C. Bi, G. Xing, T. Hao, J. Huh, W. Peng, and M. Ma, “Familylog: A mobile system for monitoring family mealtime activities,” in2017 ieee international conference on pervasive computing and communications (percom). IEEE, 2017, pp. 21–30

work page 2017

[5] [5]

Ex- ploring the feasibility of remote cardiac auscultation using earphones,

T. Chen, Y . Yang, X. Fan, X. Guo, J. Xiong, and L. Shangguan, “Ex- ploring the feasibility of remote cardiac auscultation using earphones,” inProceedings of the 30th Annual International Conference on Mobile Computing and Networking, 2024, pp. 357–372

work page 2024

[6] [6]

Exploring biomagnetism for inclusive vital sign monitoring: Modeling and implementation,

X. Guo, L. Tan, T. Chen, C. Gu, Y . Shu, S. He, Y . He, J. Chen, and L. Shangguan, “Exploring biomagnetism for inclusive vital sign monitoring: Modeling and implementation,” inProceedings of the 30th Annual International Conference on Mobile Computing and Networking, 2024, pp. 93–107

work page 2024

[7] [7]

Ad- vancing multi-modal sensing through expandable modality alignment,

S. Dai, S. Jiang, Y . Yang, T. Cao, M. Li, S. Banerjee, and L. Qiu, “Ad- vancing multi-modal sensing through expandable modality alignment,” arXiv preprint arXiv:2407.17777, 2024

work page arXiv 2024

[8] [8]

Deepfusion: A deep learning framework for the fusion of heterogeneous sensory data,

H. Xue, W. Jiang, C. Miao, Y . Yuan, F. Ma, X. Ma, Y . Wang, S. Yao, W. Xu, A. Zhanget al., “Deepfusion: A deep learning framework for the fusion of heterogeneous sensory data,” inProceedings of the Twentieth ACM International Symposium on Mobile Ad Hoc Networking and Computing, 2019, pp. 151–160

work page 2019

[9] [9]

Dig- ital biomarkers for alzheimer’s disease: the mobile/wearable devices opportunity,

L. C. Kourtis, O. B. Regele, J. M. Wright, and G. B. Jones, “Dig- ital biomarkers for alzheimer’s disease: the mobile/wearable devices opportunity,”NPJ digital medicine, vol. 2, no. 1, p. 9, 2019

work page 2019

[10] [10]

Contrastive predictive coding for human activity recognition,

H. Haresamudram, I. Essa, and T. Pl ¨otz, “Contrastive predictive coding for human activity recognition,”Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies, vol. 5, no. 2, pp. 1–26, 2021

work page 2021

[11] [11]

Fine-grained activities recognition with coarse-grained labeled multi-modal data,

Z. Hu, T. Yu, Y . Zhang, and S. Pan, “Fine-grained activities recognition with coarse-grained labeled multi-modal data,” inAdjunct Proceedings of the 2020 ACM International Joint Conference on pervasive and ubiquitous computing and Proceedings of the 2020 ACM International Symposium on Wearable Computers, 2020, pp. 644–649

work page 2020

[12] [12]

millieye: A lightweight mmwave radar and camera fusion system for robust object detection,

X. Shuai, Y . Shen, Y . Tang, S. Shi, L. Ji, and G. Xing, “millieye: A lightweight mmwave radar and camera fusion system for robust object detection,” inProceedings of the International Conference on Internet-of-Things Design and Implementation, 2021, pp. 145–157

work page 2021

[13] X. Ouyang, Z. Xie, J. Zhou, J. Huang, and G. Xing, “Clusterfl: a similarity-aware federated learning system for human activity recognition,” in Proceedings of the 19th annual international conference on mobile systems, applications, and services, 2021, pp. 54–66

[14] X. Liu, D. Liu, J. Zhang, T. Gu, and K. Li, “RFID and camera fusion for recognition of human-object interactions,” in Proceedings of the 27th Annual International Conference on Mobile Computing and Networking, 2021, pp. 296–308

[15] K. He, H. Fan, Y. Wu, S. Xie, and R. Girshick, “Momentum contrast for unsupervised visual representation learning,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2020, pp. 9729–9738

[16] X. Fan, L. Shangguan, S. Rupavatharam, Y. Zhang, J. Xiong, Y. Ma, and R. Howard, “Headfi: bringing intelligence to all headphones,” in Proceedings of the 27th Annual International Conference on Mobile Computing and Networking, 2021, pp. 147–159

[17] T. Hao, G. Xing, and G. Zhou, “isleep: unobtrusive sleep quality monitoring using smartphones,” in Proceedings of the 11th ACM Conference on Embedded Networked Sensor Systems, 2013, pp. 1–14

[18] S. Münzner, P. Schmidt, A. Reiss, M. Hanselmann, R. Stiefelhagen, and R. Dürichen, “CNN-based sensor fusion techniques for multimodal human activity recognition,” in Proceedings of the 2017 ACM international symposium on wearable computers, 2017, pp. 158–165

[19] S. Yao, Y. Zhao, H. Shao, C. Zhang, A. Zhang, S. Hu, D. Liu, S. Liu, L. Su, and T. Abdelzaher, “Sensegan: Enabling deep learning for internet of things with a semi-supervised framework,” Proceedings of the ACM on interactive, mobile, wearable and ubiquitous technologies, vol. 2, no. 3, pp. 1–21, 2018

[20] L. Chen, R. Hu, M. Wu, and X. Zhou, “Hmgan: A hierarchical multi-modal generative adversarial network model for wearable human activity recognition,” Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies, vol. 7, no. 3, pp. 1–27, 2023

[21] A. Jabbar, X. Li, and B. Omar, “A survey on generative adversarial networks: Variants, applications, and training,” ACM Computing Surveys (CSUR), vol. 54, no. 8, pp. 1–49, 2021

[22] A. O. Ige and M. H. M. Noor, “A survey on unsupervised learning for wearable sensor-based activity recognition,” Applied Soft Computing, vol. 127, p. 109363, 2022

[23] S. Deldari, H. Xue, A. Saeed, D. V. Smith, and F. D. Salim, “Cocoa: Cross modality contrastive learning for sensor data,” Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies, vol. 6, no. 3, pp. 1–28, 2022

[24] X. Ouyang, X. Shuai, J. Zhou, I. W. Shi, Z. Xie, G. Xing, and J. Huang, “Cosmo: contrastive fusion learning with small data for multimodal human activity recognition,” in Proceedings of the 28th Annual International Conference on Mobile Computing And Networking, 2022, pp. 324–337

[25] L. Xu, C. Gu, R. Tan, S. He, and J. Chen, “Mesen: Exploit multimodal data to design unimodal human activity recognition with few labels,” in Proceedings of the 21st ACM Conference on Embedded Networked Sensor Systems, 2023, pp. 1–14

[26] S. Dai, S. Jiang, Y. Yang, T. Cao, M. Li, S. Banerjee, and L. Qiu, “Babel: A scalable pre-trained model for multi-modal sensing via expandable modality alignment,” in Proceedings of the 23rd ACM Conference on Embedded Networked Sensor Systems, 2025, pp. 240–253

[27] T. Ye, L. Dong, Y. Xia, Y. Sun, Y. Zhu, G. Huang, and F. Wei, “Differential transformer,” arXiv preprint arXiv:2410.05258, 2024

[28] Y. Tian, D. Krishnan, and P. Isola, “Contrastive multiview coding,” in European conference on computer vision. Springer, 2020, pp. 776–794

[29] S. Deldari, D. Spathis, M. Malekzadeh, F. Kawsar, F. D. Salim, and A. Mathur, “Crossl: Cross-modal self-supervised learning for time-series through latent masking,” in Proceedings of the 17th ACM International Conference on Web Search and Data Mining, 2024, pp. 152–160

[30] S. Miao, L. Chen, and R. Hu, “Spatial-temporal masked autoencoder for multi-device wearable human activity recognition,” Proc. ACM Interact. Mob. Wearable Ubiquitous Technol., vol. 7, no. 4, Jan. 2024. [Online]. Available: https://doi.org/10.1145/3631415

[31] G. Zhu, D. Zhao, C. Li, M. Zhao, Z. Zhang, H. Quan, and H. Ma, “Master: A multi-modal foundation model for human activity recognition,” Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies, vol. 9, no. 3, pp. 1–26, 2025

[32] C. Chen, R. Jafari, and N. Kehtarnavaz, “Utd-mhad: A multimodal dataset for human action recognition utilizing a depth camera and a wearable inertial sensor,” in 2015 IEEE International conference on image processing (ICIP). IEEE, 2015, pp. 168–172

[33] A. Reiss and D. Stricker, “Introducing a new benchmarked dataset for activity monitoring,” in 2012 16th international symposium on wearable computers. IEEE, 2012, pp. 108–109

[34] M. Shoaib, S. Bosch, O. D. Incel, H. Scholten, and P. J. Havinga, “Complex human activity recognition using smartphone and wrist-worn motion sensors,” Sensors, vol. 16, no. 4, p. 426, 2016