Contrastive Learning for Multimodal Human Activity Recognition with Limited Labeled Data
Pith reviewed 2026-05-08 08:25 UTC · model grok-4.3
The pith
CLMM uses two-stage contrastive learning to raise multimodal human activity recognition accuracy when labels are scarce.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
CLMM is a general contrastive learning framework that achieves effective multimodal recognition with limited labeled data. It employs a novel two-stage training strategy. In the first stage, a CNN-DiffTransformer encoder captures cross-modal shared information by extracting local and global features, while a hard-positive samples weighting algorithm enhances gradient propagation. In the second stage, a dual-branch architecture combining quality-guided attention and bidirectional gated units captures modality-specific information, and a primary-auxiliary collaborative training strategy fuses shared and specific information.
What carries the argument
The CLMM two-stage contrastive framework, where the first stage uses a CNN-DiffTransformer encoder with hard-positive weighting to learn shared features and the second stage uses dual-branch quality-guided attention with primary-auxiliary training to integrate modality-specific features.
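To make the first stage more concrete, here is a minimal sketch of a cross-modal InfoNCE loss with a per-pair hard-positive weight. It is not the paper's algorithm, which the text reviewed here does not specify: the function name `weighted_info_nce`, the weight `(1 - cosine similarity) ** gamma`, and the `temperature` and `gamma` defaults are all assumptions chosen for illustration.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def weighted_info_nce(z_a, z_b, temperature=0.1, gamma=1.0):
    """Cross-modal InfoNCE with a hypothetical hard-positive weight.

    z_a[i] and z_b[i] embed the same time window from two modalities
    (the positive pair); z_b[j] with j != i serve as negatives.
    ASSUMPTION: the weight (1 - cos_sim) ** gamma is illustrative only,
    not the weighting algorithm from the paper.
    """
    z_a = [l2_normalize(v) for v in z_a]
    z_b = [l2_normalize(v) for v in z_b]
    n = len(z_a)
    losses, weights = [], []
    for i in range(n):
        sims = [dot(z_a[i], z_b[j]) for j in range(n)]
        logits = [s / temperature for s in sims]
        m = max(logits)  # subtract the max for a stable log-sum-exp
        log_denom = m + math.log(sum(math.exp(x - m) for x in logits))
        losses.append(log_denom - logits[i])  # -log softmax of the positive
        # Poorly aligned positives (low similarity) receive a larger weight.
        weights.append(max(0.0, 1.0 - sims[i]) ** gamma)
    total_w = sum(weights)
    if total_w == 0:  # every positive pair is already perfectly aligned
        return sum(losses) / n
    return sum(w * x for w, x in zip(weights, losses)) / total_w
```

Up-weighting poorly aligned positive pairs makes them contribute more to the averaged loss, which is one plausible reading of "enhances gradient propagation"; other weighting schemes would fit the description equally well.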
If this is right
- CLMM raises recognition accuracy above current state-of-the-art baselines across three public multimodal datasets.
- CLMM reaches higher accuracy faster during training than prior methods.
- CLMM handles heterogeneous multi-sensor data effectively even when only limited labels are available.
Where Pith is reading between the lines
- The same two-stage contrastive pattern could be tested on other sensor-fusion tasks such as gesture recognition or environmental monitoring.
- Applying CLMM to streaming real-time data would test whether the collaborative training supports low-latency inference.
- Adding mechanisms for automatic modality selection might extend the approach to cases where some sensors are intermittently unavailable.
Load-bearing premise
The CNN-DiffTransformer encoder, hard-positive weighting, quality-guided attention, and primary-auxiliary collaborative training will generalize to new multimodal datasets without overfitting to the limited labels in the three tested cases.
What would settle it
If CLMM were evaluated on a fourth multimodal human activity dataset with scarce labels and failed to beat existing state-of-the-art baselines on accuracy or on convergence speed, the claim that it is a general framework would be seriously weakened.
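One way to make "exceeds the baselines" operational is a paired test over matched runs, with the same seed and split used for both methods. The sketch below is an exact one-sided sign-flip permutation test in plain Python. The function name `paired_sign_flip_pvalue` and the choice of this particular test are assumptions for illustration, not something the paper or this report prescribes.

```python
from itertools import product


def paired_sign_flip_pvalue(acc_new, acc_base):
    """One-sided exact sign-flip test on paired accuracy differences.

    acc_new[i] and acc_base[i] must come from the same seed and split.
    Returns the share of sign patterns whose summed difference is at
    least as large as the observed one (small value = consistent gain).
    Enumerates 2**n patterns, so it is only meant for a handful of runs.
    """
    if len(acc_new) != len(acc_base) or not acc_new:
        raise ValueError("need two non-empty sequences of equal length")
    diffs = [a - b for a, b in zip(acc_new, acc_base)]
    observed = sum(diffs)
    hits = 0
    total = 0
    for signs in product((1, -1), repeat=len(diffs)):
        total += 1
        flipped = sum(s * d for s, d in zip(signs, diffs))
        if flipped >= observed - 1e-12:  # small tolerance for float rounding
            hits += 1
    return hits / total
```

With n paired runs the smallest attainable p-value is 1 / 2**n, so at least five matched runs are needed before the test can fall below 0.05.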
Original abstract
Human activity recognition serves as the foundation for various emerging applications. In recent years, researchers have used collaborative sensing of multi-source sensors to capture complex and dynamic human activities. However, multimodal human activity sensing typically encounters highly heterogeneous data across modalities and label scarcity, resulting in an application gap between existing solutions and real-world needs. In this paper, we propose CLMM, a general contrastive learning framework for human activity recognition that achieves effective multimodal recognition with limited labeled data. CLMM employs a novel two-stage training strategy. In the first stage, CLMM employs a CNN-DiffTransformer encoder to capture cross-modal shared information by extracting local and global features. Meanwhile, a hard-positive samples weighting algorithm enhances gradient propagation to reinforce shared learning. In the second stage, a dual-branch architecture combining quality-guided attention and bidirectional gated units captures modality-specific information, while a primary-auxiliary collaborative training strategy fuses both shared and modality-specific information. Experimental results on three public datasets demonstrate that CLMM significantly improves state-of-the-art baselines in both recognition accuracy and convergence performance.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes CLMM, a contrastive learning framework for multimodal human activity recognition under limited labeled data. It introduces a two-stage training strategy: stage one uses a CNN-DiffTransformer encoder to extract local and global cross-modal shared features, augmented by a hard-positive samples weighting algorithm; stage two employs a dual-branch architecture with quality-guided attention and bidirectional gated units, combined via primary-auxiliary collaborative training to fuse shared and modality-specific information. The central claim is that CLMM significantly improves recognition accuracy and convergence performance over state-of-the-art baselines on three public datasets.
Significance. If the reported gains are supported by fair, controlled experiments under identical limited-label regimes, the work could meaningfully advance label-efficient multimodal HAR by demonstrating how contrastive pre-training and staged fusion can mitigate data heterogeneity and scarcity. The specific architectural choices (CNN-DiffTransformer, hard-positive weighting, quality-guided attention) represent targeted innovations that, if validated, would be of interest to the sensing and activity recognition communities.
major comments (2)
- [Abstract] Abstract: the assertion that CLMM 'significantly improves state-of-the-art baselines in both recognition accuracy and convergence performance' supplies no quantitative numbers, baseline re-implementation details, ablation studies, or statistical tests, rendering the central empirical claim unverifiable from the provided text.
- [Experimental protocol] Experimental protocol (presumably §4 or §5): the manuscript does not explicitly confirm that all cited SOTA baselines were re-trained or re-evaluated using the exact same label scarcity fractions, data splits, and limited-label conditions applied to CLMM. If original full-supervision numbers are instead referenced, the accuracy and convergence deltas cannot be attributed to the proposed contrastive components or two-stage strategy.
minor comments (1)
- [Method description] The hard-positive weighting algorithm and quality-guided attention mechanism are described procedurally but lack accompanying equations, loss formulations, or pseudocode, which would aid reproducibility and allow readers to assess their precise contribution to gradient flow and modality fusion.
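As an illustration of the kind of pseudocode this minor comment asks for, the sketch below shows one possible reading of quality-guided attention: each modality gets a scalar quality score, a softmax turns the scores into attention weights, and the per-modality features are averaged with those weights. Everything here is an assumption, including the function names, the use of a single scalar score per modality, and the softmax; the manuscript's actual mechanism is not given in the text reviewed.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def quality_guided_fusion(features, quality_scores):
    """Fuse per-modality feature vectors using quality-derived weights.

    features: one equal-length feature vector per modality.
    quality_scores: one scalar per modality, higher meaning more reliable.
    ASSUMPTION: how the scores are produced (learned head, signal
    statistics, and so on) is left open; this only shows the weighting step.
    """
    if len(features) != len(quality_scores) or not features:
        raise ValueError("need one quality score per modality")
    weights = softmax(quality_scores)
    dim = len(features[0])
    fused = [sum(w * f[d] for w, f in zip(weights, features)) for d in range(dim)]
    return fused, weights
```

Returning the weights alongside the fused vector makes it easy to inspect which modality dominated a given prediction.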
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We have carefully considered each major comment and provide point-by-point responses below. Where appropriate, we have revised the manuscript to enhance clarity and verifiability of our claims and experimental details.
- Referee: [Abstract] Abstract: the assertion that CLMM 'significantly improves state-of-the-art baselines in both recognition accuracy and convergence performance' supplies no quantitative numbers, baseline re-implementation details, ablation studies, or statistical tests, rendering the central empirical claim unverifiable from the provided text.
  Authors: We agree that the abstract would benefit from more specific support for the central claim to allow immediate verification. The detailed quantitative results, including accuracy improvements, convergence performance metrics, ablation studies on components such as the CNN-DiffTransformer and hard-positive weighting, and statistical tests are presented in the experimental section of the manuscript. In the revised manuscript, we have updated the abstract to include key quantitative highlights from our experiments while preserving its brevity. This addresses the verifiability concern directly.
  Revision: yes
- Referee: [Experimental protocol] Experimental protocol (presumably §4 or §5): the manuscript does not explicitly confirm that all cited SOTA baselines were re-trained or re-evaluated using the exact same label scarcity fractions, data splits, and limited-label conditions applied to CLMM. If original full-supervision numbers are instead referenced, the accuracy and convergence deltas cannot be attributed to the proposed contrastive components or two-stage strategy.
  Authors: We confirm that all state-of-the-art baselines were re-implemented and re-evaluated under the exact same limited-label conditions, including identical label scarcity fractions, data splits, and evaluation protocols as those used for CLMM. This ensures fair comparison and that the observed improvements can be attributed to our proposed two-stage contrastive framework. To make this explicit and eliminate any ambiguity, we have added a clear statement and a dedicated paragraph in the revised experimental protocol section detailing the re-training procedure and confirming the identical settings for all methods.
  Revision: yes
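The fairness condition discussed in this exchange can be sketched as a single seeded, class-stratified draw of labeled indices that every method then reuses. The helper name `sample_labeled_subset`, the `fraction` and `seed` parameters, and the keep-at-least-one-per-class rule are hypothetical; the manuscript's actual splitting procedure is not given in the text reviewed here.

```python
import random
from collections import defaultdict


def sample_labeled_subset(labels, fraction, seed=0):
    """Pick a class-stratified subset of training indices to keep labeled.

    labels: class label for each training sample (ints or strings).
    fraction: share of each class that keeps its label, in (0, 1].
    seed: fixed so CLMM and every baseline see the same labeled subset.
    ASSUMPTION: at least one sample per class stays labeled.
    """
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)
    rng = random.Random(seed)
    chosen = []
    for y in sorted(by_class):  # fixed class order keeps the draw reproducible
        idxs = by_class[y]
        k = max(1, round(fraction * len(idxs)))
        chosen.extend(rng.sample(idxs, k))
    return sorted(chosen)
```

Calling the helper once per label fraction and saving the returned indices to disk would let every baseline be trained on exactly the subset used for CLMM.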
Circularity Check
No circularity: purely empirical procedural framework with no derivations or self-referential predictions
The paper describes a two-stage contrastive learning architecture (CNN-DiffTransformer encoder, hard-positive weighting, quality-guided attention, primary-auxiliary training) and reports empirical accuracy gains on three public datasets. No equations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the abstract or method description. All claims rest on experimental comparisons rather than any derivation chain that reduces to its own inputs by construction. This is the expected non-finding for an applied ML framework paper.