
arxiv: 2604.23281 · v1 · submitted 2026-04-25 · 💻 cs.LG · cs.CV

Contrastive Learning for Multimodal Human Activity Recognition with Limited Labeled Data

Pith reviewed 2026-05-08 08:25 UTC · model grok-4.3

classification 💻 cs.LG cs.CV
keywords contrastive learning · multimodal human activity recognition · limited labeled data · two-stage training · cross-modal features · CNN-DiffTransformer

The pith

CLMM uses two-stage contrastive learning to raise multimodal human activity recognition accuracy when labels are scarce.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces CLMM as a contrastive learning framework for multimodal human activity recognition that works effectively even when labeled data is limited and sensor inputs are heterogeneous. It proposes a two-stage process that first extracts shared cross-modal features with a CNN-DiffTransformer encoder plus hard-positive weighting, then fuses modality-specific details through a dual-branch setup with quality-guided attention and primary-auxiliary training. A sympathetic reader would care because closing this performance gap could make sensor-based activity systems practical for real-world uses such as health tracking or interactive environments where collecting many labels is costly. Experiments on three public datasets show the approach improves both final accuracy and training convergence over prior baselines.

Core claim

CLMM is a general contrastive learning framework that achieves effective multimodal recognition with limited labeled data. It employs a novel two-stage training strategy. In the first stage, a CNN-DiffTransformer encoder captures cross-modal shared information by extracting local and global features, while a hard-positive samples weighting algorithm enhances gradient propagation. In the second stage, a dual-branch architecture combining quality-guided attention and bidirectional gated units captures modality-specific information, and a primary-auxiliary collaborative training strategy fuses shared and specific information.
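
The first-stage idea can be illustrated with a small sketch. The source gives no equations for the hard-positive weighting, so everything below is an assumption made for illustration: an InfoNCE-style cross-modal loss, a hardness score of (1 - cosine similarity) / 2 for each positive pair, and a hypothetical `gamma` knob that controls how strongly harder positives are up-weighted. The name `weighted_info_nce` and its parameters are invented here and are not taken from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def weighted_info_nce(z_a, z_b, temperature=0.1, gamma=1.0):
    """InfoNCE-style cross-modal loss with assumed hard-positive weights.

    z_a[i] and z_b[i] embed the same window from two modalities (a positive
    pair); z_b[j] with j != i act as negatives for anchor z_a[i].
    Returns (loss, per-pair weights).
    """
    n = len(z_a)
    sims = [[cosine(z_a[i], z_b[j]) for j in range(n)] for i in range(n)]

    # Assumption: a positive pair is "hard" when its similarity is low.
    # Hardness lies in [0, 1]; gamma scales its influence on the weight.
    raw = [1.0 + gamma * (1.0 - sims[i][i]) / 2.0 for i in range(n)]
    mean_raw = sum(raw) / n
    weights = [w / mean_raw for w in raw]  # weights average to 1

    per_pair = []
    for i in range(n):
        logits = [s / temperature for s in sims[i]]
        top = max(logits)  # subtract the max for a stable log-sum-exp
        log_denominator = top + math.log(sum(math.exp(x - top) for x in logits))
        per_pair.append(log_denominator - logits[i])

    loss = sum(w * l for w, l in zip(weights, per_pair)) / n
    return loss, weights
```

With `gamma=0` every weight equals 1 and the expression reduces to a plain InfoNCE average, which makes the effect of the weighting easy to isolate in an ablation.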

What carries the argument

The CLMM two-stage contrastive framework, where the first stage uses a CNN-DiffTransformer encoder with hard-positive weighting to learn shared features and the second stage uses dual-branch quality-guided attention with primary-auxiliary training to integrate modality-specific features.
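
A minimal sketch of how quality-guided attention might weight modalities during second-stage fusion. The source does not say how quality is scored or how the attention is computed, so this assumes each modality arrives with a scalar quality score and that the attention weights are a softmax over those scores. `quality_guided_fusion` and its arguments are hypothetical names.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def quality_guided_fusion(features, quality_scores):
    """Fuse per-modality feature vectors using quality-derived weights.

    features: one equal-length feature vector per modality.
    quality_scores: one scalar per modality (assumed to be given; higher
    means the modality is judged more reliable for this sample).
    Returns (fused vector, attention weights).
    """
    if len(features) != len(quality_scores):
        raise ValueError("need exactly one quality score per modality")
    weights = softmax(quality_scores)
    dim = len(features[0])
    fused = [
        sum(w * feat[d] for w, feat in zip(weights, features))
        for d in range(dim)
    ]
    return fused, weights
```

Because the weights sum to 1, a noisy modality with a low score is damped rather than discarded, which is one plausible reading of "quality-guided".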

If this is right

  • CLMM raises recognition accuracy above current state-of-the-art baselines across three public multimodal datasets.
  • CLMM reaches higher accuracy faster during training than prior methods.
  • CLMM handles heterogeneous multi-sensor data effectively even when only limited labels are available.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same two-stage contrastive pattern could be tested on other sensor-fusion tasks such as gesture recognition or environmental monitoring.
  • Applying CLMM to streaming real-time data would test whether the collaborative training supports low-latency inference.
  • Adding mechanisms for automatic modality selection might extend the approach to cases where some sensors are intermittently unavailable.

Load-bearing premise

The CNN-DiffTransformer encoder, hard-positive weighting, quality-guided attention, and primary-auxiliary collaborative training will generalize to new multimodal datasets without overfitting to the limited labels in the three tested cases.

What would settle it

Evaluating CLMM on a fourth multimodal human activity dataset with scarce labels and finding that its accuracy or convergence speed does not exceed existing state-of-the-art baselines would disprove the central claim.
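
One way to make this test operational is to compare two validation-accuracy curves on both criteria. The sketch below is an assumption about how "accuracy" and "convergence speed" could be measured (final-epoch accuracy, and the first epoch reaching a chosen threshold); the source does not define either metric, and `epochs_to_reach` and `compare_runs` are hypothetical helper names.

```python
def epochs_to_reach(accuracy_curve, threshold):
    """Return the 1-based epoch at which accuracy first meets the threshold,
    or None if the curve never gets there."""
    for epoch, accuracy in enumerate(accuracy_curve, start=1):
        if accuracy >= threshold:
            return epoch
    return None


def compare_runs(candidate_curve, baseline_curve, threshold):
    """Compare a candidate method against a baseline on two criteria.

    Both curves hold per-epoch validation accuracy from runs trained under
    the same label budget and data split.
    """
    candidate_epoch = epochs_to_reach(candidate_curve, threshold)
    baseline_epoch = epochs_to_reach(baseline_curve, threshold)

    if candidate_epoch is None:
        faster = False  # candidate never reaches the threshold
    elif baseline_epoch is None:
        faster = True   # only the candidate reaches it
    else:
        faster = candidate_epoch < baseline_epoch

    return {
        "higher_final_accuracy": candidate_curve[-1] > baseline_curve[-1],
        "faster_convergence": faster,
    }
```

A call such as `compare_runs(clmm_curve, baseline_curve, threshold=0.85)` (with curves recorded from actual runs) returns two booleans. In practice the comparison would be repeated over several seeds before drawing a conclusion.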

Figures

Figures reproduced from arXiv: 2604.23281 by Long Jing, Xinlong Feng, Yajun Zhang, Zhixiong Yang.

Figure 1: A typical application of CLMM in multimodal human activity
Figure 2: Overview of CLMM. a discriminator distinguishes real from generated samples, effectively combining limited labels with unlabeled data. HMGAN [20] learns both shared and modality-specific features via a generator, and employs a hierarchical discriminator that computes modality-level and global adversarial losses to produce multimodal samples from limited annotations. However, GANs require training both th…
Figure 4: Workflow of the hard positive samples weighting algorithm. Low
Figure 6: Performance comparison between proposed Primary-Auxiliary
Figure 5: Dual-branch architecture extracts modality-specific information
Figure 9: Accuracy with/without labeled data from target subjects. [Plot: Accuracy (y-axis) by Modality Combinations M1, M2, M3 (x-axis)]
Figure 12: Convergence Performance of CLMM, CMC and Cosmo on the
Figure 13: Hyperparameter sensitivity analysis of CLMM across three datasets.
Original abstract

Human activity recognition serves as the foundation for various emerging applications. In recent years, researchers have used collaborative sensing of multi-source sensors to capture complex and dynamic human activities. However, multimodal human activity sensing typically encounters highly heterogeneous data across modalities and label scarcity, resulting in an application gap between existing solutions and real-world needs. In this paper, we propose CLMM, a general contrastive learning framework for human activity recognition that achieves effective multimodal recognition with limited labeled data. CLMM employs a novel two-stage training strategy. In the first stage, CLMM employs a CNN-DiffTransformer encoder to capture cross-modal shared information by extracting local and global features. Meanwhile, a hard-positive samples weighting algorithm enhances gradient propagation to reinforce shared learning. In the second stage, a dual-branch architecture combining quality-guided attention and bidirectional gated units captures modality-specific information, while a primary-auxiliary collaborative training strategy fuses both shared and modality-specific information. Experimental results on three public datasets demonstrate that CLMM significantly improves state-of-the-art baselines in both recognition accuracy and convergence performance.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes CLMM, a contrastive learning framework for multimodal human activity recognition under limited labeled data. It introduces a two-stage training strategy: stage one uses a CNN-DiffTransformer encoder to extract local and global cross-modal shared features, augmented by a hard-positive samples weighting algorithm; stage two employs a dual-branch architecture with quality-guided attention and bidirectional gated units, combined via primary-auxiliary collaborative training to fuse shared and modality-specific information. The central claim is that CLMM significantly improves recognition accuracy and convergence performance over state-of-the-art baselines on three public datasets.

Significance. If the reported gains are supported by fair, controlled experiments under identical limited-label regimes, the work could meaningfully advance label-efficient multimodal HAR by demonstrating how contrastive pre-training and staged fusion can mitigate data heterogeneity and scarcity. The specific architectural choices (CNN-DiffTransformer, hard-positive weighting, quality-guided attention) represent targeted innovations that, if validated, would be of interest to the sensing and activity recognition communities.

major comments (2)
  1. [Abstract] Abstract: the assertion that CLMM 'significantly improves state-of-the-art baselines in both recognition accuracy and convergence performance' supplies no quantitative numbers, baseline re-implementation details, ablation studies, or statistical tests, rendering the central empirical claim unverifiable from the provided text.
  2. [Experimental protocol] Experimental protocol (presumably §4 or §5): the manuscript does not explicitly confirm that all cited SOTA baselines were re-trained or re-evaluated using the exact same label scarcity fractions, data splits, and limited-label conditions applied to CLMM. If original full-supervision numbers are instead referenced, the accuracy and convergence deltas cannot be attributed to the proposed contrastive components or two-stage strategy.
minor comments (1)
  1. [Method description] The hard-positive weighting algorithm and quality-guided attention mechanism are described procedurally but lack accompanying equations, loss formulations, or pseudocode, which would aid reproducibility and allow readers to assess their precise contribution to gradient flow and modality fusion.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We have carefully considered each major comment and provide point-by-point responses below. Where appropriate, we have revised the manuscript to enhance clarity and verifiability of our claims and experimental details.

Point-by-point responses
  1. Referee: [Abstract] Abstract: the assertion that CLMM 'significantly improves state-of-the-art baselines in both recognition accuracy and convergence performance' supplies no quantitative numbers, baseline re-implementation details, ablation studies, or statistical tests, rendering the central empirical claim unverifiable from the provided text.

    Authors: We agree that the abstract would benefit from more specific support for the central claim to allow immediate verification. The detailed quantitative results, including accuracy improvements, convergence performance metrics, ablation studies on components such as the CNN-DiffTransformer and hard-positive weighting, and statistical tests are presented in the experimental section of the manuscript. In the revised manuscript, we have updated the abstract to include key quantitative highlights from our experiments while preserving its brevity. This addresses the verifiability concern directly. revision: yes

  2. Referee: [Experimental protocol] Experimental protocol (presumably §4 or §5): the manuscript does not explicitly confirm that all cited SOTA baselines were re-trained or re-evaluated using the exact same label scarcity fractions, data splits, and limited-label conditions applied to CLMM. If original full-supervision numbers are instead referenced, the accuracy and convergence deltas cannot be attributed to the proposed contrastive components or two-stage strategy.

    Authors: We confirm that all state-of-the-art baselines were re-implemented and re-evaluated under the exact same limited-label conditions, including identical label scarcity fractions, data splits, and evaluation protocols as those used for CLMM. This ensures fair comparison and that the observed improvements can be attributed to our proposed two-stage contrastive framework. To make this explicit and eliminate any ambiguity, we have added a clear statement and a dedicated paragraph in the revised experimental protocol section detailing the re-training procedure and confirming the identical settings for all methods. revision: yes
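
The "identical label scarcity fractions and data splits" condition in this exchange can be made concrete with a seeded, class-stratified subsample that every method then shares. This is a sketch under stated assumptions, not the authors' protocol: the source does not describe how the labeled subset was drawn, and `limited_label_indices`, `fraction`, and `seed` are hypothetical names.

```python
import random
from collections import defaultdict


def limited_label_indices(labels, fraction, seed=0):
    """Choose which training samples keep their labels.

    Draws roughly `fraction` of the samples from each class (at least one
    per class) with a fixed seed, so the same subset can be handed to the
    proposed method and to every baseline.
    """
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    rng = random.Random(seed)  # local generator; global random state untouched
    chosen = []
    for label in sorted(by_class):  # fixed class order keeps draws reproducible
        indices = by_class[label]
        keep = max(1, round(fraction * len(indices)))
        chosen.extend(rng.sample(indices, keep))
    return sorted(chosen)
```

Saving the returned indices alongside the results lets a reader confirm that each baseline saw exactly the same labeled samples.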

Circularity Check

0 steps flagged

No circularity: purely empirical procedural framework with no derivations or self-referential predictions

Full rationale

The paper describes a two-stage contrastive learning architecture (CNN-DiffTransformer encoder, hard-positive weighting, quality-guided attention, primary-auxiliary training) and reports empirical accuracy gains on three public datasets. No equations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the abstract or method description. All claims rest on experimental comparisons rather than any derivation chain that reduces to its own inputs by construction. This is the expected non-finding for an applied ML framework paper.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The abstract provides no explicit free parameters, axioms, or invented entities; the framework is described at a high level without mathematical details.



Reference graph

Works this paper leans on

34 extracted references · 34 canonical work pages

  1. [1]

    Activity recognition in outdoor sports environments: smart data for end-users involving mobile pervasive augmented reality systems,

    R. M. Pascoal, A. de Almeida, and R. C. Sofia, “Activity recognition in outdoor sports environments: smart data for end-users involving mobile pervasive augmented reality systems,” in Adjunct Proceedings of the 2019 ACM International Joint Conference on Pervasive and Ubiquitous Computing and Proceedings of the 2019 ACM International Symposium on Wearable...

  2. [2]

    When virtual reality meets internet of things in the gym: Enabling immersive interactive machine exercises,

    F. Rabbi, T. Park, B. Fang, M. Zhang, and Y. Lee, “When virtual reality meets internet of things in the gym: Enabling immersive interactive machine exercises,” Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies, vol. 2, no. 2, pp. 1–21, 2018

  3. [3]

    Ok google, what am i doing? acoustic activity recognition bounded by conversational assistant interactions,

    R. Adaimi, H. Yong, and E. Thomaz, “Ok google, what am i doing? acoustic activity recognition bounded by conversational assistant interactions,” Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies, vol. 5, no. 1, pp. 1–24, 2021

  4. [4]

    Familylog: A mobile system for monitoring family mealtime activities,

    C. Bi, G. Xing, T. Hao, J. Huh, W. Peng, and M. Ma, “Familylog: A mobile system for monitoring family mealtime activities,” in 2017 IEEE International Conference on Pervasive Computing and Communications (PerCom). IEEE, 2017, pp. 21–30

  5. [5]

    Exploring the feasibility of remote cardiac auscultation using earphones,

    T. Chen, Y. Yang, X. Fan, X. Guo, J. Xiong, and L. Shangguan, “Exploring the feasibility of remote cardiac auscultation using earphones,” in Proceedings of the 30th Annual International Conference on Mobile Computing and Networking, 2024, pp. 357–372

  6. [6]

    Exploring biomagnetism for inclusive vital sign monitoring: Modeling and implementation,

    X. Guo, L. Tan, T. Chen, C. Gu, Y. Shu, S. He, Y. He, J. Chen, and L. Shangguan, “Exploring biomagnetism for inclusive vital sign monitoring: Modeling and implementation,” in Proceedings of the 30th Annual International Conference on Mobile Computing and Networking, 2024, pp. 93–107

  7. [7]

    Advancing multi-modal sensing through expandable modality alignment,

    S. Dai, S. Jiang, Y. Yang, T. Cao, M. Li, S. Banerjee, and L. Qiu, “Advancing multi-modal sensing through expandable modality alignment,” arXiv preprint arXiv:2407.17777, 2024

  8. [8]

    Deepfusion: A deep learning framework for the fusion of heterogeneous sensory data,

    H. Xue, W. Jiang, C. Miao, Y. Yuan, F. Ma, X. Ma, Y. Wang, S. Yao, W. Xu, A. Zhang et al., “Deepfusion: A deep learning framework for the fusion of heterogeneous sensory data,” in Proceedings of the Twentieth ACM International Symposium on Mobile Ad Hoc Networking and Computing, 2019, pp. 151–160

  9. [9]

    Digital biomarkers for Alzheimer’s disease: the mobile/wearable devices opportunity,

    L. C. Kourtis, O. B. Regele, J. M. Wright, and G. B. Jones, “Digital biomarkers for Alzheimer’s disease: the mobile/wearable devices opportunity,” NPJ Digital Medicine, vol. 2, no. 1, p. 9, 2019

  10. [10]

    Contrastive predictive coding for human activity recognition,

    H. Haresamudram, I. Essa, and T. Plötz, “Contrastive predictive coding for human activity recognition,” Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies, vol. 5, no. 2, pp. 1–26, 2021

  11. [11]

    Fine-grained activities recognition with coarse-grained labeled multi-modal data,

    Z. Hu, T. Yu, Y. Zhang, and S. Pan, “Fine-grained activities recognition with coarse-grained labeled multi-modal data,” in Adjunct Proceedings of the 2020 ACM International Joint Conference on Pervasive and Ubiquitous Computing and Proceedings of the 2020 ACM International Symposium on Wearable Computers, 2020, pp. 644–649

  12. [12]

    millieye: A lightweight mmwave radar and camera fusion system for robust object detection,

    X. Shuai, Y. Shen, Y. Tang, S. Shi, L. Ji, and G. Xing, “millieye: A lightweight mmwave radar and camera fusion system for robust object detection,” in Proceedings of the International Conference on Internet-of-Things Design and Implementation, 2021, pp. 145–157

  13. [13]

    Clusterfl: a similarity-aware federated learning system for human activity recognition,

    X. Ouyang, Z. Xie, J. Zhou, J. Huang, and G. Xing, “Clusterfl: a similarity-aware federated learning system for human activity recognition,” in Proceedings of the 19th Annual International Conference on Mobile Systems, Applications, and Services, 2021, pp. 54–66

  14. [14]

    Rfid and camera fusion for recognition of human-object interactions,

    X. Liu, D. Liu, J. Zhang, T. Gu, and K. Li, “Rfid and camera fusion for recognition of human-object interactions,” in Proceedings of the 27th Annual International Conference on Mobile Computing and Networking, 2021, pp. 296–308

  15. [15]

    Momentum contrast for unsupervised visual representation learning,

    K. He, H. Fan, Y. Wu, S. Xie, and R. Girshick, “Momentum contrast for unsupervised visual representation learning,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 9729–9738

  16. [16]

    Headfi: bringing intelligence to all headphones,

    X. Fan, L. Shangguan, S. Rupavatharam, Y. Zhang, J. Xiong, Y. Ma, and R. Howard, “Headfi: bringing intelligence to all headphones,” in Proceedings of the 27th Annual International Conference on Mobile Computing and Networking, 2021, pp. 147–159

  17. [17]

    isleep: unobtrusive sleep quality monitoring using smartphones,

    T. Hao, G. Xing, and G. Zhou, “isleep: unobtrusive sleep quality monitoring using smartphones,” in Proceedings of the 11th ACM Conference on Embedded Networked Sensor Systems, 2013, pp. 1–14

  18. [18]

    Cnn-based sensor fusion techniques for multimodal human activity recognition,

    S. Münzner, P. Schmidt, A. Reiss, M. Hanselmann, R. Stiefelhagen, and R. Dürichen, “Cnn-based sensor fusion techniques for multimodal human activity recognition,” in Proceedings of the 2017 ACM International Symposium on Wearable Computers, 2017, pp. 158–165

  19. [19]

    Sensegan: Enabling deep learning for internet of things with a semi-supervised framework,

    S. Yao, Y. Zhao, H. Shao, C. Zhang, A. Zhang, S. Hu, D. Liu, S. Liu, L. Su, and T. Abdelzaher, “Sensegan: Enabling deep learning for internet of things with a semi-supervised framework,” Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies, vol. 2, no. 3, pp. 1–21, 2018

  20. [20]

    Hmgan: A hierarchical multi-modal generative adversarial network model for wearable human activity recognition,

    L. Chen, R. Hu, M. Wu, and X. Zhou, “Hmgan: A hierarchical multi-modal generative adversarial network model for wearable human activity recognition,” Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies, vol. 7, no. 3, pp. 1–27, 2023

  21. [21]

    A survey on generative adversarial networks: Variants, applications, and training,

    A. Jabbar, X. Li, and B. Omar, “A survey on generative adversarial networks: Variants, applications, and training,” ACM Computing Surveys (CSUR), vol. 54, no. 8, pp. 1–49, 2021

  22. [22]

    A survey on unsupervised learning for wearable sensor-based activity recognition,

    A. O. Ige and M. H. M. Noor, “A survey on unsupervised learning for wearable sensor-based activity recognition,” Applied Soft Computing, vol. 127, p. 109363, 2022

  23. [23]

    Cocoa: Cross modality contrastive learning for sensor data,

    S. Deldari, H. Xue, A. Saeed, D. V. Smith, and F. D. Salim, “Cocoa: Cross modality contrastive learning for sensor data,” Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies, vol. 6, no. 3, pp. 1–28, 2022

  24. [24]

    Cosmo: contrastive fusion learning with small data for multimodal human activity recognition,

    X. Ouyang, X. Shuai, J. Zhou, I. W. Shi, Z. Xie, G. Xing, and J. Huang, “Cosmo: contrastive fusion learning with small data for multimodal human activity recognition,” in Proceedings of the 28th Annual International Conference on Mobile Computing and Networking, 2022, pp. 324–337

  25. [25]

    Mesen: Exploit multimodal data to design unimodal human activity recognition with few labels,

    L. Xu, C. Gu, R. Tan, S. He, and J. Chen, “Mesen: Exploit multimodal data to design unimodal human activity recognition with few labels,” in Proceedings of the 21st ACM Conference on Embedded Networked Sensor Systems, 2023, pp. 1–14

  26. [26]

    Babel: A scalable pre-trained model for multi-modal sensing via expandable modality alignment,

    S. Dai, S. Jiang, Y. Yang, T. Cao, M. Li, S. Banerjee, and L. Qiu, “Babel: A scalable pre-trained model for multi-modal sensing via expandable modality alignment,” in Proceedings of the 23rd ACM Conference on Embedded Networked Sensor Systems, 2025, pp. 240–253

  27. [27]

    Differential transformer,

    T. Ye, L. Dong, Y. Xia, Y. Sun, Y. Zhu, G. Huang, and F. Wei, “Differential transformer,” arXiv preprint arXiv:2410.05258, 2024

  28. [28]

    Contrastive multiview coding,

    Y. Tian, D. Krishnan, and P. Isola, “Contrastive multiview coding,” in European Conference on Computer Vision. Springer, 2020, pp. 776–794

  29. [29]

    Crossl: Cross-modal self-supervised learning for time-series through latent masking,

    S. Deldari, D. Spathis, M. Malekzadeh, F. Kawsar, F. D. Salim, and A. Mathur, “Crossl: Cross-modal self-supervised learning for time-series through latent masking,” in Proceedings of the 17th ACM International Conference on Web Search and Data Mining, 2024, pp. 152–160

  30. [30]

    Spatial-temporal masked autoencoder for multi-device wearable human activity recognition,

    S. Miao, L. Chen, and R. Hu, “Spatial-temporal masked autoencoder for multi-device wearable human activity recognition,” Proc. ACM Interact. Mob. Wearable Ubiquitous Technol., vol. 7, no. 4, Jan. 2024. [Online]. Available: https://doi.org/10.1145/3631415

  31. [31]

    Master: A multi-modal foundation model for human activity recognition,

    G. Zhu, D. Zhao, C. Li, M. Zhao, Z. Zhang, H. Quan, and H. Ma, “Master: A multi-modal foundation model for human activity recognition,” Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies, vol. 9, no. 3, pp. 1–26, 2025

  32. [32]

    Utd-mhad: A multimodal dataset for human action recognition utilizing a depth camera and a wearable inertial sensor,

    C. Chen, R. Jafari, and N. Kehtarnavaz, “Utd-mhad: A multimodal dataset for human action recognition utilizing a depth camera and a wearable inertial sensor,” in 2015 IEEE International Conference on Image Processing (ICIP). IEEE, 2015, pp. 168–172

  33. [33]

    Introducing a new benchmarked dataset for activity monitoring,

    A. Reiss and D. Stricker, “Introducing a new benchmarked dataset for activity monitoring,” in 2012 16th International Symposium on Wearable Computers. IEEE, 2012, pp. 108–109

  34. [34]

    Complex human activity recognition using smartphone and wrist-worn motion sensors,

    M. Shoaib, S. Bosch, O. D. Incel, H. Scholten, and P. J. Havinga, “Complex human activity recognition using smartphone and wrist-worn motion sensors,” Sensors, vol. 16, no. 4, p. 426, 2016