arxiv: 2606.27886 · v1 · pith:NUHPWIHZnew · submitted 2026-06-26 · 💻 cs.LG

A Comparison of Fusion Techniques for Multi-Modal Human Activity Recognition on the HARMES Dataset

Pith reviewed 2026-06-29 04:29 UTC · model grok-4.3

classification 💻 cs.LG
keywords multi-modal fusion · human activity recognition · sensor fusion · gated fusion · HARMES dataset · IMU · audio · humidity

The pith

Gated Multi-modal Fusion reaches 0.82 macro F1 on the HARMES dataset, six points above late-fusion concatenation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper systematically tests seven sensor fusion methods on the same multi-modal architecture using the HARMES dataset of 61 hours of IMU, audio, and humidity recordings for 15 household activities. It reports that Gated Multi-modal Fusion produces the highest macro F1-score of 0.82 under leave-one-participant-out evaluation, beating the concatenation-based late fusion baseline of 0.76 by six percentage points. A sympathetic reader would care because clearer guidance on how to combine wearable sensor streams can raise accuracy in practical daily-living recognition tasks without requiring new sensor hardware or model redesigns.
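For readers less familiar with the metric, the sketch below shows how a macro F1-score is computed: per-class F1 values are averaged without weighting by class frequency, so rare activities count as much as common ones. The labels are hypothetical toy data, not HARMES recordings, and the function is an illustrative stdlib implementation rather than the authors' evaluation code.

```python
from collections import Counter

def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all classes seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class gains a false positive
            fn[t] += 1  # true class gains a false negative
    f1s = []
    for c in labels:
        denom = 2 * tp[c] + fp[c] + fn[c]
        f1s.append(2 * tp[c] / denom if denom else 0.0)
    return sum(f1s) / len(f1s)

# Hypothetical toy labels, not HARMES data.
y_true = ["wash", "wash", "tea", "tea", "brush", "brush"]
y_pred = ["wash", "tea", "tea", "tea", "brush", "wash"]
score = macro_f1(y_true, y_pred)

# The headline gap from the summary above, in percentage points.
gap_pp = round((0.82 - 0.76) * 100)
```

Because the mean is unweighted, a method that improves only the low-support classes can move macro F1 noticeably even if overall accuracy barely changes.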

Core claim

By applying the seven different fusion techniques to a state-of-the-art multi-modal model architecture on the HARMES dataset, which comprises 61 hours of fully labeled IMU, audio, and ambient humidity data for 15 household and personal hygiene activities, we show that Gated Multi-modal Fusion achieves the highest macro F1-score (0.82), surpassing the concatenation-based late fusion HARMES paper baseline of 0.76 by +6pp under leave-one-participant-out evaluation.

What carries the argument

Gated Multi-modal Fusion, a mechanism that uses learned gates to dynamically control the contribution of each modality (IMU, audio, humidity) before or during combination in the shared model.
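As a rough illustration of what "learned gates" can mean, the sketch below implements one common formulation in the style of gated multimodal units: each modality embedding is projected through tanh, a sigmoid gate per modality is computed from the concatenation of all embeddings, and the fused vector is the gate-weighted sum. This page does not specify the paper's exact GMF layer, so the structure, the function names, and the toy dimensions are assumptions; the weights are random stand-ins for learned parameters.

```python
import math
import random

def linear(weights, x):
    """Matrix-vector product; weights is a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def gated_fusion(embeddings, proj, gate):
    """Fuse per-modality embeddings with element-wise sigmoid gates.

    embeddings: dict modality -> list[float]
    proj[m]:    hidden_dim x dim(m) matrix for the tanh projection of modality m
    gate[m]:    hidden_dim x total_dim matrix producing modality m's gate
                from the concatenation of all embeddings
    """
    concat = [v for m in embeddings for v in embeddings[m]]
    hidden_dim = len(next(iter(proj.values())))
    fused = [0.0] * hidden_dim
    gates = {}
    for m, e in embeddings.items():
        h = [math.tanh(v) for v in linear(proj[m], e)]
        z = [sigmoid(v) for v in linear(gate[m], concat)]
        gates[m] = z
        fused = [f + zi * hi for f, zi, hi in zip(fused, z, h)]
    return fused, gates

# Toy sizes for readability; Figure 1 reports 128-dimensional embeddings.
rng = random.Random(0)
dims = {"imu": 4, "audio": 4, "humidity": 4}
hidden_dim = 3
total_dim = sum(dims.values())

def rand_matrix(rows, cols):
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

embeddings = {m: [rng.uniform(-1, 1) for _ in range(d)] for m, d in dims.items()}
proj = {m: rand_matrix(hidden_dim, d) for m, d in dims.items()}
gate = {m: rand_matrix(hidden_dim, total_dim) for m in dims}

fused, gates = gated_fusion(embeddings, proj, gate)
```

Each gate value lies strictly between 0 and 1 and depends on all three inputs, which is what lets such a layer down-weight a modality when another modality suggests it is uninformative for the current window.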

Load-bearing premise

The seven fusion techniques were applied to an identical state-of-the-art multi-modal model architecture with no architecture changes that inadvertently favor one fusion method over others.
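One lightweight way to audit this premise is to diff the experiment configurations and confirm that only the fusion entry changes between variants. The sketch below is a hypothetical check: the config layout, key names, and hyperparameter values are invented for illustration and are not taken from the paper or its repository.

```python
def differing_keys(cfg_a, cfg_b, prefix=""):
    """Return the dotted paths at which two nested config dicts differ."""
    diffs = []
    for key in sorted(set(cfg_a) | set(cfg_b)):
        path = f"{prefix}{key}"
        a, b = cfg_a.get(key), cfg_b.get(key)
        if isinstance(a, dict) and isinstance(b, dict):
            diffs += differing_keys(a, b, prefix=path + ".")
        elif a != b:
            diffs.append(path)
    return diffs

# Hypothetical configs; encoder names follow the unimodal models named in
# the figure captions, all other values are placeholders.
base = {
    "encoders": {"imu": "TinyHAR", "audio": "AST", "humidity": "TSMixer"},
    "embedding_dim": 128,
    "optimizer": {"name": "adamw", "lr": 1e-3},
    "fusion": "late_concat",
}
gmf = {**base, "fusion": "gmf"}
tuned = {**gmf, "optimizer": {"name": "adamw", "lr": 3e-4}}

controlled = differing_keys(base, gmf)      # only the fusion operator differs
confounded = differing_keys(base, tuned)    # a learning-rate change slipped in
```

A comparison is cleanly controlled in this sense only when the diff is exactly the fusion key; any extra path, such as a changed learning rate, is a candidate confound.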

What would settle it

Re-training the same seven fusion variants after swapping the underlying feature extractor or adding modality-specific branches, then observing whether the six-point gap between Gated Multi-modal Fusion (0.82) and late fusion (0.76) disappears.

Figures

Figures reproduced from arXiv: 2606.27886 by Ahmed Mohamady, Kristof Van Laerhoven, Robin Burchard.

Figure 1: Overview of the multi-modal pipeline. Each modality is encoded into a 128-dimensional embedding by a dedicated encoder. The three embeddings are then combined by one of seven interchangeable fusion methods to predict the activity class. The fusion block is held generic here. The seven concrete architectures are detailed in Section 3.3 and Appendix A.
Figure 2: Fusion method comparison on 3-fold group cross-validation (macro F1). All methods use the three modalities jointly. The dashed line marks the best unimodal baseline (AST).
Figure 3: Per-class confusion-matrix difference, GMF minus AST (best unimodal). Note the interpretation: Red, positive values on the diagonal mark classes recognised more reliably under fusion. Blue, negative off-diagonal entries mark confusions that fusion removes. Blue, negative values on the diagonal, or off-diagonal red values mean that the GMF model performed worse than the unimodal AST for the specific confusi…
Figure 4: Per-class macro F1 scores for each fusion strategy. Results presented are averaged over the three-fold CV. Models are sorted in descending order by their global performance, from left to right.
Figure 5: Pooled confusion matrix for GMF under leave-one-participant-out evaluation, aggregated over all 20 held-out participants.
Figure 6: Heatmap table showing per-participant macro F1 scores for each participant and model (3-fold CV). Left-handed participants are marked in red font. The three leftmost models are unimodal: TinyHAR (IMU), AST (Audio), TSMixer (humidity). The worst results on each left-handed participant are marked in red, and the best results on them are marked in green.
Figure 7: 3-fold CV macro F1 performance per model, split by dominant hand into left-handers and right-handers. The plot shows both unimodal (AST: audio, TSMixer: humidity, TinyHAR: IMU) and multi-modal methods (all others), sorted by performance gap between groups of left-handers and right-handers, descending from left to right.
Figure 8: Gated Multi-modal Fusion (GMF). Diagram labels extracted with this caption, titled "Late Fusion (concatenation)": input [e_imu, e_audio, e_hum] ∈ ℝ^384; FC(384 → 256); Dropout(0.3); GELU; LayerNorm; FC(256 → 16); output y ∈ ℝ^16 (16 classes).
Figure 9: Late Fusion (concatenation). Diagram labels extracted with this caption, titled "Cross-Modal Attention (CMA)": Linear (128 → 256) per modality; cross-attention (CA) blocks Q←Audio with K,V←IMU and K,V←Hum, Q←IMU with K,V←Audio and K,V←Hum, Q←Hum with K,V←IMU and K,V←Audio, each pair followed by "+" and LayerNorm; ×2 layers; per-modality outputs in ℝ^256; [h_imu, h_audio, h_hum] ∈ ℝ^768; FC (768 → 256); LayerNorm; GELU; Dropout (0.1); FC (256 → 16); output y ∈ ℝ^16 (16 classes).
Figure 10: Cross-Modal Attention (CMA)
Figure 11: CLS-Token Transformer. Diagram labels extracted with this caption, titled "Multimodal Bottleneck Transformer (MBT)": Linear (128 → 256) per modality; bottleneck tokens (4 × ℝ^256); Transformer encoder layers (IMU + Bottleneck), (Audio + Bottleneck), (Humidity + Bottleneck), each pre-LN, 8 heads, FFN 1024 …; LayerNorm; mean-pool, ℝ^256; FC(256 → 16); output y ∈ ℝ^16 (16 classes).
Figure 12: Multi-modal Bottleneck Transformer. Diagram labels extracted with this caption, titled "Low-Rank Multimodal Fusion (LMF)": per modality, Linear 128→16, ReLU, Dropout(0.3), [h, 1] ∈ ℝ^65, W_imu / W_audio / W_hum, tanh; fused ∈ ℝ^32; W_out ∈ ℝ^(32×16); output y ∈ ℝ^16 (16 classes).
Figure 13: Low-Rank multi-modal Fusion
Figure 14: Decision Fusion
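The late-fusion diagram labels captured in the figure list above (three 128-dimensional embeddings concatenated into ℝ^384, then FC(384 → 256) and FC(256 → 16)) allow a quick sanity check on the size of the baseline fusion head. The sketch below only does the arithmetic. It assumes bias terms on both linear layers and an affine LayerNorm, neither of which is stated on this page; Dropout and GELU contribute no parameters.

```python
EMBED_DIM = 128  # per-modality embedding size (Figure 1 caption)
MODALITIES = ("imu", "audio", "humidity")

def linear_params(n_in, n_out, bias=True):
    """Parameter count of a fully connected layer."""
    return n_in * n_out + (n_out if bias else 0)

concat_dim = EMBED_DIM * len(MODALITIES)  # 3 * 128 = 384

late_fusion_head = {
    "fc1": linear_params(concat_dim, 256),  # FC(384 -> 256)
    "layernorm": 2 * 256,                   # scale and shift (assumed affine)
    "fc2": linear_params(256, 16),          # FC(256 -> 16)
}
total_params = sum(late_fusion_head.values())

# The CMA labels are consistent in the same way: three 256-dimensional
# streams concatenate to the 768 inputs of FC(768 -> 256).
cma_concat_dim = 256 * len(MODALITIES)
```

Under these assumptions the late-fusion head has roughly 103 thousand parameters, almost all of them in the first fully connected layer.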
Original abstract

Recent advances in Human Activity Recognition (HAR) from wearable sensors have shown that multi-modal deep learning models consistently outperform their uni-modal counterparts. Modalities can include IMUs, RGB cameras, audio signals, and others. One important aspect of multi-modal deep learning is the sensor fusion approach we apply. Over recent years, multiple fusion paradigms have been proposed for multi-modal HAR. However, to the best of our knowledge, no head-to-head comparison of these paradigms exists on a common multi-modal HAR benchmark dataset. To address this research gap, we systematically compare seven state-of-the-art sensor fusion methods on the recently released HARMES dataset, which comprises 61 hours of fully labeled IMU, audio, and ambient humidity data. The chosen dataset focuses on 15 household and personal hygiene activities of daily living (ADLs). By applying the seven different fusion techniques to a state-of-the-art multi-modal model architecture, we show that Gated Multi-modal Fusion achieves the highest macro F1-score (0.82), surpassing the concatenation-based late fusion HARMES paper baseline of 0.76 by +6pp under leave-one-participant-out evaluation. All code used in our experiments is made publicly available on GitHub.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript performs a head-to-head empirical comparison of seven sensor fusion techniques (including gated multi-modal fusion and concatenation-based late fusion) applied to a multi-modal HAR model on the HARMES dataset (IMU + audio + humidity, 15 ADLs). Under leave-one-participant-out evaluation it reports that gated multi-modal fusion attains the highest macro F1 of 0.82, outperforming the HARMES baseline of 0.76 by 6 percentage points. All code is released publicly.

Significance. If the experimental controls are sound, the work supplies a useful, reproducible benchmark for choosing fusion operators in multi-modal wearable HAR. The public GitHub release is a clear strength that enables direct verification of the reported ranking.

major comments (2)
  1. [Abstract] Abstract: the central claim that Gated Multi-modal Fusion outperforms the other six techniques by 6 pp rests on the assertion that all seven methods were applied to “a state-of-the-art multi-modal model architecture” with “no architecture changes.” No explicit confirmation is given that the unimodal feature extractors, their output dimensionalities, the downstream classifier, optimizer schedule, learning-rate scaling, and regularization were held strictly fixed while only the fusion operator was exchanged. If any variant required even modest hyper-parameter adjustments for training stability, the observed ranking could be an artifact rather than an intrinsic property of the fusion method.
  2. [Results] Results (implied by the reported F1 scores): the macro F1 values are presented as single point estimates with neither error bars, standard deviations across random seeds, nor statistical significance tests comparing the seven methods. Without these, it is impossible to determine whether the +6 pp margin is reliable or could be explained by training stochasticity.
minor comments (1)
  1. [§2] The abstract states that the dataset comprises “61 hours of fully labeled” data; a brief table or sentence in §2 confirming the exact number of participants, recording duration per participant, and class distribution would improve clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback. We address each major comment below. Where the points identify gaps in explicit documentation or statistical reporting, we agree that revisions are warranted and will update the manuscript accordingly.

Point-by-point responses
  1. Referee: [Abstract] Abstract: the central claim that Gated Multi-modal Fusion outperforms the other six techniques by 6 pp rests on the assertion that all seven methods were applied to “a state-of-the-art multi-modal model architecture” with “no architecture changes.” No explicit confirmation is given that the unimodal feature extractors, their output dimensionalities, the downstream classifier, optimizer schedule, learning-rate scaling, and regularization were held strictly fixed while only the fusion operator was exchanged. If any variant required even modest hyper-parameter adjustments for training stability, the observed ranking could be an artifact rather than an intrinsic property of the fusion method.

    Authors: The manuscript states that the seven fusion techniques were applied to the same state-of-the-art multi-modal model architecture with no architecture changes. In practice, the unimodal feature extractors, their output dimensionalities, the downstream classifier, optimizer, learning-rate schedule, and regularization were held identical across all variants; only the fusion operator itself was exchanged. We will add an explicit paragraph in the Methods section of the revised manuscript confirming these controls in detail to remove any ambiguity. revision: yes

  2. Referee: [Results] Results (implied by the reported F1 scores): the macro F1 values are presented as single point estimates with neither error bars, standard deviations across random seeds, nor statistical significance tests comparing the seven methods. Without these, it is impossible to determine whether the +6 pp margin is reliable or could be explained by training stochasticity.

    Authors: We agree that single-run point estimates limit the ability to assess variability due to training stochasticity. Leave-one-participant-out evaluation on this dataset is computationally expensive, which is why we initially reported single runs. In the revision we will either (a) rerun all seven methods with three random seeds and report means and standard deviations or (b) add a clear limitations statement explaining the single-run protocol and the rationale. We will also include pairwise statistical significance tests (e.g., McNemar or paired t-tests on per-participant scores) where multiple runs become available. revision: yes
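To make the proposed per-participant comparison concrete, the sketch below computes a paired t statistic and, as a distribution-free counterpart, an exact sign-flip permutation p-value on paired macro F1 scores. The permutation test is a swapped-in illustration, not something the rebuttal names, and the scores are hypothetical numbers for eight participants rather than results from the paper. A p-value for the t statistic would additionally need the t distribution, for example via SciPy's `scipy.stats.ttest_rel`.

```python
import math
from itertools import product
from statistics import mean, stdev

def paired_t_statistic(a, b):
    """t statistic of the mean paired difference (a minus b)."""
    d = [x - y for x, y in zip(a, b)]
    return mean(d) / (stdev(d) / math.sqrt(len(d)))

def sign_flip_p_value(a, b):
    """Exact two-sided sign-flip permutation p-value for the paired differences.

    Enumerates all 2**n sign patterns, so it is only practical for small n.
    """
    d = [x - y for x, y in zip(a, b)]
    observed = abs(sum(d))
    hits = 0
    total = 0
    for signs in product((1, -1), repeat=len(d)):
        total += 1
        if abs(sum(s * v for s, v in zip(signs, d))) >= observed - 1e-12:
            hits += 1
    return hits / total

# Hypothetical per-participant macro F1 scores (not from the paper).
gmf_scores = [0.84, 0.80, 0.79, 0.86, 0.81, 0.83, 0.78, 0.85]
late_scores = [0.78, 0.77, 0.75, 0.79, 0.80, 0.76, 0.74, 0.79]

t_stat = paired_t_statistic(gmf_scores, late_scores)
p_value = sign_flip_p_value(gmf_scores, late_scores)
```

With all eight hypothetical differences positive, only the all-plus and all-minus sign patterns reach the observed sum, so the exact p-value is 2/256. For the 20 held-out participants mentioned in the Figure 5 caption, full enumeration covers 2^20 patterns, which is still feasible but slower; random sampling of sign patterns is a common alternative.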

Circularity Check

0 steps flagged

No circularity: purely empirical comparison of fusion methods with measured outcomes

full rationale

The manuscript performs a head-to-head experimental comparison of seven fusion techniques on the HARMES dataset under leave-one-participant-out evaluation. Reported macro F1 scores (e.g., 0.82 for gated fusion vs. 0.76 baseline) are direct measurements on held-out participants, not quantities derived from equations or fitted parameters that reduce to the inputs by construction. No self-definitional steps, fitted-input predictions, or load-bearing self-citations appear, and there is no derivation chain to audit because the central claim is observational rather than deductive. The skeptic's concern about architecture identity is a validity issue, not a circularity issue.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the assumption that HARMES constitutes a fair, representative benchmark and that the shared model architecture treats all fusion methods equivalently; no new entities are postulated and the only free parameters are standard deep-learning hyperparameters.

free parameters (1)
  • model hyperparameters and training settings
    Deep learning fusion models contain numerous tunable parameters whose values affect the reported F1 scores; the abstract does not list them.
axioms (1)
  • domain assumption: the HARMES dataset and leave-one-participant-out protocol constitute a valid and unbiased benchmark for comparing fusion methods
    All performance claims depend on this dataset and split being representative of real-world multi-modal HAR.
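For reference, a leave-one-participant-out protocol can be sketched as below: every fold holds out all windows of one participant for testing and trains on the rest, so no person appears on both sides of a split. The participant IDs and window assignment are hypothetical, and the helper is an illustrative stdlib version; scikit-learn's `LeaveOneGroupOut` provides the same kind of split.

```python
def leave_one_participant_out(participant_ids):
    """Yield (held_out, train_idx, test_idx), one fold per participant."""
    for held_out in sorted(set(participant_ids)):
        train_idx = [i for i, p in enumerate(participant_ids) if p != held_out]
        test_idx = [i for i, p in enumerate(participant_ids) if p == held_out]
        yield held_out, train_idx, test_idx

# Hypothetical window-to-participant assignment (not HARMES data).
window_participants = ["p01", "p01", "p02", "p03", "p03", "p03"]
folds = list(leave_one_participant_out(window_participants))
```

The number of folds equals the number of participants, which is why the protocol is expensive for deep models: each fusion variant has to be trained once per held-out person.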

