pith. sign in

arxiv: 2604.16577 · v1 · submitted 2026-04-17 · 💻 cs.CV · cs.AI

Multilevel neural networks with dual-stage feature fusion for human activity recognition

Pith reviewed 2026-05-10 09:03 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords human activity recognitionfeature fusionmultilevel neural networksCNNLSTMconvolutional LSTMsensor datalate fusion
0
0 comments X

The pith

Combining late and intermediate feature fusion in multilevel neural networks raises accuracy for human activity recognition.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether a two-level neural network that fuses outputs from the first level (late fusion) and also merges features between the first and second levels (intermediate fusion) can outperform networks that use only late fusion. It builds and compares 15 architectures built from CNNs, LSTMs, and convolutional LSTMs, applying the dual fusion to each. On two standard public datasets the dual-fusion versions produce higher accuracy than the late-fusion-only versions, and the strongest dual-fusion model also beats established baseline networks. This outcome indicates that the added intermediate stage lets the system draw on complementary information from different depths of processing.

Core claim

A multilevel architecture performs late fusion on the outputs of its first network level and intermediate fusion between features from the first and second levels. Across the 15 evaluated combinations of CNN, LSTM, and convolutional LSTM components, every architecture that includes both fusion stages records higher classification accuracy than its late-fusion-only counterpart on two public benchmark datasets for human activity recognition. The single best dual-stage configuration also exceeds the accuracy of previously reported baseline models.

What carries the argument

The dual-stage fusion mechanism that applies late fusion to the outputs of the first network level and intermediate fusion to combine features from both levels.

If this is right

  • Hybrid networks that use both fusion stages can serve as stronger baselines than late-fusion-only hybrids for sensor-based activity tasks.
  • The benefit of adding intermediate fusion appears across multiple base network types, suggesting the gain is not limited to one architecture family.
  • Identifying the best dual-fusion configuration among the fifteen tested supplies a concrete, ready-to-use model for future human activity recognition work.
  • Late fusion alone is shown to be suboptimal once an intermediate fusion path is available.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the pattern holds on new data, designers could favor shallower multilevel stacks with dual fusion over deeper single-stage networks.
  • The same dual-fusion pattern might improve performance in related multi-sensor or multimodal classification problems outside activity recognition.
  • An adaptive version that decides when to apply the intermediate fusion could be tested as a direct next step.

Load-bearing premise

The specific way late and intermediate fusion are combined extracts genuinely complementary information that improves results beyond the two tested datasets.

What would settle it

Running the top dual-fusion architecture on a third, previously unused human-activity dataset and obtaining no accuracy gain over its late-fusion-only counterpart would falsify the central claim.

Figures

Figures reproduced from arXiv: 2604.16577 by Abeer FathAllah Brery, Ascensi\'on Gallardo-Antol\'in, Israel Gonzalez-Carrasco, Mahmoud Fakhry.

Figure 1
Figure 1. Figure 1: Block diagram of the proposed network architecture. Subsequently, the feature maps are processed by a second neural network, which introduces an additional abstraction layer into the learned features. To further refine the extracted features, global average pooling is applied to the outputs of both the first and second networks. This operation produces compact yet informative representations by summarizing… view at source ↗
Figure 2
Figure 2. Figure 2: Block diagram of convolutional LSTM network. 3.4. Global average pooling Global Average Pooling (GAP) is a pooling operation used in convolutional neural networks (CNNs) that reduces the spatial dimensions of feature maps to a single value per channel. Unlike traditional pooling methods, such as max or average pooling, which partially reduce dimensionality, global average pooling collapses each feature map… view at source ↗
Figure 3
Figure 3. Figure 3: Accuracy of different architectures on the raw sensor readings of the dataset USC￾HAD with late feature fusion. 1D CNN LSTM 1D CLSTM 91.5 92 92.5 93 93.5 94 94.5 95 94.4 94 92.1 93.8 93 92.9 93.7 92.4 94.3 93.4 93.2 93.7 92.9 92.3 93 First network Accuracy (%) 1D CNN 2D CNN 1D CLSTM 2D CLSTM LSTM [PITH_FULL_IMAGE:figures/full_fig_p009_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Accuracy of different architectures on the raw sensor readings of the dataset USC￾HAD with late and intermediate feature fusion. Applied Computing and Intelligence Volume 5, Issue 2, xxx–xxx [PITH_FULL_IMAGE:figures/full_fig_p009_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Accuracy of different architectures on the raw sensor readings of the dataset UCI￾HAR with late feature fusion. 1D CNN LSTM 1D CLSTM 82 84 86 88 90 85.7 83.4 83.7 88.5 88.9 87.2 85.5 84.6 83.4 86.8 87.9 86.5 85.7 83.8 84.9 First network Accuracy (%) 1D CNN 2D CNN 1D CLSTM 2D CLSTM LSTM [PITH_FULL_IMAGE:figures/full_fig_p010_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Accuracy of different architectures on the raw sensor readings of the dataset UCI￾HAR with late and intermediate feature fusion. Applied Computing and Intelligence Volume 5, Issue 2, xxx–xxx [PITH_FULL_IMAGE:figures/full_fig_p010_6.png] view at source ↗
read the original abstract

Human activity recognition (HAR) refers to the process of identifying human actions and activities using data collected from sensors. Neural networks, such as convolutional neural networks (CNNs), long short-term memory (LSTM) networks, convolutional LSTM, and their hybrid combinations, have demonstrated exceptional performance in various research domains. Developing a multilevel individual or hybrid model for HAR involves strategically integrating multiple networks to capitalize on their complementary strengths. The structural arrangement of these components is a critical factor influencing the overall performance. This study explores a novel framework of a two-level network architecture with dual-stage feature fusion: late fusion, which combines the outputs from the first network level, and intermediate fusion, which integrates the features from both the first and second levels. We evaluated $15$ different network architectures of CNNs, LSTMs, and convolutional LSTMs, incorporating late fusion with and without intermediate fusion, to identify the optimal configuration. Experimental evaluation on two public benchmark datasets demonstrates that architectures incorporating both late and intermediate fusion achieve higher accuracy than those relying on late fusion alone. Moreover, the optimal configuration outperforms baseline models, thereby validating its effectiveness for HAR.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The manuscript proposes a two-level neural network framework for human activity recognition that integrates dual-stage feature fusion: late fusion of outputs from the first network level and intermediate fusion of features across levels. It evaluates 15 architectures (CNN, LSTM, ConvLSTM and hybrids) with late fusion alone versus late-plus-intermediate fusion on two public benchmark datasets, claiming that the dual-fusion variants achieve higher accuracy and that the best configuration outperforms unspecified baseline models.

Significance. If the empirical superiority is confirmed with complete protocols and controls, the work would indicate that adding an intermediate fusion stage can reliably improve hybrid network performance in sensor-based HAR by exploiting complementary representations. This could guide practical architecture design for activity recognition, though the current absence of methodological transparency limits its immediate utility and generalizability.

major comments (3)
  1. Abstract: the claim that dual-stage fusion yields higher accuracy than late fusion alone is only partially supported because the abstract provides no details on training protocols, hyperparameter selection, statistical tests, error bars, or baseline definitions, leaving the central empirical result unsubstantiated.
  2. Experimental evaluation: testing 15 architectures on fixed benchmarks without reporting all pairwise results or confirming that configurations were pre-specified and selected without test-data access creates a risk of cherry-picking or search bias, directly threatening the validity of the asserted dual-fusion advantage.
  3. Abstract and evaluation: the generalization that the dual-stage strategy captures complementary strengths is weakened by restriction to two datasets and 15 configurations; no cross-dataset validation, ablation on fusion components, or variance analysis is described to rule out dataset-specific tuning or overfitting.
minor comments (1)
  1. Abstract: inline LaTeX notation such as $15$ should be replaced with plain text or properly typeset numbers for readability in all submission formats.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback, which highlights important aspects of methodological transparency and empirical rigor. We address each major comment below and outline the revisions we will make to strengthen the manuscript.

read point-by-point responses
  1. Referee: Abstract: the claim that dual-stage fusion yields higher accuracy than late fusion alone is only partially supported because the abstract provides no details on training protocols, hyperparameter selection, statistical tests, error bars, or baseline definitions, leaving the central empirical result unsubstantiated.

    Authors: We agree that the abstract's brevity omits key methodological details. The full manuscript (Sections 3.3 and 4.1) describes the training protocols (Adam optimizer with cross-entropy loss and early stopping on validation accuracy), hyperparameter selection via grid search on validation sets, and baseline comparisons. We will revise the abstract to include a concise statement on the evaluation setup, such as noting standard train/test splits on public benchmarks and reporting mean accuracies. Statistical significance tests and error bars will be added to the main results tables in the revision. revision: yes

  2. Referee: Experimental evaluation: testing 15 architectures on fixed benchmarks without reporting all pairwise results or confirming that configurations were pre-specified and selected without test-data access creates a risk of cherry-picking or search bias, directly threatening the validity of the asserted dual-fusion advantage.

    Authors: To eliminate any perception of cherry-picking, the revised manuscript will include a complete table with pairwise accuracy results for all 15 architectures under both late-fusion-only and dual-stage fusion conditions. The 15 architectures were pre-specified based on common HAR models in the literature (CNN, LSTM, ConvLSTM hybrids). We will explicitly document that architecture selection and hyperparameter tuning used only training and validation data, with test sets held out until final evaluation. This protocol clarification will be added to the experimental setup section. revision: yes

  3. Referee: Abstract and evaluation: the generalization that the dual-stage strategy captures complementary strengths is weakened by restriction to two datasets and 15 configurations; no cross-dataset validation, ablation on fusion components, or variance analysis is described to rule out dataset-specific tuning or overfitting.

    Authors: We acknowledge the evaluation scope is limited to two datasets. In the revision, we will add an ablation study isolating the intermediate fusion component's contribution. Variance analysis will be included by reporting standard deviations across multiple random seeds. We will also conduct and report cross-dataset validation experiments (training on one dataset and testing on the other) to assess generalizability, while noting potential sensor configuration differences between datasets. These changes will better substantiate the complementary strengths claim. revision: partial

Circularity Check

0 steps flagged

No circularity: purely empirical evaluation without derivations

full rationale

The paper contains no equations, mathematical derivations, or first-principles claims. It describes training and testing 15 neural network configurations (CNN, LSTM, ConvLSTM hybrids) with late and intermediate fusion on two fixed public benchmark datasets, then reports accuracy comparisons. All results are direct experimental outputs rather than quantities defined in terms of fitted parameters or self-referential inputs. No self-citation load-bearing steps, ansatz smuggling, or renaming of known results occur. The work is self-contained as an empirical study.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available; no explicit free parameters, axioms, or invented entities are described. The work implicitly relies on standard deep-learning assumptions such as gradient-based optimization converging to useful solutions.

pith-pipeline@v0.9.0 · 5512 in / 1130 out tokens · 42148 ms · 2026-05-10T09:03:28.924339+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

26 extracted references · 26 canonical work pages · 1 internal anchor

  1. [1]

    S. G. Dhekane, T. Ploetz, Transfer learning in human activity recognition: a survey, arXiv: 2401.10185. https://doi.org/10.48550/arXiv.2401.10185

  2. [2]

    V . Soni, S. Jaiswal, V . B. Semwal, B. Roy, D. K. Choubey, D. K. Mallick, An enhanced deep learning approach for smartphone-based human activity recognition in ioht, InMachine learning, image processing, network security and data sciences: select proceedings of 3rd international conference on MIND 2021, Singapore: Springer, 2023, 505–516. https://doi.org...

  3. [3]

    Saini, A

    S. Saini, A. Juneja, A. Shrivastava, Human activity recognition using deep learning: past, present and future,Proceedings of 1st International Conference on Intelligent Computing and Research Trends (ICRT), 2023, 1–6. https://doi.org/10.1109/ICRT57042.2023.10146621

  4. [4]

    Mekruksavanich, A

    S. Mekruksavanich, A. Jitpattanakul, The deep learning-based human activity recognition using smart wearable sensors: a tutorial,ReBICTE,8(2022), 1. https://doi.org/10.22667/ReBiCTE.2022.02.28.001 Applied Computing and IntelligenceV olume 5, Issue 2, xxx–xxx. 14

  5. [5]

    Ramanujam, Thinagaran Perumal, and S

    E. Ramanujam, T. Perumal, S. Padmavathi, Human activity recognition with smartphone and wearable sensors using deep learning techniques: a review,IEEE Sens. J.,21(2021), 13029– 13040. https://doi.org/10.1109/JSEN.2021.3069927

  6. [6]

    K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition,Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015, 770–778. https://doi.org/10.1109/CVPR.2016.90

  7. [7]

    C. Han, L. Zhang, Y . Tang, W. Huang, F. Min, J. He, Human activity recognition using wearable sensors by heterogeneous convolutional neural networks,Expert Syst. Appl., vol.198(2022), 116764. https://doi.org/10.1016/j.eswa.2022.116764

  8. [8]

    Y . Li, J. Wu, W. Li, A. Fang, W. Dong, Temporal-spatial dynamic convolutional neural network for human activity recognition using wearable sensors,IEEE Trans. Instrum. Meas.,72(2023), 2516912. https://doi.org/10.1109/TIM.2023.3279908

  9. [9]

    J. Sena, J. Barreto, C. Caetano, G. Cramer, W. R. Schwartz, Human activity recognition based on smartphone and wearable sensors using multiscale dcnn ensemble,Neurocomputing,444(2021), 226–243. https://doi.org/10.1016/j.neucom.2020.04.151

  10. [10]

    Huang, W

    Q. Huang, W. Xie, C. Li, Y . Wang, Y . Liu, Human action recognition based on hierarchical multi-scale adaptive conv-long short-term memory network,Appl. Sci.,13(2023), 10560. https://doi.org/10.3390/app131910560

  11. [11]

    Sethi, M

    M. Sethi, M. Yadav, M. Singh, P. G. Shambharkar, Attnhar: human activity recognition using data collected from wearable sensors,Proceedings of 6th International Conference on Information Systems and Computer Networks (ISCON), 2023, 1–6. https://doi.org/10.1109/ISCON57294.2023.10112183

  12. [12]

    S. P. Singh, M. K. Sharma, A. Lay-Ekuakille, D. Gangwar, S. Gupta, Deep convlstm with self- attention for human activity decoding using wearable sensors,IEEE Sens. J.,21(2021), 8575–

  13. [13]

    https://doi.org/10.1109/JSEN.2020.3045135

  14. [14]

    L. Wang, R. Liu, Human activity recognition based on wearable sensor using hierarchical deep lstm networks,Circuits Syst. Signal Process.,39(2020), 837–856. https://doi.org/10.1007/s00034-019-01116-y

  15. [15]

    Ahmad, M

    W. Ahmad, M. Kazmi, H. Ali, Human activity recognition using multi-head cnn followed by lstm,Proceedings of 15th International Conference on Emerging Technologies (ICET), 2019, 1–6. https://doi.org/10.1109/ICET48972.2019.8994412

  16. [16]

    Kolkar, V

    R. Kolkar, V . Geetha, Human activity recognition in smart home using deep learning techniques, Proceedings of 13th International conference on information&communication technology and system (ICTS), 2021, 230–234. https://doi.org/10.1109/ICTS52701.2021.9609044

  17. [17]

    J. X. Goh, K. M. Lim, C. P. Lee, 1d convolutional neural network with long short-term memory for human activity recognition,Proceedings of IEEE International Conference on Artificial Intelligence in Engineering and Technology (IICAIET), 2021, 1–6. https://doi.org/10.1109/IICAIET51634.2021.9573979

  18. [18]

    M. M. Islam, S. Nooruddin, F. Karray, G. Muhammad, Multi-level feature fusion for multimodal human activity recognition in internet of healthcare things,Inform. Fusion,94(2023), 17–31. https://doi.org/10.1016/j.inffus.2023.01.015 Applied Computing and IntelligenceV olume 5, Issue 2, xxx–xxx. 15

  19. [19]

    X. Shi, Z. Chen, H. Wang, D. Y . Yeung, W.-K. Wong, W. chun Woo, Convolutional lstm network: a machine learning approach for precipitation nowcasting,Proceedings of 29th Annual Conference on Neural Information Processing Systems, 2015, 802-810

  20. [20]

    G. Alam, I. McChesney, P. Nicholl, J. Rafferty, Open data sets in human activity recognition research-issues and challenges: a review,IEEE Sens. J.,23(2023), 26952–26980. https://doi.org/10.1109/JSEN.2023.3317645

  21. [21]

    Zhang, A

    M. Zhang, A. A. Sawchuk, Usc-had: a daily activity dataset for ubiquitous activity recognition using wearable sensors,Proceedings of the 2012 ACM Conference on Ubiquitous Computing, 2012, 1036–1043. https://doi.org/10.1145/2370216.2370438

  22. [22]

    Anguita, A

    D. Anguita, A. Ghio, L. Oneto, X. Parra, J. L. Reyes-Ortiz, A public domain dataset for human activity recognition using smartphones,Proceedings of European Symposium on Artificial Neural Networks, Computational Intelligenceand Machine Learning, 2013, 437–442

  23. [23]

    M. M. Islam, S. Nooruddin, F. Karray, G. Muhammad, Human activity recognition using tools of convolutional neural networks: A state of the art review, data sets, challenges, and future prospects,Comput. Biol. Med.,149(2022), 106060. https://doi.org/10.1016/j.compbiomed.2022.106060

  24. [24]

    doi:10.1007/978-3-030-32644-9 , isbn =

    A. Ghosh, A. Sufian, F. Sultana, A. Chakrabarti, D. De, Fundamental concepts of convolutional neural network, In:Recent trends and advances in artificial intelligence and internet of things, Cham: Springer, 2019, 519–567. https://doi.org/10.1007/978-3-030-32644-9 36

  25. [25]

    Long short-term memory

    S. Hochreiter, J. Schmidhuber, Long short-term memory,Neural Comput.,9(1997), 1735–1780. https://doi.org/10.1162/neco.1997.9.8.1735

  26. [26]

    J. Opitz, A closer look at classification evaluation metrics and a critical reflection of common evaluation practice,Transactions of the Association for Computational Linguistics,12(2024), 820–836. https://doi.org/10.1162/tacl a 00675 ©2025 the Author(s), licensee AIMS Press. This is an open access article distributed under the terms of the Creative Commo...