Modular Retrieval-Augmented Generalization for Human Action Recognition
Pith reviewed 2026-05-12 00:52 UTC · model grok-4.3
The pith
A plug-in retrieval module for motion signals improves accuracy in IMU-based human activity recognition models.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
MoRA is presented as the first retrieval-augmented module designed specifically for motion series that integrates flexibly into any existing HAR model. The module counters information redundancy and rigid fusion by means of an uncertainty-adaptive fusion unit that uses prior physical knowledge from IMU signals to dynamically balance original model outputs against retrieved sequences. Experiments across ten real-world datasets establish that this produces consistent, stable performance gains for baseline models while keeping inference efficient.
What carries the argument
The uncertainty-adaptive fusion unit inside MoRA, which uses physical IMU knowledge to dynamically adjust the weighting between original outputs and retrieved motion information.
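The weighting idea can be illustrated with a small sketch. This is not the paper's fusion unit: the review does not give its equations, so the sketch assumes normalized predictive entropy as the uncertainty signal in place of the physical IMU knowledge MoRA reportedly uses. The names `entropy` and `fuse` are hypothetical.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a probability vector."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def fuse(model_probs, retrieved_probs):
    """Blend two class distributions, leaning on retrieval when the model is unsure.

    ASSUMPTION: uncertainty is the model's normalized predictive entropy.
    The actual MoRA unit is described as using physical IMU knowledge,
    which this sketch does not attempt to reproduce.
    """
    n_classes = len(model_probs)  # must be at least 2
    # Retrieval weight in [0, 1]: 0 for a one-hot output, 1 for a uniform output.
    w = entropy(model_probs) / math.log(n_classes)
    fused = [(1.0 - w) * m + w * r for m, r in zip(model_probs, retrieved_probs)]
    total = sum(fused)
    return [f / total for f in fused], w


# Hypothetical inputs: retrieved neighbours all vote for class 1.
confident, _ = fuse([0.98, 0.01, 0.01], [0.0, 1.0, 0.0])
unsure, _ = fuse([0.40, 0.35, 0.25], [0.0, 1.0, 0.0])
```

With a confident model output the fused prediction stays with the model's class; with a near-uniform output the retrieved vote dominates. A fixed weight `w` would correspond to the rigid fusion the paper says it wants to avoid.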
Load-bearing premise
The argument rests on two assumptions: that retrieved motion sequences supply useful complementary information without introducing excessive redundancy, and that the uncertainty-adaptive fusion unit can reliably adjust the combination using IMU physical knowledge without adding errors of its own.
What would settle it
If integrating MoRA into a baseline HAR model on any of the ten datasets produced no accuracy increase, or an accuracy decrease, the claim of consistent gains would be falsified.
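Such a check amounts to a per-dataset accuracy comparison. The sketch below assumes predictions from a baseline and from the same baseline with MoRA attached are already available on a held-out split; the function names and the toy data are hypothetical.

```python
def accuracy(preds, labels):
    """Fraction of predictions that match the ground-truth labels."""
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)


def gains(results):
    """Map each dataset name to (augmented accuracy - baseline accuracy).

    `results` maps a dataset name to a tuple
    (baseline_preds, augmented_preds, labels).
    """
    return {
        name: accuracy(aug, y) - accuracy(base, y)
        for name, (base, aug, y) in results.items()
    }


def claim_survives(results):
    """The 'consistent gains' claim holds only if every dataset improves."""
    return all(delta > 0 for delta in gains(results).values())


# Toy, made-up predictions for two hypothetical datasets.
toy = {
    "dataset_a": ([0, 1, 1, 0], [0, 1, 0, 0], [0, 1, 0, 0]),  # improves
    "dataset_b": ([1, 1, 0, 0], [1, 1, 0, 0], [1, 0, 0, 0]),  # no change
}
```

On the toy input, `claim_survives(toy)` is false because `dataset_b` shows no gain, which is the kind of outcome that would count against the paper's claim.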
Original abstract
Inertial Measurement Unit (IMU)-based Human Activity Recognition (HAR) aims to interpret and classify user behaviors from temporal motion signals. Recently, deep learning frameworks have advanced this task by learning and extracting discriminative spatiotemporal representations, significantly improving recognition performance. However, IMU-based HAR still faces several critical challenges, particularly limited training samples and static knowledge utilization, both of which severely hinder its large-scale deployment. In this paper, we introduce MoRA, the first Retrieval-Augmented Module specifically designed for motion series. It can be flexibly integrated into any existing HAR model, enhancing recognition performance while maintaining inference efficiency. To address issues such as information redundancy in retrieval results and rigid fusion strategies, we propose an uncertainty-adaptive fusion unit within MoRA. This unit leverages previous physical knowledge from IMU signals to dynamically adjust the fusion strategy between original outputs and retrieved information, enabling more robust recognition. Extensive experiments on ten real-world datasets demonstrate that MoRA significantly improves the performance of existing IMU-based HAR models, consistently delivering stable and effective gains. The source code of MoRA is available at: https://github.com/liavonpenn/mora.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes MoRA, a modular retrieval-augmented module for IMU-based Human Activity Recognition (HAR). It can be plugged into existing deep learning HAR models, retrieves relevant motion series from a database, and employs an uncertainty-adaptive fusion unit that uses physical IMU signal knowledge to dynamically balance the original model output against retrieved information. The central claim is that this yields consistent, stable performance gains across ten real-world datasets while preserving inference efficiency; source code is released.
Significance. If the performance improvements are shown to arise from genuine complementary retrieval and adaptive fusion rather than artifacts, MoRA would represent a practical, model-agnostic enhancement for data-limited IMU-HAR settings. The modular design and public code release are clear strengths that aid reproducibility and adoption. The work addresses real challenges of limited samples and static knowledge but requires stronger empirical grounding to realize its potential impact.
major comments (2)
- [Method] Method section (retrieval database construction): The protocol for populating the motion-series retrieval database relative to train/test splits is not specified. IMU-HAR datasets are typically small and subject-specific; without explicit isolation (e.g., database built solely from training subjects/sequences), retrieved items may leak subject identity or activity patterns, which could explain the reported gains instead of the uncertainty-adaptive fusion mechanism.
- [Experiments] Experiments section (results and ablations): The manuscript reports gains on ten datasets but provides no ablation studies isolating the contribution of the uncertainty-adaptive fusion unit, no statistical significance tests across runs or datasets, and insufficient detail on baseline implementations, exact fusion mechanics, or hyperparameter choices. This leaves the central claim of 'stable and effective gains' difficult to verify independently.
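The isolation asked for in the first major comment can be checked mechanically. The sketch below assumes each IMU window carries a subject identifier; the tuple layout and the names `split_by_subject` and `assert_no_leak` are hypothetical and are not taken from the MoRA repository.

```python
def split_by_subject(windows, test_subjects):
    """Partition windows so the retrieval database holds training subjects only.

    `windows` is a list of (subject_id, label, signal) tuples.
    Returns (database, queries): database windows come from subjects outside
    `test_subjects`, query windows from subjects inside it.
    """
    held_out = set(test_subjects)
    database = [w for w in windows if w[0] not in held_out]
    queries = [w for w in windows if w[0] in held_out]
    return database, queries


def assert_no_leak(database, queries):
    """Raise if any subject appears in both the database and the query set."""
    shared = {w[0] for w in database} & {w[0] for w in queries}
    if shared:
        raise ValueError(f"subjects present on both sides: {sorted(shared)}")


# Toy windows with made-up subjects and one-sample signals.
windows = [
    ("s1", "walk", [0.1]),
    ("s2", "run", [0.2]),
    ("s3", "walk", [0.3]),
]
database, queries = split_by_subject(windows, test_subjects=["s3"])
assert_no_leak(database, queries)
```

Running `assert_no_leak` on a database built by some other route (for example, a prebuilt index shipped with released code) is where the guard earns its keep, since a split by window rather than by subject would trip it.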
minor comments (2)
- [Abstract] Abstract: The ten datasets are not named; explicitly listing them (e.g., in parentheses) would improve immediate clarity for readers.
- [Figure 2] Figure 2 (fusion unit diagram): The uncertainty estimation pathway from IMU signals lacks explicit labels or equations, making the dynamic adjustment process harder to follow.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback on our manuscript. The comments highlight important aspects of clarity and empirical rigor that we agree will strengthen the work. Below we provide point-by-point responses to the major comments and indicate the revisions we will make.
- Referee: [Method] Method section (retrieval database construction): The protocol for populating the motion-series retrieval database relative to train/test splits is not specified. IMU-HAR datasets are typically small and subject-specific; without explicit isolation (e.g., database built solely from training subjects/sequences), retrieved items may leak subject identity or activity patterns, which could explain the reported gains instead of the uncertainty-adaptive fusion mechanism.
  Authors: We appreciate this critical observation regarding potential data leakage. In the implementation underlying all reported results, the retrieval database was constructed exclusively from training subjects and sequences for each dataset, with no overlap to validation or test splits; this was enforced to prevent subject-specific or activity-pattern leakage. However, we acknowledge that the manuscript did not state this protocol explicitly in Section 3. We will revise the method section to include a clear description of the split protocol, a diagram of the data partitioning, and pseudocode for database construction. The released source code already implements this isolation, and we will add documentation confirming it.
  revision: yes
- Referee: [Experiments] Experiments section (results and ablations): The manuscript reports gains on ten datasets but provides no ablation studies isolating the contribution of the uncertainty-adaptive fusion unit, no statistical significance tests across runs or datasets, and insufficient detail on baseline implementations, exact fusion mechanics, or hyperparameter choices. This leaves the central claim of 'stable and effective gains' difficult to verify independently.
  Authors: We agree that the experimental section would benefit from greater transparency and additional analyses. In the revised manuscript we will add: (1) ablation studies that isolate the uncertainty-adaptive fusion unit by comparing MoRA against variants using fixed-weight fusion, retrieval without fusion, and no retrieval; (2) statistical significance testing (paired t-tests with p-values and standard deviations over five random seeds) for all reported gains; and (3) expanded details on baseline re-implementations, the exact equations for the uncertainty-adaptive fusion, and a comprehensive hyperparameter table. These will appear in the main text and an extended supplementary material to enable independent verification.
  revision: yes
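The paired test promised in the second response can be sketched with the standard library alone. The accuracies below are made-up placeholders, not results from the paper, and the function name `paired_t` is hypothetical. Instead of computing a p-value, the sketch compares the statistic with 2.776, the two-sided 5% critical value of Student's t with 4 degrees of freedom (five seeds).

```python
import math
import statistics

# Two-sided 5% critical value of Student's t with 4 degrees of freedom.
T_CRIT_DF4 = 2.776


def paired_t(baseline, augmented):
    """Paired t statistic for per-seed accuracies of two matched runs.

    Returns (t, mean_gain, std_gain). The sample standard deviation must be
    non-zero, i.e. the per-seed gains must not all be identical.
    """
    diffs = [a - b for a, b in zip(augmented, baseline)]
    mean_gain = statistics.mean(diffs)
    std_gain = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    t = mean_gain / (std_gain / math.sqrt(len(diffs)))
    return t, mean_gain, std_gain


# Placeholder accuracies over five seeds (hypothetical values).
baseline = [0.81, 0.80, 0.82, 0.79, 0.81]
augmented = [0.84, 0.82, 0.85, 0.82, 0.83]

t, mean_gain, std_gain = paired_t(baseline, augmented)
significant = abs(t) > T_CRIT_DF4
```

Pairing by seed matters here: each seed fixes the split and initialization for both runs, so the test looks at per-seed gains instead of comparing two pooled means. Where SciPy is available, `scipy.stats.ttest_rel` returns the same statistic together with a p-value.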
Circularity Check
No circularity: empirical module evaluated on external datasets
The paper presents MoRA as a plug-in retrieval module with an uncertainty-adaptive fusion unit, supported solely by experimental results across ten datasets. No equations, derivations, or first-principles claims appear that reduce performance gains to fitted parameters or self-referential definitions. The approach is described as an empirical augmentation grounded in signal properties and retrieval, with no load-bearing self-citations or ansatzes that collapse the central claim into its inputs by construction. This is the standard non-circular outcome for a modular empirical contribution.