COMODO: Cross-Modal Video-to-IMU Distillation for Efficient Egocentric Human Activity Recognition
Pith reviewed 2026-05-23 01:00 UTC · model grok-4.3
The pith
A frozen video encoder distills semantic knowledge into an IMU encoder via a dynamic instance queue, allowing label-free egocentric activity recognition to match supervised performance.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
COMODO uses a pretrained frozen video encoder to construct a dynamic instance queue that aligns the feature distributions of video and IMU embeddings in a self-supervised manner, enabling the IMU encoder to inherit rich semantic structure from video while remaining efficient for real-world deployment.
What carries the argument
The dynamic instance queue constructed from the frozen video encoder, which aligns IMU embeddings to video semantics without labels.
If this is right
- IMU-based models achieve performance matching or surpassing fully supervised counterparts on multiple egocentric HAR datasets.
- The method yields strong cross-dataset generalization without retraining on target data.
- The framework works with diverse pretrained video encoders and time-series models.
- Label-free training becomes viable for IMU encoders in wearable activity recognition.
Where Pith is reading between the lines
- The same queue-based alignment could be tested on other sensor streams such as audio or pressure data.
- Deployment on resource-constrained wearables would reduce energy draw compared with video pipelines.
- Scaling the teacher to larger video foundation models could further lift IMU performance without additional labeling.
Load-bearing premise
Aligning IMU embeddings to a dynamic instance queue from a frozen pretrained video encoder transfers enough semantic structure to close the performance gap to supervised models.
What would settle it
Training an IMU encoder with COMODO on one egocentric HAR dataset and testing on another where its accuracy falls to the level of a non-distilled baseline IMU model would falsify the transfer claim.
Figures
read the original abstract
The goal of creating intelligent, human-centered wearable systems for continuous activity understanding faces a fundamental trade-off: Egocentric video-based models capture rich semantic information and have demonstrated strong performance in human activity recognition (HAR), but their high power consumption, privacy concerns, and dependence on lighting limit their feasibility for continuous on-device recognition. In contrast, inertial measurement unit (IMU) sensors offer an energy-efficient, privacy-preserving alternative, yet lack large-scale annotated datasets, leading to weaker generalization. To bridge this gap, we propose COMODO, a cross-modal self-supervised distillation framework that transfers semantic knowledge from video to IMU without requiring labels. COMODO leverages a pretrained and frozen video encoder to construct a dynamic instance queue to align the feature distributions of video and IMU embeddings. This enables the IMU encoder to inherit rich semantic structure from video while maintaining its efficiency for real-world applications. Experiments on multiple egocentric HAR datasets show that COMODO consistently improves downstream performance, matching or surpassing fully supervised models, and demonstrating strong cross-dataset generalization. Benefiting from its simplicity and flexibility, COMODO is compatible with diverse pretrained video and time-series models, offering the potential to leverage more powerful teacher and student foundation models in future ubiquitous computing research. The code is available at this repository: https://github.com/cruiseresearchgroup/COMODO.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes COMODO, a cross-modal self-supervised distillation method that aligns IMU embeddings to a dynamic instance queue of features from a frozen pretrained video encoder, enabling label-free transfer of semantic structure for egocentric human activity recognition (HAR). The central claim is that this yields consistent downstream performance gains on multiple datasets, matching or surpassing fully supervised IMU models while demonstrating strong cross-dataset generalization; the approach is presented as compatible with various video and time-series backbones, with code released.
Significance. If the core transfer mechanism holds without implicit supervision, the work would offer a practical route to leverage large-scale video pretraining for efficient, privacy-preserving IMU-based HAR in wearable systems. The explicit release of code supports reproducibility and future extensions to stronger foundation models.
major comments (3)
- [§3] §3 (Method, dynamic instance queue and contrastive alignment): the positive-pair selection rule for the queue-based loss is not specified. If positives are drawn from temporally synchronized video-IMU recordings (standard in egocentric datasets), the procedure uses implicit instance-level correspondence even without activity labels; if positives are chosen without any instance link, the alignment reduces to marginal distribution matching whose utility for activity discriminability is not guaranteed. This choice is load-bearing for the claim of label-free semantic transfer.
- [§4] §4 (Experiments): the abstract and method claim matching or surpassing fully supervised models and strong cross-dataset generalization, yet the provided description supplies no quantitative metrics, baseline definitions, dataset sizes, exclusion criteria, or statistical details (error bars, significance tests). Without these, the support for the central performance claim cannot be verified from the manuscript.
- [§3.2] §3.2 (queue construction): the hyperparameters governing instance queue size and update rate are listed as free parameters; the paper should report sensitivity analysis or default values used across all reported experiments, as these directly affect the alignment quality underlying the reported gains.
minor comments (2)
- [Abstract] Abstract: states performance gains without any numbers or references to tables/figures; move at least one key quantitative result (e.g., accuracy delta on a named dataset) into the abstract for immediate clarity.
- [§3] Notation: the distinction between video embedding queue and IMU embedding space should be made explicit with consistent symbols (e.g., v_q vs. i) to avoid reader confusion in the alignment equations.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address each major comment below and will revise the paper accordingly to improve clarity and completeness.
read point-by-point responses
-
Referee: [§3] §3 (Method, dynamic instance queue and contrastive alignment): the positive-pair selection rule for the queue-based loss is not specified. If positives are drawn from temporally synchronized video-IMU recordings (standard in egocentric datasets), the procedure uses implicit instance-level correspondence even without activity labels; if positives are chosen without any instance link, the alignment reduces to marginal distribution matching whose utility for activity discriminability is not guaranteed. This choice is load-bearing for the claim of label-free semantic transfer.
Authors: We appreciate this observation. In COMODO, positive pairs are formed from temporally synchronized video-IMU segments recorded in the same session (standard for egocentric datasets such as Ego4D and others). This provides instance-level correspondence without requiring activity class labels, allowing the contrastive alignment to transfer semantic structure from the frozen video encoder. The method is therefore label-free with respect to semantic annotations while leveraging the natural multimodal pairing present in the data. We will explicitly state this positive-pair construction rule in the revised Section 3. revision: yes
-
Referee: [§4] §4 (Experiments): the abstract and method claim matching or surpassing fully supervised models and strong cross-dataset generalization, yet the provided description supplies no quantitative metrics, baseline definitions, dataset sizes, exclusion criteria, or statistical details (error bars, significance tests). Without these, the support for the central performance claim cannot be verified from the manuscript.
Authors: The full manuscript contains quantitative results in multiple tables (including comparisons to supervised IMU baselines and cross-dataset transfer), along with dataset descriptions. However, we acknowledge that the narrative in Section 4 could be expanded for better verifiability. In the revision we will add explicit statements of all metrics, baseline definitions, dataset sizes, any exclusion criteria, and statistical details (standard deviations and significance tests where applicable) directly in the main text. revision: yes
-
Referee: [§3.2] §3.2 (queue construction): the hyperparameters governing instance queue size and update rate are listed as free parameters; the paper should report sensitivity analysis or default values used across all reported experiments, as these directly affect the alignment quality underlying the reported gains.
Authors: We agree that these hyperparameters warrant explicit reporting. Across all experiments we used a queue size of 4096 and an update rate of 0.1 as defaults. We will add both the default values and a sensitivity analysis (varying queue size and update rate) to the revised Section 3.2 or supplementary material to demonstrate robustness. revision: yes
Circularity Check
No significant circularity; derivation is self-contained
full rationale
The paper describes an empirical cross-modal distillation method that aligns IMU embeddings to a dynamic queue from a frozen external pretrained video encoder using contrastive-style losses. No equations, derivations, or claims in the provided abstract reduce any result to fitted parameters by construction, self-citations, or renamed inputs. Performance gains are reported from experiments on external datasets rather than mathematical identities. The approach relies on standard self-supervised alignment techniques and independent pretrained models, with no load-bearing steps that loop back to the paper's own inputs or prior self-authored results.
Axiom & Free-Parameter Ledger
free parameters (1)
- instance queue size and update rate
axioms (1)
- domain assumption Features from a pretrained video encoder contain semantic information transferable to IMU time-series via distribution alignment
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.leanwashburn_uniqueness_aczel unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
COMODO leverages a pretrained and frozen video encoder to construct a dynamic instance queue, aligning the feature distributions of video and IMU embeddings... L_CE = −∑ P_v(i|Q) log P_x(i|Q)
-
IndisputableMonolith/Foundation/RealityFromDistinction.leanreality_from_one_distinction unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
FIFO queue Q maintains a large pool of video teacher embeddings... softmax over inner product
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 1 Pith paper
-
AnyMo: Geometry-Aware Setup-Agnostic Modeling of Human Motion in the Wild
AnyMo uses physics-grounded IMU simulation over dense body placements, graph encoder pre-training, and LLM alignment to enable setup-agnostic motion modeling, reporting gains on zero-shot HAR, retrieval, and captionin...
Reference graph
Works this paper leans on
-
[1]
A public domain dataset for human activity recognition using smartphones
Davide Anguita, Alessandro Ghio, Luca Oneto, Xavier Parra, Jorge Luis Reyes-Ortiz, et al. A public domain dataset for human activity recognition using smartphones. In Esann, pages 3–4, 2013. 1
work page 2013
-
[2]
Is space-time attention all you need for video understanding? In ICML, page 4, 2021
Gedas Bertasius, Heng Wang, and Lorenzo Torresani. Is space-time attention all you need for video understanding? In ICML, page 4, 2021. 1, 5, 7
work page 2021
-
[3]
Activitynet: A large-scale video benchmark for human activity understanding
Fabian Caba Heilbron, Victor Escorcia, Bernard Ghanem, and Juan Carlos Niebles. Activitynet: A large-scale video benchmark for human activity understanding. In Proceed- ings of the ieee conference on computer vision and pattern recognition, pages 961–970, 2015. 1
work page 2015
-
[4]
Emerg- ing properties in self-supervised vision transformers
Mathilde Caron, Hugo Touvron, Ishan Misra, Herv ´e J´egou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerg- ing properties in self-supervised vision transformers. In Pro- ceedings of the IEEE/CVF international conference on com- puter vision, pages 9650–9660, 2021. 2, 3
work page 2021
-
[5]
Quo vadis, action recognition? a new model and the kinetics dataset
Joao Carreira and Andrew Zisserman. Quo vadis, action recognition? a new model and the kinetics dataset. In pro- ceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6299–6308, 2017. 1, 5
work page 2017
-
[6]
The opportunity challenge: A bench- mark database for on-body sensor-based activity recognition
Ricardo Chavarriaga, Hesam Sagha, Alberto Calatroni, Sun- dara Tejaswi Digumarti, Gerhard Tr¨oster, Jos´e del R Mill´an, and Daniel Roggen. The opportunity challenge: A bench- mark database for on-body sensor-based activity recognition. Pattern Recognition Letters, 34(15):2033–2042, 2013. 1
work page 2033
-
[7]
A simple framework for contrastive learning of visual representations
Ting Chen, Simon Kornblith, Mohammad Norouzi, and Ge- offrey Hinton. A simple framework for contrastive learning of visual representations. In International conference on ma- chine learning, pages 1597–1607. PmLR, 2020. 5
work page 2020
-
[8]
Cocoa: Cross modality contrastive learn- ing for sensor data
Shohreh Deldari, Hao Xue, Aaqib Saeed, Daniel V Smith, and Flora D Salim. Cocoa: Cross modality contrastive learn- ing for sensor data. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies, 6(3):1–28,
-
[9]
Crossl: Cross-modal self-supervised learning for time-series through latent masking
Shohreh Deldari, Dimitris Spathis, Mohammad Malekzadeh, Fahim Kawsar, Flora D Salim, and Akhil Mathur. Crossl: Cross-modal self-supervised learning for time-series through latent masking. In Proceedings of the 17th ACM Interna- tional Conference on Web Search and Data Mining , pages 152–160, 2024. 2
work page 2024
-
[10]
Seed: Self-supervised dis- tillation for visual representation
Zhiyuan Fang, Jianfeng Wang, Lijuan Wang, Lei Zhang, Yezhou Yang, and Zicheng Liu. Seed: Self-supervised dis- tillation for visual representation. International Conference on Learning Representations, 2021. 2, 3
work page 2021
-
[11]
Mantis: Lightweight calibrated foundation model for user-friendly time series classification
Vasilii Feofanov, Songkang Wen, Marius Alonso, Romain Ilbert, Hongbo Guo, Malik Tiomoko, Lujia Pan, Jianfeng Zhang, and Ievgen Redko. Mantis: Lightweight calibrated foundation model for user-friendly time series classification. arXiv preprint arXiv:2502.15637, 2025. 5, 6, 7
-
[12]
Jean-Yves Franceschi, Aymeric Dieuleveut, and Martin Jaggi. Unsupervised scalable representation learning for multivariate time series.Advances in neural information pro- cessing systems, 32, 2019. 4
work page 2019
-
[13]
Mmtsa: Multi- modal temporal segment attention network for efficient hu- man activity recognition
Ziqi Gao, Yuntao Wang, Jianguo Chen, Junliang Xing, Shwetak Patel, Xin Liu, and Yuanchun Shi. Mmtsa: Multi- modal temporal segment attention network for efficient hu- man activity recognition. Proceedings of the ACM on Inter- active, Mobile, Wearable and Ubiquitous Technologies, 7(3): 1–26, 2023. 3
work page 2023
-
[14]
Distilla- tion multiple choice learning for multimodal action recogni- tion
Nuno Cruz Garcia, Sarah Adel Bargal, Vitaly Ablavsky, Pietro Morerio, Vittorio Murino, and Stan Sclaroff. Distilla- tion multiple choice learning for multimodal action recogni- tion. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 2755–2764, 2021. 2
work page 2021
-
[15]
Mmg-ego4d: Multimodal generalization in egocentric action recognition
Xinyu Gong, Sreyas Mohan, Naina Dhingra, Jean-Charles Bazin, Yilei Li, Zhangyang Wang, and Rakesh Ranjan. Mmg-ego4d: Multimodal generalization in egocentric action recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages 6481– 6491, 2023. 3
work page 2023
-
[16]
Moment: a fam- ily of open time-series foundation models
Mononito Goswami, Konrad Szafer, Arjun Choudhry, Yifu Cai, Shuo Li, and Artur Dubrawski. Moment: a fam- ily of open time-series foundation models. In Proceedings of the 41st International Conference on Machine Learning . JMLR.org, 2024. 4, 5, 6, 7
work page 2024
-
[17]
The” something something” video database for learning and evaluating visual common sense
Raghav Goyal, Samira Ebrahimi Kahou, Vincent Michal- ski, Joanna Materzynska, Susanne Westphal, Heuna Kim, Valentin Haenel, Ingo Fruend, Peter Yianilos, Moritz Mueller-Freitag, et al. The” something something” video database for learning and evaluating visual common sense. In Proceedings of the IEEE international conference on com- puter vision, pages 584...
work page 2017
-
[18]
Ego4d: Around the world in 3,000 hours of egocentric video
Kristen Grauman, Andrew Westbury, Eugene Byrne, Zachary Chavis, Antonino Furnari, Rohit Girdhar, Jackson Hamburger, Hao Jiang, Miao Liu, Xingyu Liu, et al. Ego4d: Around the world in 3,000 hours of egocentric video. In Proceedings of the IEEE/CVF conference on computer vi- sion and pattern recognition, pages 18995–19012, 2022. 4
work page 2022
-
[19]
Ego-exo4d: Understanding skilled human activity from first-and third-person perspectives
Kristen Grauman, Andrew Westbury, Lorenzo Torresani, Kris Kitani, Jitendra Malik, Triantafyllos Afouras, Kumar Ashutosh, Vijay Baiyya, Siddhant Bansal, Bikram Boote, et al. Ego-exo4d: Understanding skilled human activity from first-and third-person perspectives. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19...
work page 2024
-
[20]
MiniLLM: Knowledge distillation of large language models
Yuxian Gu, Li Dong, Furu Wei, and Minlie Huang. MiniLLM: Knowledge distillation of large language models. In The Twelfth International Conference on Learning Repre- sentations, 2024. 7
work page 2024
-
[21]
Harish Haresamudram, Chi Ian Tang, Sungho Suh, Paul Lukowicz, and Thomas Ploetz. Past, present, and future of sensor-based human activity recognition using wearables: A surveying tutorial on a still challenging task. arXiv preprint arXiv:2411.14452, 2024. 2 9
-
[22]
Multimodal cross-domain few-shot learning for egocentric action recognition
Masashi Hatano, Ryo Hachiuma, Ryo Fujii, and Hideo Saito. Multimodal cross-domain few-shot learning for egocentric action recognition. In European Conference on Computer Vision, pages 182–199. Springer, 2024. 3
work page 2024
-
[23]
Momentum contrast for unsupervised visual rep- resentation learning
Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. Momentum contrast for unsupervised visual rep- resentation learning. In Proceedings of the IEEE/CVF con- ference on computer vision and pattern recognition , pages 9729–9738, 2020. 2, 3
work page 2020
-
[24]
Zhiqing Hong, Zelong Li, Shuxin Zhong, Wenjun Lyu, Hao- tian Wang, Yi Ding, Tian He, and Desheng Zhang. Crosshar: Generalizing cross-dataset human activity recognition via hi- erarchical self-supervised pretraining. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Tech- nologies, 8(2):1–26, 2024. 3
work page 2024
-
[25]
Hyeokhyen Kwon, Catherine Tong, Harish Haresamudram, Yan Gao, Gregory D Abowd, Nicholas D Lane, and Thomas Ploetz. Imutube: Automatic extraction of virtual on-body ac- celerometry from video for human activity recognition. Pro- ceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies, 4(3):1–29, 2020. 3
work page 2020
-
[26]
Imugpt 2.0: Language-based cross modality transfer for sensor-based human activity recognition
Zikang Leng, Amitrajit Bhattacharjee, Hrudhai Rajasekhar, Lizhe Zhang, Elizabeth Bruda, Hyeokhyen Kwon, and Thomas Pl¨otz. Imugpt 2.0: Language-based cross modality transfer for sensor-based human activity recognition. Pro- ceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies, 8(3):1–32, 2024. 3
work page 2024
-
[27]
Unmasked teacher: Towards training-efficient video foundation models
Kunchang Li, Yali Wang, Yizhuo Li, Yi Wang, Yinan He, Limin Wang, and Yu Qiao. Unmasked teacher: Towards training-efficient video foundation models. In Proceedings of the IEEE/CVF International Conference on Computer Vi- sion, pages 19948–19960, 2023. 1
work page 2023
-
[28]
Sensorllm: Aligning large language models with motion sensors for human activity recognition
Zechen Li, Shohreh Deldari, Linyao Chen, Hao Xue, and Flora D Salim. Sensorllm: Aligning large language models with motion sensors for human activity recognition. arXiv preprint arXiv:2410.10624, 2024. 3
-
[29]
Congen: Unsupervised control and generalization distillation for sentence represen- tation
Peerat Limkonchotiwat, Wuttikorn Ponwitayarat, Lalita Lowphansirikul, Can Udomcharoenchaikit, Ekapol Chuang- suwanich, and Sarana Nutanong. Congen: Unsupervised control and generalization distillation for sentence represen- tation. In Findings of the Association for Computational Lin- guistics: EMNLP 2022, pages 6467–6480, 2022. 2, 3, 7
work page 2022
-
[30]
Spatial- temporal masked autoencoder for multi-device wearable hu- man activity recognition
Shenghuan Miao, Ling Chen, and Rong Hu. Spatial- temporal masked autoencoder for multi-device wearable hu- man activity recognition. Proceedings of the ACM on Inter- active, Mobile, Wearable and Ubiquitous Technologies, 7(4): 1–25, 2024. 3
work page 2024
-
[31]
Imu2clip: Language-grounded motion sensor translation with multi- modal contrastive learning
Seungwhan Moon, Andrea Madotto, Zhaojiang Lin, Apara- jita Saraf, Amy Bearman, and Babak Damavandi. Imu2clip: Language-grounded motion sensor translation with multi- modal contrastive learning. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 13246– 13253, 2023. 2, 5, 6
work page 2023
-
[32]
Cross-modal knowledge distillation for vision- to-sensor action recognition
Jianyuan Ni, Raunak Sarbajna, Yang Liu, Anne HH Ngu, and Yan Yan. Cross-modal knowledge distillation for vision- to-sensor action recognition. In ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 4448–4452. IEEE, 2022. 2
work page 2022
-
[33]
Multimodal distillation for egocentric action recognition
Gorjan Radevski, Dusan Grujicic, Matthew Blaschko, Marie-Francine Moens, and Tinne Tuytelaars. Multimodal distillation for egocentric action recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vi- sion, pages 5213–5224, 2023. 2
work page 2023
-
[34]
Learning transferable visual models from natural language supervi- sion
Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervi- sion. In International conference on machine learning, pages 8748–8763. PmLR, 2021. 2
work page 2021
-
[35]
Introducing a new bench- marked dataset for activity monitoring
Attila Reiss and Didier Stricker. Introducing a new bench- marked dataset for activity monitoring. In 2012 16th inter- national symposium on wearable computers, pages 108–109. IEEE, 2012. 1
work page 2012
-
[36]
Fitnets: Hints for thin deep nets
Adriana Romero, Nicolas Ballas, Samira Ebrahimi Kahou, Antoine Chassang, Carlo Gatta, and Yoshua Bengio. Fitnets: Hints for thin deep nets. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, 2015. 8
work page 2015
-
[37]
Egodistill: Egocentric head motion distillation for efficient video understanding
Shuhan Tan, Tushar Nagarajan, and Kristen Grauman. Egodistill: Egocentric head motion distillation for efficient video understanding. Advances in Neural Information Pro- cessing Systems, 36:33485–33498, 2023. 2
work page 2023
-
[38]
Con- trastive representation distillation
Yonglong Tian, Dilip Krishnan, and Phillip Isola. Con- trastive representation distillation. In International Confer- ence on Learning Representations, 2020. 2
work page 2020
-
[39]
Zero-shot learning for imu-based activity recognition using video em- beddings
Catherine Tong, Jinchen Ge, and Nicholas D Lane. Zero-shot learning for imu-based activity recognition using video em- beddings. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies, 5(4):1–23, 2021. 2
work page 2021
-
[40]
Videomae: Masked autoencoders are data-efficient learners for self-supervised video pre-training
Zhan Tong, Yibing Song, Jue Wang, and Limin Wang. Videomae: Masked autoencoders are data-efficient learners for self-supervised video pre-training. Advances in neural information processing systems, 35:10078–10093, 2022. 1, 7
work page 2022
-
[41]
Timesnet: Temporal 2d- variation modeling for general time series analysis
Haixu Wu, Tengge Hu, Yong Liu, Hang Zhou, Jianmin Wang, and Mingsheng Long. Timesnet: Temporal 2d- variation modeling for general time series analysis. In In- ternational Conference on Learning Representations , 2023. 5, 6
work page 2023
-
[42]
Linfeng Xu, Qingbo Wu, Lili Pan, Fanman Meng, Hongliang Li, Chiyuan He, Hanxin Wang, Shaoxu Cheng, and Yu Dai. Towards continual egocentric activity recognition: A multi-modal egocentric activity dataset for continual learn- ing. IEEE Transactions on Multimedia , 26:2430–2443,
-
[43]
Multimodal knowledge expansion
Zihui Xue, Sucheng Ren, Zhengqi Gao, and Hang Zhao. Multimodal knowledge expansion. In Proceedings of the IEEE/CVF International Conference on Computer Vision , pages 854–863, 2021. 2
work page 2021
-
[44]
The modality focusing hypothesis: Towards understanding cross- modal knowledge distillation
Zihui Xue, Zhengqi Gao, Sucheng Ren, and Hang Zhao. The modality focusing hypothesis: Towards understanding cross- modal knowledge distillation. In ICLR, 2023. 2
work page 2023
-
[45]
Ts2vec: To- wards universal representation of time series
Zhihan Yue, Yujing Wang, Juanyong Duan, Tianmeng Yang, Congrui Huang, Yunhai Tong, and Bixiong Xu. Ts2vec: To- wards universal representation of time series. In Proceed- 10 ings of the AAAI conference on artificial intelligence , pages 8980–8987, 2022. 4
work page 2022
-
[46]
Ailing Zeng, Muxi Chen, Lei Zhang, and Qiang Xu. Are transformers effective for time series forecasting? In Pro- ceedings of the AAAI conference on artificial intelligence , pages 11121–11128, 2023. 5, 6
work page 2023
-
[47]
Usc-had: A daily ac- tivity dataset for ubiquitous activity recognition using wear- able sensors
Mi Zhang and Alexander A Sawchuk. Usc-had: A daily ac- tivity dataset for ubiquitous activity recognition using wear- able sensors. In Proceedings of the 2012 ACM conference on ubiquitous computing, pages 1036–1043, 2012. 1
work page 2012
-
[48]
Masked video and body-worn imu autoencoder for egocentric action recognition
Mingfang Zhang, Yifei Huang, Ruicong Liu, and Yoichi Sato. Masked video and body-worn imu autoencoder for egocentric action recognition. In European Conference on Computer Vision, pages 312–330. Springer, 2024. 3
work page 2024
-
[49]
Informer: Beyond efficient transformer for long sequence time-series forecast- ing
Haoyi Zhou, Shanghang Zhang, Jieqi Peng, Shuai Zhang, Jianxin Li, Hui Xiong, and Wancai Zhang. Informer: Beyond efficient transformer for long sequence time-series forecast- ing. In Proceedings of the AAAI conference on artificial in- telligence, pages 11106–11115, 2021. 5, 6 11
work page 2021
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.