pith.

arxiv: 2605.14014 · v2 · pith:3XSV7YHW · submitted 2026-05-13 · 💻 cs.LG · cs.AI

Dywave: Event-Aligned Dynamic Tokenization for Heterogeneous IoT Sensing Signals

Pith reviewed 2026-05-20 20:31 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords: dynamic tokenization · wavelet decomposition · IoT sensing signals · event alignment · sequence models · activity recognition · computational efficiency · temporal boundaries

The pith

Dywave aligns IoT sensing tokens to semantic events via wavelet decomposition, cutting token lengths by up to 75% while raising accuracy by up to 12%.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper presents Dywave, a dynamic tokenization method designed for the non-stationary and multi-scale signals collected by IoT sensors. Standard fixed or uniform tokenization often fails to respect the natural temporal structures and physical events in these signals, which limits accuracy in tasks such as activity recognition and stress assessment. Dywave instead uses wavelet-based hierarchical decomposition to locate meaningful boundaries, then compresses redundant intervals while keeping temporal coherence intact. The result is shorter input sequences that still carry the essential information, leading to measurable gains in both performance and speed when fed to mainstream sequence models. A sympathetic reader would care because IoT deployments generate continuous heterogeneous data streams, and more efficient tokenization could make real-time analysis practical on devices with limited compute and power.

Core claim

Dywave constructs compact input representations aligned with intrinsic temporal structures and underlying physical events. It leverages wavelet-based hierarchical decomposition to identify meaningful temporal boundaries corresponding to underlying semantic events and adaptively compresses redundant intervals while preserving temporal coherence. Evaluations across five real-world IoT sensing datasets for activity recognition, stress assessment, and nearby object detection show that the resulting tokens improve accuracy by up to 12% and reduce input lengths by up to 75% when used with mainstream sequence models, while also increasing robustness to domain shifts and varying sequence lengths.

What carries the argument

Wavelet-based hierarchical decomposition that detects event-aligned temporal boundaries in heterogeneous sensing signals to enable adaptive compression of redundant intervals.
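The following is a minimal Python sketch of the general idea. It is not Dywave itself. An undecimated Haar transform supplies multi-scale detail energy, a boundary is placed wherever that energy crosses a threshold, and each resulting variable-length segment is pooled into one token. Several pieces are assumptions for illustration only:

- the Haar filter and the edge handling;
- the mean-energy threshold;
- the names haar_undecimated_details, event_boundaries, pool_segments, ratio and min_len, which are hypothetical.

The paper describes learnable, instance-specific segmentation rather than a fixed threshold.

```python
def haar_undecimated_details(x, levels):
    """Undecimated (MODWT-style) Haar detail coefficients for levels 1..levels.

    Level j compares samples 2**(j-1) apart, so every level keeps len(x) values.
    Edges repeat the first sample (an assumption made for this sketch).
    """
    v = list(x)
    details = []
    for j in range(1, levels + 1):
        s = 2 ** (j - 1)
        prev = [v[max(t - s, 0)] for t in range(len(v))]
        details.append([(a - b) / 2 for a, b in zip(v, prev)])  # detail (wavelet) part
        v = [(a + b) / 2 for a, b in zip(v, prev)]              # approximation (scaling) part
    return details, v


def event_boundaries(x, levels=3, ratio=2.0, min_len=4):
    """Segment start indices: a boundary is placed wherever the summed multi-scale
    detail energy crosses ratio * mean energy (onset or end of a burst)."""
    details, _ = haar_undecimated_details(x, levels)
    energy = [sum(d[t] ** 2 for d in details) for t in range(len(x))]
    threshold = ratio * sum(energy) / len(energy)
    above = [e > threshold for e in energy]
    starts = [0]
    for t in range(1, len(x)):
        # only cut when the burst state flips and the current segment is long enough
        if above[t] != above[t - 1] and t - starts[-1] >= min_len:
            starts.append(t)
    return starts


def pool_segments(x, starts):
    """One token per segment: the segment mean, standing in for a learned embedding."""
    ends = starts[1:] + [len(x)]
    return [sum(x[a:b]) / (b - a) for a, b in zip(starts, ends)]


signal = [0.0] * 32 + [5.0] * 32
starts = event_boundaries(signal)
tokens = pool_segments(signal, starts)
print(starts, tokens)  # [0, 32, 38] [0.0, 5.0, 5.0]
```

On this step signal, the quiet interval, the transition burst and the new level each become one token. Fixed patches of length 8 would produce eight tokens instead. A real tokenizer would embed each segment rather than average it.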

If this is right

  • Mainstream sequence models achieve up to 12% higher accuracy on activity recognition, stress assessment, and object detection tasks.
  • Input sequences shrink by up to 75%, directly lowering memory and compute requirements during inference.
  • The tokenization remains effective under domain shifts and across sequences of different lengths.
  • Gains appear consistently across multiple real-world IoT datasets and several standard sequence architectures.
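One rough way to read the 75% figure is through a common first-order cost model. In that model, self-attention scales quadratically with sequence length, while the per-token projections and MLP scale linearly. The sketch below shows how much each part shrinks when a quarter of the tokens remain. It ignores the tokenizer's own overhead and other fixed costs, so it is an optimistic estimate, not a measured result; the paper's on-device profiling (Figure 8) is the relevant evidence. The function name relative_cost is hypothetical.

```python
def relative_cost(token_fraction):
    """Relative per-layer cost when a Transformer sees token_fraction * N tokens.

    Assumes attention-score computation scales as N**2 and projections/MLP as N
    (a standard first-order estimate that ignores tokenizer overhead).
    """
    return {"attention": token_fraction ** 2, "linear": token_fraction}


cost = relative_cost(0.25)  # a 75% reduction leaves 25% of the tokens
print(cost)  # {'attention': 0.0625, 'linear': 0.25}
```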

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The unsupervised boundary detection could extend to other non-stationary time series such as audio or environmental monitoring streams.
  • Shorter tokenized inputs would lower energy use on battery-powered edge devices that run continuous sensing.
  • The same alignment principle might be combined with learned boundary predictors for cases where wavelets alone miss subtle events.
  • Large-scale unlabeled IoT collections could adopt this tokenization without the cost of event annotation.

Load-bearing premise

Wavelet-based hierarchical decomposition can reliably locate temporal boundaries that correspond to semantic events in the signals without any task-specific supervision or labeled annotations.

What would settle it

An IoT dataset in which the boundaries found by wavelet decomposition show no correspondence to actual changes in the underlying physical process, so that the dynamic tokens yield accuracy no higher than standard fixed-length tokenization.
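Running this test requires scoring whether detected boundaries line up with real changes. One standard choice from change-point and segmentation evaluation is precision and recall within a tolerance window. The sketch below uses greedy one-to-one matching. The function name, the tolerance and the boundary indices are illustrative assumptions, not values from the paper.

```python
def boundary_precision_recall(predicted, reference, tolerance):
    """Greedy one-to-one matching: a predicted boundary is a hit if it lies within
    `tolerance` samples of a reference boundary that has not been matched yet."""
    unmatched = sorted(reference)
    hits = 0
    for p in sorted(predicted):
        match = next((r for r in unmatched if abs(p - r) <= tolerance), None)
        if match is not None:
            unmatched.remove(match)
            hits += 1
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(reference) if reference else 0.0
    return precision, recall


p, r = boundary_precision_recall([10, 52, 90, 140], [12, 50, 120], tolerance=3)
print(p, r)  # 2 of 4 predictions hit; 2 of 3 reference changes are found
```

Under the scenario above, wavelet boundaries would score no better than a length-matched random segmentation on this metric.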

Figures

Figures reproduced from arXiv: 2605.14014 by Denizhan Kara, Hongjue Zhao, Jinyang Li, Shengzhong Liu, Tarek Abdelzaher, Tomoyoshi Kimura, Xiaomin Ouyang, Yigong Hu, Yizhuo Chen.

Figure 1
Figure 1: Ego4D (HAR) raw signal examples. Signal events are manually annotated with red bounding boxes. … using IMU signals, brief motion gestures (e.g., waving) may occur within a second, while complex activities (e.g., walking) can span tens of seconds and vary in intensity. Moreover, real-world signals exhibit highly irregular information density, with quiescent intervals alternating with short bursts of salient …
Figure 2
Figure 2: Overview of Dywave. … different users produce signals that vary greatly in temporal structure and intensity. To illustrate this variability, Figure 1 visualizes 30-second accelerometer samples from the Ego4D human activity recognition dataset (Grauman et al., 2022), comparing signals of cleaning activity across users and time periods, as well as the reading activity. Even within the same activity, signal patte…
Figure 3
Figure 3: Short-context performance vs. different parameters. … is a fixed-length time-series segment used for short- and long-context classification. Baselines. We consider 5 baselines compatible with various backbones: PatchTST (Nie et al., 2023), DropPatch (Qiu et al., 2025), MedFormer (Wang et al., 2024), WaveToken (Masserano et al., 2025), and MultiPatch (Naghashi et al., 2025). We evaluate them using two sequen…
Figure 4
Figure 4: Short-context Accuracy vs. Token with the Transformer encoder. [Plot labels: PAMAP2 and RWHAR panels; Accuracy and F1 Score; PatchTST vs. Dywave on Acc and Gyr]
Figure 5
Figure 5: Multimodal classification accuracy. …eters, requiring extensive grid search, while Dywave uses learnable, instance-specific segmentation. Moreover, WaveToken performs notably worse than the other baselines. Discretizing the input into quantized token IDs appears ill-suited for high-frequency sensing data with rich dynamics, as it disrupts the fine-grained amplitude and temporal coherence essential for signal c…
Figure 6
Figure 6: Long-context classification performance with the Transformer encoder. Panels: (a) MOD - Audio; (b) Ego4D - Accelerometer.
Figure 7
Figure 7: Long-context token distribution with the Transformer encoder. … on Ego4D, where 30-second sequences contain multiple heterogeneous sub-events (posture transitions, hand-object interactions, environmental perturbations). Fixed-size tokenization mixes unrelated actions and obscures fine-grained transitions, while Dywave dynamically identifies semantic boundaries that align with activity transitions, enabling …
Figure 8
Figure 8: On-device (Raspberry Pi 4) Profiling. … context settings but sacrifices efficiency with a much higher token count. In long-context settings (MOD audio), it reduces input length but suffers greater accuracy degradation. This implies non-anchor segments contain meaningful cues and should not be discarded. Dynamic fusion is crucial for achieving compact representations without sacrificing accuracy. Using spec…
Figure 9
Figure 9: Inference robustness with random noise injection. … with heterogeneous dynamics, Dywave's token compression substantially reduces backbone computation, and the advantage grows with longer context windows or larger encoder models. This makes Dywave particularly well-suited for real-world deployments where signals are long, heterogeneous, and resource constraints are tight. 4.7. Inference Noise Robustness …
Figure 10
Figure 10: Physics-Informed Hierarchical Embedding Module. … Here, W̃_j denote the discrete wavelet coefficients at level j ∈ [1, J], A_j the corresponding approximations, h̃_j and g̃_j the rescaled wavelet and scaling filters, and L_j the effective filter length. Since MODWT is undecimated, both W̃_j and A_j preserve the full temporal resolution of the original sequence L. The recursive formulation produces a hierarchy of a…
Figure 11
Figure 11: Temporal Anchor Formation Module. … capture abrupt, high-frequency transients such as wrist flicks or foot impacts, while the context embedding E^V interprets these as transitions between broader activity phases, such as moving from walking to standing or from wiping to resting. To integrate these complementary views, we fuse the two embeddings into a unified hierarchical embedding: E^F = E^U ‖ E^V, E^F ∈ ℝ…
Figure 12
Figure 12: Temporal Fusion Module. … uniform patching wastes computation on such redundant information. Dynamic temporal fusion addresses this by adaptively compressing coherent regions while preserving semantic integrity at event boundaries …
Figure 13
Figure 13: Sensitivity analysis on anchor budget. [Plot labels: RWHAR and PAMAP2 accuracy and # tokens vs. λ_rec]
Figure 14
Figure 14: Sensitivity analysis on reconstruction loss λ_rec. … reduction in input sequence length. The gap is more significant with long-context inputs in Ego4D. In these scenarios, PatchTST produces hundreds of tokens, leading to rapidly increasing latency with sequence length. In contrast, Dywave adaptively compresses long stationary regions into a small number of semantically coherent input tokens with up to an ord…
Figure 15
Figure 15: Ego4D Boundary Visualization. Signal events are manually annotated with red bounding boxes. … different users perform the same activity (row 2), with user-dependent rhythms and intensities. Comparing different activities (row 3) further highlights changes in temporal density and dynamic range, reflecting the inherent heterogeneity of real-world motion across the samples. Under such diverse conditions, token…
Figure 16
Figure 16: Example of micro-activity decomposition with Dywave on Ego4D. … conducts a qualitative case study of Dywave's capability in mitigating this challenge on the Ego4D dataset (Grauman et al., 2022), which provides synchronized egocentric video and IMU signals during daily activities such as cooking, crafting, and household management. We extract 15-second continuous IMU segments and apply Dywave to the accelero…
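The undecimated property noted in the Figure 10 excerpt can be checked with the textbook Haar MODWT pyramid algorithm (Percival and Walden), sketched below with circular boundaries. The Haar choice and the function name haar_modwt are assumptions, because this summary does not say which wavelet family Dywave uses. Every level returns as many coefficients as there are input samples. The squared coefficients across all levels, plus the final approximation, sum to the signal's energy.

```python
def haar_modwt(x, levels):
    """Haar MODWT via the pyramid algorithm with circular boundaries.

    Rescaled filters: wavelet (1/2, -1/2), scaling (1/2, 1/2); at level j the two
    taps are 2**(j-1) samples apart, so no downsampling ever happens.
    """
    n = len(x)
    approx = list(x)
    coeffs = []
    for j in range(1, levels + 1):
        s = 2 ** (j - 1)
        lagged = [approx[(t - s) % n] for t in range(n)]  # circular lag
        coeffs.append([(a - b) / 2 for a, b in zip(approx, lagged)])  # W~_j
        approx = [(a + b) / 2 for a, b in zip(approx, lagged)]        # A_j
    return coeffs, approx


x = [1.0, 3.0, 2.0, 5.0, 4.0, 4.0, 0.0, 1.0]
W, A = haar_modwt(x, levels=3)
print([len(w) for w in W], len(A))  # [8, 8, 8] 8
energy = sum(v * v for w in W for v in w) + sum(v * v for v in A)
print(energy, sum(v * v for v in x))  # 72.0 72.0
```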
Original abstract

Internet of Things (IoT) systems continuously collect heterogeneous sensing signals from ubiquitous sensors to support intelligent applications such as human activity analysis, emotion monitoring, and environmental perception. These signals are inherently non-stationary and multi-scale, posing unique challenges for standard tokenization techniques. This paper proposes Dywave, a dynamic tokenization framework for IoT sensing signals that constructs compact input representations aligned with intrinsic temporal structures and underlying physical events. Dywave leverages wavelet-based hierarchical decomposition, identifies meaningful temporal boundaries corresponding to underlying semantic events, and adaptively compresses redundant intervals while preserving temporal coherence. Extensive evaluations on five real-world IoT sensing datasets across activity recognition, stress assessment, and nearby object detection demonstrate that Dywave outperforms state-of-the-art methods by up to 12% in accuracy, while improving computational efficiency by reducing input token lengths by up to 75% across mainstream sequence models. Moreover, Dywave exhibits improved robustness to domain shifts and varying sequence lengths.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

1 major / 2 minor

Summary. The manuscript proposes Dywave, a dynamic tokenization framework for heterogeneous IoT sensing signals. It employs wavelet-based hierarchical decomposition to detect temporal boundaries aligned with intrinsic structures and underlying semantic events, adaptively compressing redundant intervals while preserving coherence. On five real-world datasets spanning activity recognition, stress assessment, and object detection, Dywave is reported to improve accuracy by up to 12% and reduce token lengths by up to 75% relative to prior methods when paired with standard sequence models, with added robustness to domain shifts.

Significance. If the claimed semantic-event alignment holds and explains the gains beyond generic compression, the work could meaningfully advance efficient tokenization for non-stationary sensor streams in resource-limited IoT settings. The breadth of datasets and models tested provides a reasonable empirical foundation, though the source of the reported improvements requires clearer isolation.

major comments (1)
  1. [Abstract] The central attribution of up to 12% accuracy gains and 75% token-length reduction to 'meaningful temporal boundaries corresponding to underlying semantic events' identified without supervision is load-bearing. No alignment metric, comparison against ground-truth event labels, or ablation against random or non-semantic adaptive boundaries is described, leaving open that gains may derive from variable-length compression alone rather than semantic correspondence.
minor comments (2)
  1. [Abstract] The abstract refers to 'five real-world IoT sensing datasets' without naming them; explicit dataset citations and characteristics would improve reproducibility.
  2. [Methodology] Clarify the specific wavelet family, decomposition depth selection criterion, and any hyperparameters governing boundary detection to support exact replication.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address the major comment below and describe the revisions we will make to strengthen the empirical isolation of our claims.

Point-by-point responses
  1. Referee: [Abstract] The central attribution of up to 12% accuracy gains and 75% token-length reduction to 'meaningful temporal boundaries corresponding to underlying semantic events' identified without supervision is load-bearing. No alignment metric, comparison against ground-truth event labels, or ablation against random or non-semantic adaptive boundaries is described, leaving open that gains may derive from variable-length compression alone rather than semantic correspondence.

    Authors: We appreciate the referee's observation that the attribution to unsupervised semantic-event alignment is central and requires stronger isolation from generic compression effects. Our wavelet hierarchical decomposition identifies boundaries via multi-scale energy thresholding and coefficient persistence, which we posit capture intrinsic signal transients corresponding to physical events; however, we agree that this requires explicit validation beyond the current results. In the revised manuscript we will add a dedicated ablation section that compares Dywave against (i) random boundaries sampled from the same segment-length distribution, (ii) uniform fixed-length segmentation, and (iii) non-wavelet adaptive methods without semantic priors, such as entropy-based or change-point segmentation. All ablations will be evaluated on the same five datasets and downstream models, reporting both accuracy and token-length metrics. Where activity-transition timestamps are available as proxy ground truth (e.g., in the activity-recognition datasets), we will also report boundary-alignment metrics such as the precision and recall of detected event boundaries against annotated intervals. These additions will clarify whether the observed gains exceed those attributable to variable-length compression alone.

    revision: yes
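The non-semantic controls in (i) and (ii) are simple to construct. The minimal sketch below assumes that "the same length distribution" means reusing the detected segment lengths in shuffled order; other readings are possible. The function names, the example boundaries and the seed are hypothetical.

```python
import random


def random_boundaries_like(starts, n, seed=0):
    """Length-matched random control: shuffle the detected segment lengths,
    then rebuild start indices, so only the placement of boundaries changes."""
    ends = starts[1:] + [n]
    lengths = [b - a for a, b in zip(starts, ends)]
    random.Random(seed).shuffle(lengths)
    out, pos = [], 0
    for length in lengths:
        out.append(pos)
        pos += length
    return out


def uniform_boundaries(n, patch_len):
    """Fixed-length, non-overlapping patching."""
    return list(range(0, n, patch_len))


detected = [0, 32, 38]  # e.g. starts produced by an adaptive tokenizer on 64 samples
control = random_boundaries_like(detected, 64, seed=1)
print(control, uniform_boundaries(64, 16))
```

The length-matched control keeps the token count identical to the adaptive method, so any accuracy gap isolates boundary placement rather than compression ratio.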

Circularity Check

0 steps flagged

No significant circularity detected

Full rationale

The paper presents Dywave as a wavelet-based dynamic tokenization method evaluated empirically on five external real-world IoT datasets for tasks like activity recognition. No load-bearing steps reduce by construction to fitted parameters from the target data, self-citations, or definitional renaming; performance gains and token reductions are reported as outcomes of the proposed decomposition applied to standard sequence models. The derivation chain remains self-contained against external benchmarks with no evidence of the patterns that would indicate circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only abstract available; no explicit free parameters, axioms, or invented entities are stated. Typical implicit assumptions include that wavelet scales capture semantic events and that compression preserves all task-relevant information.
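The second implicit assumption can be probed crudely without any model. Rebuild the signal by repeating each segment's pooled value, then measure what is lost. The sketch below assumes mean pooling, which the paper does not necessarily use; the paper trains with a reconstruction loss weighted by λrec. The function name is hypothetical. In the example, one coarse segment over a zero-mean oscillation discards it entirely, while finer segments keep it.

```python
def piecewise_reconstruction_error(x, starts):
    """Relative squared error of rebuilding x by repeating each segment's mean
    over its span: a crude proxy for what mean-pooled tokens throw away."""
    ends = starts[1:] + [len(x)]
    recon = []
    for a, b in zip(starts, ends):
        mean = sum(x[a:b]) / (b - a)
        recon.extend([mean] * (b - a))
    err = sum((u - v) ** 2 for u, v in zip(x, recon))
    total = sum(u * u for u in x)
    return err / total if total else 0.0


x = [0.0] * 8 + [4.0, -4.0] * 4 + [0.0] * 8
print(piecewise_reconstruction_error(x, [0, 8, 16]))  # 1.0: the oscillation averages to zero
print(piecewise_reconstruction_error(x, [0, 8, 9, 10, 11, 12, 13, 14, 15, 16]))  # 0.0
```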

pith-pipeline@v0.9.0 · 5734 in / 1115 out tokens · 40625 ms · 2026-05-20T20:31:34.091409+00:00 · methodology

Reference graph

Works this paper leans on

300 extracted references · 300 canonical work pages · 3 internal anchors

  1. [1]

    Ahia, O., Kumar, S., Gonen, H., Kasai, J., Mortensen, D. R., Smith, N. A., and Tsvetkov, Y. Do all languages cost the same? tokenization in the era of commercial language models

  2. [2]

    Alharbi, R., Shahi, S., Cruz, S., Li, L., Sen, S., Pedram, M., Romano, C., Hester, J., Katsaggelos, A. K., and Alshurafa, N. Smokemon: unobtrusive extraction of smoking topography using wearable energy-efficient thermal. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies, 6(4): 1–25, 2023

  3. [3]

    Ansari, A. F., Stella, L., Turkmen, A. C., Zhang, X., Mercado, P., Shen, H., Shchur, O., Rangapuram, S. S., Arango, S. P., Kapoor, S., et al. Chronos: Learning the language of time series. Transactions on Machine Learning Research

  4. [4]

    Baris, O., Chen, Y., Dong, G., Han, L., Kimura, T., Quan, P., Wang, R., Wang, T., Abdelzaher, T., Bergés, M., et al. Foundation models for cps-iot: Opportunities and challenges. arXiv preprint arXiv:2501.16368, 2025

  5. [5]

    Bommasani, R., Hudson, D. A., Adeli, E., Altman, R., Arora, S., von Arx, S., Bernstein, M. S., Bohg, J., Bosselut, A., Brunskill, E., et al. On the opportunities and risks of foundation models. arXiv e-prints, pp. arXiv–2108, 2021

  6. [6]

    Cao, D., Jia, F., Arik, S. O., Pfister, T., Zheng, Y., Ye, W., and Liu, Y. Tempo: Prompt-based generative pre-trained transformer for time series forecasting. In The Twelfth International Conference on Learning Representations

  7. [7]

    Cao, Y., Tian, Z., Guo, W., and Liu, X. Mspatch: A multi-scale patch mixing framework for multivariate time series forecasting. Expert Systems with Applications, 273: 126849, 2025

  8. [8]

    Chang, C., Wang, W.-Y., Peng, W.-C., and Chen, T.-F. Llm4ts: Aligning pre-trained llms as data-efficient time-series forecasters. ACM Transactions on Intelligent Systems and Technology, 16(3): 1–20, 2025

  9. [9]

    Chatterjee, S., Moreno, A., Lizotte, S. L., Akther, S., Ertin, E., Fagundes, C. P., Lam, C., Rehg, J. M., Wan, N., Wetter, D. W., et al. Smokingopp: Detecting the smoking 'opportunity' context using mobile sensors. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies, 4(1): 1–26, 2020

  10. [10]

    Chen, L., Hu, R., Wu, M., and Zhou, X. Hmgan: A hierarchical multi-modal generative adversarial network model for wearable human activity recognition. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies, 7(3): 1–27, 2023

  11. [11]

    Dao, T. and Gu, A. Transformers are ssms: Generalized models and efficient algorithms through structured state space duality. In International Conference on Machine Learning, pp. 10041–10071. PMLR, 2024

  12. [12]

    Das, A., Kong, W., Sen, R., and Zhou, Y. A decoder-only foundation model for time-series forecasting. In Forty-first International Conference on Machine Learning, 2024

  13. [13]

    Deldari, S., Xue, H., Saeed, A., Smith, D. V., and Salim, F. D. Cocoa: Cross modality contrastive learning for sensor data. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies, 6(3): 1–28, 2022

  14. [14]

    Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al. An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations, 2020

  15. [15]

    Ekambaram, V., Jati, A., Nguyen, N., Sinthong, P., and Kalagnanam, J. Tsmixer: Lightweight mlp-mixer model for multivariate time series forecasting. In Proceedings of the 29th ACM SIGKDD conference on knowledge discovery and data mining, pp. 459–469, 2023

  16. [16]

    Englhardt, Z., Ma, C., Morris, M. E., Chang, C.-C., Xu, X. O., Qin, L., McDuff, D., Liu, X., Patel, S., and Iyer, V. From classification to clinical insights: Towards analyzing and reasoning about mobile and behavioral health data with large language models. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies, 8(2): 1-...

  17. [17]

    Gao, Z., Wang, Y., Chen, J., Xing, J., Patel, S., Liu, X., and Shi, Y. Mmtsa: Multi-modal temporal segment attention network for efficient human activity recognition. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies, 7(3): 1–26, 2023

  18. [18]

    Götz, L., Kollovieh, M., Günnemann, S., and Schwinn, L. Byte pair encoding for efficient time series forecasting. arXiv preprint arXiv:2505.14411, 2025

  19. [19]

    Graps, A. An introduction to wavelets. IEEE Computational Science and Engineering, 2(2): 50–61, 1995

  20. [20]

    Grauman, K., Westbury, A., Byrne, E., Chavis, Z., Furnari, A., Girdhar, R., Hamburger, J., Jiang, H., Liu, M., Liu, X., et al. Ego4d: Around the world in 3,000 hours of egocentric video. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 18995–19012, 2022

  21. [21]

    He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, pp. 770–778, 2016

  22. [22]

    Hu, C., Chen, Y., Kara, D., Liu, S., Abdelzaher, T., Wu, F., and Chen, G. Openmae: efficient masked autoencoder for vibration sensing with open-domain data enrichment. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies, 9(2): 1–29, 2025

  23. [23]

    Jin, M., Wang, S., Ma, L., Chu, Z., Zhang, J. Y., Shi, X., Chen, P.-Y., Liang, Y., Li, Y.-F., Pan, S., et al. Time-llm: Time series forecasting by reprogramming large language models. In The Twelfth International Conference on Learning Representations

  24. [24]

    Kara, D., Kimura, T., Chen, Y., Li, J., Wang, R., Chen, Y., Wang, T., Liu, S., and Abdelzaher, T. Phymask: An adaptive masking paradigm for efficient self-supervised learning in iot. In Proceedings of the 22nd ACM Conference on Embedded Networked Sensor Systems, pp. 97–111, 2024a

  25. [25]

    Kara, D., Kimura, T., Shengzhong, L., Jinyang, L., Dongxin, L., Tianshi, W., Ruijie, W., Yizhuo, C., Yigong, H., and Tarek, A. Freqmae: Frequency-aware masked autoencoder for multi-modal iot sensing. In The World Wide Web Conference, 2024b

  26. [26]

    Kawano, H., Okamoto, M., and Murao, K. Estimating sampling rate of human activity data from accelerometer using transformer-based regression model. In Adjunct Proceedings of the 2023 ACM International Joint Conference on Pervasive and Ubiquitous Computing & the 2023 ACM International Symposium on Wearable Computing, pp. 200–201, 2023

  27. [27]

    Kim, G., Yeo, D., Jo, T., Rus, D., and Kim, S. What and when to explain? on-road evaluation of explanations in highly automated vehicles. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies, 7(3): 1–26, 2023

  28. [28]

    Kimura, T., Li, J., Wang, T., Chen, Y., Wang, R., Kara, D., Wigness, M., Bhattacharyya, J., Srivatsa, M., Liu, S., et al. Vibrofm: Towards micro foundation models for robust multimodal iot sensing. In 2024 IEEE 21st International Conference on Mobile Ad-Hoc and Smart Systems (MASS), pp. 10–18. IEEE, 2024

  29. [29]

    Kimura, T., Li, X., Hanna, O., Chen, Y., Chen, Y., Kara, D., Wang, T., Li, J., Ouyang, X., Liu, S., et al. Infomae: Pair-efficient cross-modal alignment for multimodal time-series sensing signals. In Proceedings of the ACM on Web Conference 2025, pp. 3084–3095, 2025

  30. [30]

    Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014

  31. [31]

    Korany, B., Karanam, C. R., Cai, H., and Mostofi, Y. Xmodal-id: Using wifi for through-wall person identification from candidate video footage. In The 25th Annual International Conference on Mobile Computing and Networking, MobiCom '19, New York, NY, USA, 2019. Association for Computing Machinery. ISBN 9781450361699. doi:10.1145/3300061.3345437. URL https...

  32. [32]

    Kudo, T. and Richardson, J. Sentencepiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. EMNLP 2018, pp. 66, 2018

  33. [33]

    Larrubia, L. F., Morettin, P. A., and Chiann, C. The maximal overlap discrete wavelet scattering transform and its application in classification tasks. arXiv preprint arXiv:2506.12039, 2025

  34. [34]

    Lee, G., Gommers, R., Waselewski, F., Wohlfahrt, K., and O'Leary, A. Pywavelets: A python package for wavelet analysis. Journal of Open Source Software, 4(36): 1237, 2019

  35. [35]

    Lina, J.-M. and Mayrand, M. Complex daubechies wavelets. Applied and Computational Harmonic Analysis, 2(3): 219–229, 1995

  36. [36]

    Liu, S., Kimura, T., Liu, D., Wang, R., Li, J., Diggavi, S., Srivastava, M., and Abdelzaher, T. Focal: Contrastive learning for multimodal time-series sensing signals in factorized orthogonal latent space. Advances in Neural Information Processing Systems, 36, 2023

  37. [37]

    Masserano, L., Ansari, A. F., Han, B., Zhang, X., Faloutsos, C., Mahoney, M. W., Wilson, A. G., Park, Y., Rangapuram, S. S., Maddix, D. C., et al. Enhancing foundation models for time series forecasting via wavelet-based tokenization. In Forty-second International Conference on Machine Learning, 2025

  38. [38]

    Naghashi, V., Boukadoum, M., and Diallo, A. B. A multiscale model for multivariate time series forecasting. Scientific Reports, 15(1): 1565, 2025

  39. [39]

    Nath, R. K., Tervonen, J., Närväinen, J., Pettersson, K., and Mäntyjärvi, J. Towards self-supervised learning of ecg signal representation for the classification of acute stress types. In Proceedings of the Great Lakes Symposium on VLSI 2023, pp. 85–90, 2023

  40. [40]

    Nawrot, P., Tworkowski, S., Tyrolski, M., Kaiser, Ł., Wu, Y., Szegedy, C., and Michalewski, H. Hierarchical transformers are more efficient language models

  41. [41]

    Nie, Y., Nguyen, N. H., Sinthong, P., and Kalagnanam, J. A time series is worth 64 words: Long-term forecasting with transformers. In The Eleventh International Conference on Learning Representations, 2023

  42. [42]

    Ouyang, X., Shuai, X., Zhou, J., Shi, I. W., Xie, Z., Xing, G., and Huang, J. Cosmo: Contrastive fusion learning with small data for multimodal human activity recognition. In International Conference on Mobile Computing And Networking (MobiCom), 2022

  43. [43]

    Percival, D. B. and Walden, A. T. Wavelet methods for time series analysis, volume 4. Cambridge university press, 2000

  44. [44]

    Petrov, A., La Malfa, E., Torr, P., and Bibi, A. Language model tokenizers introduce unfairness between languages. Advances in neural information processing systems, 36: 36963–36990, 2023

  45. [45]

    Piao, X., Chen, Z., Murayama, T., Matsubara, Y., and Sakurai, Y. Fredformer: Frequency debiased transformer for time series forecasting. In Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pp. 2400-2410, 2024

  46. [46]

    Qiu, T., Xie, Y., Niu, H., Xiong, Y., and Gao, X. Enhancing masked time-series modeling via dropping patches. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 39, pp. 20077-20085, 2025

  47. [47]

    Rao, Y., Zhao, W., Liu, B., Lu, J., Zhou, J., and Hsieh, C.-J. DynamicViT: Efficient vision transformers with dynamic token sparsification. Advances in Neural Information Processing Systems, 34: 13937-13949, 2021

  48. [48]

    Reiss, A. and Stricker, D. Introducing a new benchmarked dataset for activity monitoring. In International Symposium on Wearable Computers (ISWC), 2012

  49. [49]

    Ren, J., Zheng, R., Zhang, W., She, D., Bai, Y., Jin, Z., and Gao, Y. Motion2Press: Cross model learning from IMU to plantar pressure for gait analysis. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies, 9(3): 1-33, 2025

  50. [50]

    Ryoo, M., Piergiovanni, A., Arnab, A., Dehghani, M., and Angelova, A. TokenLearner: Adaptive space-time tokenization for videos. Advances in Neural Information Processing Systems, 34: 12786-12797, 2021

  51. [51]

    Saha, M., Xu, M. A., Mao, W., Neupane, S., Rehg, J. M., and Kumar, S. Pulse-PPG: An open-source field-trained PPG foundation model for wearable applications across lab and field settings. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies, 9(3): 1-35, 2025

  52. [52]

    Schmidt, P., Reiss, A., Duerichen, R., Marberger, C., and Van Laerhoven, K. Introducing WESAD, a multimodal dataset for wearable stress and affect detection. In Proceedings of the 20th ACM International Conference on Multimodal Interaction, pp. 400-408, 2018a

  53. [53]

    Schmidt, P., Reiss, A., Dürichen, R., Marberger, C., and Laerhoven, K. V. Introducing WESAD, a multimodal dataset for wearable stress and affect detection. In ICMI 2018, pp. 400-408. ACM, 2018b. doi:10.1145/3242969.3242985

  54. [54]

    Sennrich, R., Haddow, B., and Birch, A. Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 1715-1725, 2016

  55. [55]

    Shams, S., Dindar, S. S., Jiang, X., and Mesgarani, N. SSAMBA: Self-supervised audio representation learning with Mamba state space model. In 2024 IEEE Spoken Language Technology Workshop (SLT), pp. 1053-1059. IEEE, 2024

  56. [56]

    Sztyler, T. and Stuckenschmidt, H. On-body localization of wearable devices: An investigation of position-aware activity recognition. In IEEE International Conference on Pervasive Computing and Communications (PerCom), 2016

  57. [57]

    Tao, C., Liu, Q., Dou, L., Muennighoff, N., Wan, Z., Luo, P., Lin, M., and Wong, N. Scaling laws with vocabulary: Larger models deserve larger vocabularies. Advances in Neural Information Processing Systems, 37: 114147-114179, 2024

  58. [58]

    Truong, C., Oudre, L., and Vayatis, N. Selective review of offline change point detection methods. Signal Processing, 167: 107299, 2020

  59. [59]

    Ullah, M. A., Chatterjee, S., Fagundes, C. P., Lam, C., Nahum-Shani, I., Rehg, J. M., Wetter, D. W., and Kumar, S. mRisk: continuous risk estimation for smoking lapse from noisy sensor data with incomplete and positive-only labels. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies, 6(3): 1-29, 2022

  60. [60]

    Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. Attention is all you need. In Advances in Neural Information Processing Systems, pp. 5998-6008, 2017

  61. [61]

    Wang, L., Li, W., Sun, K., Zhang, F., Gu, T., Xu, C., and Zhang, D. LoEar: Push the range limit of acoustic sensing for vital sign monitoring. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies, 6(3): 1-24, 2022

  62. [62]

    Wang, X., Zhang, H., Cao, L., Zeng, K., Li, Q., Li, N., and Feng, L. Contrastive learning of stress-specific word embedding for social media based stress detection. In Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pp. 5137-5149, 2023a

  63. [63]

    Wang, Y., Huang, N., Li, T., Yan, Y., and Zhang, X. Medformer: A multi-granularity patching transformer for medical time-series classification. Advances in Neural Information Processing Systems, 37: 36314-36341, 2024

  64. [64]

    Wang, Y., Qiu, Y., Chen, P., Shu, Y., Rao, Z., Pan, L., Yang, B., and Guo, C. LightGTS: A lightweight general time series forecasting model. In International Conference on Machine Learning, pp. 64109-64126. PMLR, 2025

  65. [65]

    Wang, Z., Wang, Y., Tian, M., and Shen, J. HearFire: Indoor fire detection via inaudible acoustic sensing. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies, 6(4): 1-25, 2023b

  66. [66]

    Yao, S., Hu, S., Zhao, Y., Zhang, A., and Abdelzaher, T. DeepSense: A unified deep learning framework for time-series mobile sensing data processing. In International Conference on World Wide Web (WWW), 2017

  67. [67]

    Yao, S., Piao, A., Jiang, W., Zhao, Y., Shao, H., Liu, S., Liu, D., Li, J., Wang, T., Hu, S., et al. STFNets: Learning sensing signals from the time-frequency perspective with short-time Fourier neural networks. In The World Wide Web Conference, pp. 2192-2202, 2019

  68. [68]

    Yi, K., Zhang, Q., Fan, W., Wang, S., Wang, P., He, H., An, N., Lian, D., Cao, L., and Niu, Z. Frequency-domain MLPs are more effective learners in time series forecasting. Advances in Neural Information Processing Systems, 36: 76656-76679, 2023

  69. [69]

    Yu, H. and Sano, A. Semi-supervised learning for wearable-based momentary stress detection in the wild. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies, 7(2): 1-23, 2023

  70. [70]

    Zakaria, C., Yilmaz, G., Mammen, P. M., Chee, M., Shenoy, P., and Balan, R. SleepMore: Inferring sleep duration at scale via multi-device WiFi sensing. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies, 6(4): 1-32, 2023

  71. [71]

    Zhang, X., Zhao, Z., Tsiligkaridis, T., and Zitnik, M. Self-supervised contrastive pre-training for time series via time-frequency consistency. In Neural Information Processing Systems (NeurIPS), 2022

  72. [72]

    Zhang, Y., Ayush, K., Qiao, S., Heydari, A. A., Narayanswamy, G., Xu, M. A., Metwally, A., Xu, J., Garrison, J., Xu, X., Althoff, T., Liu, Y., Kohli, P., Zhan, J., Malhotra, M., Patel, S., Mascolo, C., Liu, X., McDuff, D., and Yang, Y. SensorLM: Learning the language of wearable sensors. In The Thirty-ninth Annual Conference on Neural Information Proces...

  73. [73]

    Zheng, N., Liu, R., Fan, X., Zhang, C., Zhang, L., and Yin, Z. SegAll: A unified active learning framework for wireless sensing data segmentation. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies, 9(3): 1-27, 2025

  74. [74]

    Zhong, S., Song, S., Zhuo, W., Li, G., Liu, Y., and Chan, S.-H. G. A multi-scale decomposition MLP-mixer for time series analysis. Proceedings of the VLDB Endowment, 17(7): 1723-1736, 2024

  75. [75]

    Zhou, T., Niu, P., Sun, L., Jin, R., et al. One fits all: Power general time series analysis by pretrained LM. Advances in Neural Information Processing Systems, 36: 43322-43355, 2023

  76. [76]

    Zou, X., You, C., Zhao, R., Yang, H., and Cheng, X. ScaleMixer: A multi-scale MLP-mixer model for long-term time series forecasting. In International Conference on Neural Information Processing, pp. 44-58. Springer, 2024

  77. [77]

    TensorFlow: A system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16)

  78. [78]

    Toward an internet of battlefield things: A resilience perspective. Computer

  79. [79]

    Five challenges in cloud-enabled intelligence and control. ACM Transactions on Internet Technology (TOIT)

  80. [80]

    Understanding of a convolutional neural network. In 2017 International Conference on Engineering and Technology (ICET)