pith. machine review for the scientific record

arxiv: 2604.10544 · v1 · submitted 2026-04-12 · 💻 cs.LG · cs.AI

Recognition: unknown

WaveMoE: A Wavelet-Enhanced Mixture-of-Experts Foundation Model for Time Series Forecasting

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 15:38 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords time series forecasting · mixture of experts · wavelet transform · foundation models · frequency domain · dual-path architecture · expert routing

The pith

WaveMoE processes time series and wavelet tokens together through shared mixture-of-experts routing to capture frequency patterns in forecasting.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces WaveMoE as a way to add explicit frequency-domain information to large time series foundation models. It does this with a dual-path design in which raw time tokens and wavelet-derived tokens are aligned on the same time axis and routed to the same pool of experts. The goal is to let the model handle both slow-moving trends and periodic or bursty high-frequency behavior without separate preprocessing stages. Preliminary experiments across 16 varied benchmark datasets suggest potential gains once wavelet-domain data is folded into the pretraining corpus.

Core claim

WaveMoE adopts a dual-path architecture that jointly processes time series tokens and wavelet tokens aligned along a unified temporal axis, and coordinates them through a shared expert routing mechanism that enables consistent expert specialization while efficiently scaling model capacity. This setup supplies explicit frequency-domain representations that help model periodicity and localized high-frequency dynamics prevalent in real-world series.

What carries the argument

Dual-path architecture with time series tokens and wavelet tokens aligned on a shared temporal axis and dispatched by the same mixture-of-experts router.
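
A minimal sketch of what that alignment could look like, assuming a single-level Haar decomposition with duplication-based upsampling; the paper does not specify its wavelet family, decomposition depth, or alignment procedure, so `WaveletTokenizer` and every choice below are illustrative rather than the authors' implementation.

```python
# Illustrative only: Haar filters, single-level decomposition, and
# repeat-based upsampling are assumptions; the paper publishes none
# of these details.
import torch
import torch.nn as nn

class WaveletTokenizer(nn.Module):
    """Single-level Haar DWT whose coefficient tokens are upsampled
    back to the input length so they share a time axis with raw tokens."""

    def __init__(self, d_model: int):
        super().__init__()
        # One approximation + one detail coefficient per step -> 2 features.
        self.proj = nn.Linear(2, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, length), length assumed even.
        even, odd = x[:, 0::2], x[:, 1::2]
        approx = (even + odd) / 2 ** 0.5          # low-frequency content
        detail = (even - odd) / 2 ** 0.5          # high-frequency content
        coeffs = torch.stack([approx, detail], dim=-1)   # (B, L/2, 2)
        tokens = self.proj(coeffs)                       # (B, L/2, d)
        # Align to the raw-token axis by repeating each coefficient token.
        return tokens.repeat_interleave(2, dim=1)        # (B, L, d)

x = torch.randn(4, 64)                                   # 4 series, length 64
wave_tokens = WaveletTokenizer(d_model=128)(x)
time_tokens = nn.Linear(1, 128)(x.unsqueeze(-1))
assert wave_tokens.shape == time_tokens.shape            # shared temporal axis
```

Once both streams share the shape (batch, length, d_model), a single router can dispatch them to one expert pool, which is the property the core claim depends on.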

If this is right

  • Explicit wavelet inputs allow the model to represent periodic and localized high-frequency structures more directly than time-domain tokens alone.
  • Shared expert routing maintains specialization across domains while still permitting model capacity to grow.
  • Adding wavelet-domain corpora to pretraining yields measurable forecasting improvements on diverse benchmarks.
  • The approach avoids the need for separate frequency preprocessing pipelines by embedding the information inside the foundation model.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same alignment-and-routing pattern could be tested with other frequency decompositions such as Fourier or empirical mode decomposition.
  • Consistent cross-domain experts might reduce the data hunger of future time-series foundation models by letting one expert pool serve multiple signal representations.
  • Long-horizon forecasts, where repeating cycles dominate, are a natural next test bed for the dual-path idea.
  • If the gains hold, foundation-model designers may shift from purely learned embeddings toward explicit multi-domain token streams.

Load-bearing premise

Routing wavelet tokens through the same experts as time tokens will produce consistent specialization and net forecasting gains rather than conflicting signals.

What would settle it

Retraining the same model on the same 16 datasets with the wavelet path removed, or with the temporal alignment broken, and finding no accuracy lift (or an outright drop) would falsify the claimed benefit of the dual-path design.
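
One way to run that falsification test, sketched under stated assumptions: the shuffle below destroys temporal alignment while preserving the wavelet information itself, and the three-run protocol in the comments is hypothetical, not a procedure the paper describes.

```python
# Hypothetical ablation protocol (not from the paper):
#   run 1: full dual-path model
#   run 2: wavelet path removed entirely (time tokens only)
#   run 3: wavelet path kept, but temporally shuffled as below
# If runs 2 and 3 match run 1 on MSE/MAE across the 16 datasets,
# the claimed benefit of aligned dual-path routing is falsified.
import torch

def break_alignment(wave_tokens: torch.Tensor) -> torch.Tensor:
    """Permute the sequence axis so wavelet tokens no longer line up
    with their corresponding time tokens."""
    perm = torch.randperm(wave_tokens.size(1))
    return wave_tokens[:, perm, :]

tokens = torch.randn(2, 64, 128)
shuffled = break_alignment(tokens)
assert shuffled.shape == tokens.shape   # same content, broken correspondence
```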

Figures

Figures reproduced from arXiv: 2604.10544 by Boxin Li, Dan Li, Erli Meng, Jian Lou, Jiawei Huang, See-kiong Ng, Shunyu Wu, Weibin Feng, Xiao Zhang.

Figure 1
Figure 1: Overview of the proposed WaveMoE model. view at source ↗
Figure 2
Figure 2: Forecast comparisons between WaveMoE and Time-MoE across representative datasets. view at source ↗
Figure 3
Figure 3: Forecast comparisons between WaveMoE and Chronos across representative datasets. view at source ↗
Figure 4
Figure 4: Forecast comparisons between WaveMoE and Timer across representative datasets. view at source ↗
Original abstract

Time series foundation models (TSFMs) have recently achieved remarkable success in universal forecasting by leveraging large-scale pretraining on diverse time series data. Complementing this progress, incorporating frequency-domain information yields promising performance in enhancing the modeling of complex temporal patterns, such as periodicity and localized high-frequency dynamics, which are prevalent in real-world time series. To advance this direction, we propose a new perspective that integrates explicit frequency-domain representations into scalable foundation models, and introduce WaveMoE, a wavelet-enhanced mixture-of-experts foundation model for time series forecasting. WaveMoE adopts a dual-path architecture that jointly processes time series tokens and wavelet tokens aligned along a unified temporal axis, and coordinates them through a shared expert routing mechanism that enables consistent expert specialization while efficiently scaling model capacity. Preliminary experimental results on 16 diverse benchmark datasets indicate that WaveMoE has the potential to further improve forecasting performance by incorporating wavelet-domain corpora.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The manuscript introduces WaveMoE, a wavelet-enhanced mixture-of-experts foundation model for time series forecasting. It proposes a dual-path architecture that jointly processes time-series tokens and wavelet tokens aligned along a unified temporal axis, coordinated via a shared expert routing mechanism intended to enable consistent expert specialization while scaling model capacity. The authors report that preliminary results on 16 diverse benchmark datasets indicate potential improvements in forecasting performance through the incorporation of wavelet-domain information.

Significance. If the empirical claims hold under rigorous evaluation, the work could advance time-series foundation models by providing a scalable mechanism to integrate explicit frequency-domain representations without separate model pathways. The shared-routing design for dual-domain tokens is a conceptually interesting approach to multi-scale temporal modeling that, if validated, might generalize beyond wavelets.

major comments (3)
  1. [Abstract] The central claim that WaveMoE 'has the potential to further improve forecasting performance' rests on 'preliminary experimental results on 16 diverse benchmark datasets,' yet no quantitative metrics, baseline comparisons, ablation studies, error bars, or statistical tests are supplied. This absence leaves the performance contribution unsupported and directly undermines assessment of the architecture's value.
  2. [Methods] Architecture description: The dual-path design asserts that wavelet tokens are 'aligned along a unified temporal axis' and routed through the same experts as time tokens, but no equations, pseudocode, or details are given for the wavelet decomposition, token projection, temporal alignment procedure, or the shared router formulation (e.g., gating function or expert activation). Without these, it is impossible to verify whether alignment preserves scale-specific information or whether shared routing produces coherent specialization rather than interference.
  3. [Experiments] No details are provided on model scale, pretraining corpus, comparison models, or any analysis of expert routing behavior (e.g., activation histograms or specialization metrics across time vs. wavelet tokens). This omission makes the assertion of 'consistent expert specialization' and 'efficiently scaling model capacity' untestable from the manuscript.
minor comments (1)
  1. [Abstract] The abstract would be clearer if it briefly specified the wavelet transform (e.g., discrete wavelet transform family and decomposition levels) used to generate the wavelet tokens.

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We agree that the current version of the manuscript requires substantial additions to support the claims and enable verification of the proposed architecture. We will perform a major revision to incorporate quantitative results, detailed methodological specifications, and experimental analyses as outlined below.

point-by-point responses
  1. Referee: [Abstract] The central claim that WaveMoE 'has the potential to further improve forecasting performance' rests on 'preliminary experimental results on 16 diverse benchmark datasets,' yet no quantitative metrics, baseline comparisons, ablation studies, error bars, or statistical tests are supplied. This absence leaves the performance contribution unsupported and directly undermines assessment of the architecture's value.

    Authors: We agree that the abstract's reference to preliminary results is insufficient without supporting numbers. In the revision we will insert concise quantitative statements (e.g., average MSE reduction and win-rate against baselines across the 16 datasets) while directing readers to the expanded Experiments section for full tables, error bars, ablations, and significance tests. This keeps the abstract within length limits yet makes the performance claim verifiable. revision: yes

  2. Referee: [Methods] Architecture description: The dual-path design asserts that wavelet tokens are 'aligned along a unified temporal axis' and routed through the same experts as time tokens, but no equations, pseudocode, or details are given for the wavelet decomposition, token projection, temporal alignment procedure, or the shared router formulation (e.g., gating function or expert activation). Without these, it is impossible to verify whether alignment preserves scale-specific information or whether shared routing produces coherent specialization rather than interference.

    Authors: We acknowledge the omission of formal specifications. The revised Methods section will include: (1) the exact discrete wavelet transform equations and chosen mother wavelets, (2) linear projection layers that map wavelet coefficients to tokens, (3) the temporal alignment procedure (zero-padding and linear interpolation to a common sequence length), and (4) the shared router formulation (top-k gating function with auxiliary load-balancing loss; an illustrative sketch follows this list). Pseudocode for the dual-path forward pass will be added as an algorithm box to demonstrate preservation of scale information and routing behavior. revision: yes

  3. Referee: [Experiments] No details are provided on model scale, pretraining corpus, comparison models, or any analysis of expert routing behavior (e.g., activation histograms or specialization metrics across time vs. wavelet tokens). This omission makes the assertion of 'consistent expert specialization' and 'efficiently scaling model capacity' untestable from the manuscript.

    Authors: We agree these details are essential. The revised manuscript will add an Experimental Setup subsection specifying model dimensions, number of experts, pretraining corpus size and sources, the full list of baselines, and quantitative routing analyses (activation histograms, specialization scores, and load metrics separated by token type; a sketch of such a histogram analysis follows below). These additions will directly substantiate the claims of consistent specialization and scalable capacity. revision: yes
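
The top-k gating with auxiliary load-balancing loss promised in response 2 could look like the sketch below, which uses a Switch-Transformer-style balancing term. Every shape, the expert count, and k are assumptions; the paper publishes no router formulation.

```python
# Assumed formulation: standard top-k softmax gating plus the
# load-balancing auxiliary loss of Fedus et al. (Switch Transformer).
import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedTopKRouter(nn.Module):
    def __init__(self, d_model: int, n_experts: int, k: int = 2):
        super().__init__()
        self.gate = nn.Linear(d_model, n_experts)
        self.k, self.n_experts = k, n_experts

    def forward(self, tokens: torch.Tensor):
        # tokens: (B, L, d); time and wavelet tokens are concatenated
        # along the sequence axis so both domains share one expert pool.
        probs = F.softmax(self.gate(tokens), dim=-1)   # (B, L, E)
        weights, idx = probs.topk(self.k, dim=-1)      # (B, L, k)
        # Balancing loss: fraction of tokens whose top-1 expert is i,
        # times the mean gate probability assigned to expert i.
        top1 = F.one_hot(idx[..., 0], self.n_experts).float()
        load = top1.mean(dim=(0, 1))
        importance = probs.mean(dim=(0, 1))
        aux_loss = self.n_experts * (load * importance).sum()
        return weights, idx, aux_loss

router = SharedTopKRouter(d_model=128, n_experts=8)
w, idx, aux = router(torch.randn(4, 128, 128))  # pooled time + wavelet tokens
```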
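
The routing analysis promised in response 3 might begin with per-domain activation histograms, as in this sketch; the top-1 index tensor and the wavelet mask are hypothetical inputs, since the paper exposes neither.

```python
# Hypothetical diagnostic: compare which experts fire for time tokens
# versus wavelet tokens. Diverging histograms would suggest domain
# specialization; near-identical ones would suggest shared usage.
import torch

def expert_histograms(top1_idx: torch.Tensor, is_wavelet: torch.Tensor,
                      n_experts: int):
    """Count top-1 expert activations separately per token domain."""
    flat_idx = top1_idx.reshape(-1)
    flat_mask = is_wavelet.reshape(-1)
    time_hist = torch.bincount(flat_idx[~flat_mask], minlength=n_experts)
    wave_hist = torch.bincount(flat_idx[flat_mask], minlength=n_experts)
    return time_hist, wave_hist

idx = torch.randint(0, 8, (4, 128))           # router top-1 choices
mask = torch.zeros(4, 128, dtype=torch.bool)
mask[:, 64:] = True                           # second half = wavelet tokens
t_hist, w_hist = expert_histograms(idx, mask, n_experts=8)
```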

Circularity Check

0 steps flagged

No circularity: architectural proposal with external benchmark support

full rationale

The paper introduces WaveMoE as a dual-path MoE architecture that aligns wavelet tokens to the temporal axis of time-series tokens and routes both through shared experts. This is presented as a design choice whose value is assessed via empirical results on 16 external benchmarks rather than any closed-form derivation, fitted parameter renamed as prediction, or load-bearing self-citation chain. No equations, uniqueness theorems, or ansatzes are invoked that reduce the claimed performance gains to the inputs by construction. The central claim therefore remains an independent architectural hypothesis.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entity

The claim rests on standard deep-learning assumptions about tokenization, expert routing, and the benefit of frequency-domain augmentation, plus the new architectural choice of wavelet token alignment.

axioms (2)
  • domain assumption: Wavelet transforms can be aligned to the same temporal grid as raw time-series tokens without loss of information.
    Invoked by the dual-path architecture description.
  • ad hoc to paper: Shared expert routing will produce consistent specialization across time and wavelet paths.
    Central to the coordination mechanism claimed in the abstract.
invented entities (1)
  • Wavelet tokens (no independent evidence)
    purpose: Explicit frequency-domain representation aligned to the time axis.
    Introduced as the second path in the dual-path architecture.

pith-pipeline@v0.9.0 · 5482 in / 1257 out tokens · 38689 ms · 2026-05-10T15:38:52.612488+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

20 extracted references · 13 canonical work pages · 2 internal anchors

  1. [1] Taha Aksu, Gerald Woo, Juncheng Liu, Xu Liu, Chenghao Liu, Silvio Savarese, Caiming Xiong, and Doyen Sahoo. GIFT-Eval: A benchmark for general time series forecasting model evaluation. arXiv preprint arXiv:2410.10393.

  2. [2] Abdul Fatir Ansari, Oleksandr Shchur, Jaris Küken, Andreas Auer, Boran Han, Pedro Mercado, Syama Sundar Rangapuram, Huibin Shen, Lorenzo Stella, Xiyuan Zhang, et al. Chronos-2: From univariate to universal forecasting. arXiv preprint arXiv:2510.15821.

  3. [3] Ben Cohen, Emaad Khwaja, Kan Wang, Charles Masson, Elise Ramé, Youssef Doubli, and Othmane Abou-Amal. Toto: Time series optimized transformer for observability. arXiv preprint arXiv:2407.07874, 2024.

  4. [4] Junwei Deng, Chang Xu, Jiaqi W Ma, Ming Jin, Chenghao Liu, and Jiang Bian. Oats: Online data augmentation for time series foundation models. arXiv preprint arXiv:2601.19040.

  5. [5] Kun Feng, Shaocheng Lan, Yuchen Fang, Wenchao He, Lintao Ma, Xingyu Lu, and Kan Ren. Kairos: Towards adaptive and generalizable time series foundation models. arXiv preprint arXiv:2509.25826, 2025.

  6. [6] Peiliang Gong, Emadeldeen Eldele, Min Wu, Zhenghua Chen, Xiaoli Li, and Daoqiang Zhang. Bridging distribution gaps in time series foundation model pretraining with prototype-guided normalization. arXiv preprint arXiv:2504.10900.

  7. [7] Lars Graf, Thomas Ortner, Stanisław Woźniak, Angeliki Pantazi, et al. FlowState: Sampling rate invariant time series forecasting. arXiv preprint arXiv:2508.05287.

  8. [8] Shi Bin Hoo, Samuel Müller, David Salinas, and Frank Hutter. From tables to time: How TabPFN-v2 outperforms specialized time series forecasting models. arXiv preprint arXiv:2501.02945.

  9. [9] Md R Kabir, Dipayan Bhadra, Moinul Ridoy, and Mariofanna Milanova. LSTM–Transformer-based robust hybrid deep learning model for financial time series forecasting. Sci, 7(1):7.

  10. [10] Chenghao Liu, Taha Aksu, Juncheng Liu, Xu Liu, Hanshu Yan, Quang Pham, Silvio Savarese, Doyen Sahoo, Caiming Xiong, and Junnan Li. Moirai 2.0: When less is more for time series forecasting. arXiv preprint arXiv:2511.11698, 2025.

  11. [11] Xiaowen Ma, Shuning Ge, Fan Yang, Xiangyu Li, Yun Chen, Mengting Ma, Wei Zhang, and Zhipeng Liu. TimeExpert: Boosting long time series forecasting with temporal mix of experts. arXiv preprint arXiv:2509.23145.

  12. [12] Boris N Oreshkin, Dmitri Carpov, Nicolas Chapados, and Yoshua Bengio. N-BEATS: Neural basis expansion analysis for interpretable time series forecasting. In International Conference on Learning Representations.

  13. [13] Xue Wang, Tian Zhou, Jinyang Gao, Bolin Ding, and Jingren Zhou. Output scaling: YingLong-delayed chain of thought in a large pretrained time series forecasting model. arXiv preprint arXiv:2506.11029.

  14. [14] Shi Xiaoming, Wang Shiyu, Nie Yuqi, Li Dianqi, Ye Zhou, Wen Qingsong, and Ming Jin. Time-MoE: Billion-scale time series foundation models with mixture of experts. In ICLR 2025: The Thirteenth International Conference on Learning Representations.

  15. [15] Qingren Yao, Chao-Han Huck Yang, Renhe Jiang, Yuxuan Liang, Ming Jin, and Shirui Pan. Towards neural scaling laws for time series foundation models. In The Thirteenth International Conference on Learning Representations (ICLR 2025).

  16. [16] Ziyu Zhou, Jiaxi Hu, Qingsong Wen, James T Kwok, and Yuxuan Liang. Multi-order wavelet derivative transform for deep time series forecasting. arXiv preprint arXiv:2505.11781.

  17. [17] Internal anchor (Appendix A, Pretraining Data Construction; A.1 Dataset Composition and Domain Coverage): "WaveMoE is pretrained on a large-scale dataset built upon the Time-300B corpus (Xiaoming et al., 2025). Specifically, the pretraining data are derived from the publicly available Time-300B dataset after domain balan…"

  18. [18] Internal anchor (Missing-Value Handling): "…and ensure that retained windows exhibit sufficient temporal variation, preventing the model from learning trivial or static patterns. Missing-Value Handling. For windows that pass quality filtering, a unified missing-value handling strategy is applied. All NaN and Inf values are replaced with zero to ensure numerical stability. In addition, a correspondin…"

  19. [19] Internal anchor (additional benchmarks): "Performance is evaluated using Mean Squared Error (MSE) and Mean Absolute Error (MAE) for consistent comparison across models. Overall, the results on these additional benchmarks are consistent with the main experimental findings. WaveMoE achieves competitive or leading performance across the majority of datasets and maintains strong stability under diver…"

  20. [20] Internal anchor (qualitative comparison): "…on example time series from three benchmark datasets. Overall, WaveMoE demonstrates closer alignment with the ground-truth series, particularly in terms of peak and trough localization, amplitude reconstruction, and trend consistency. While Time-MoE generally captures the overall trajectory, noticeable deviations remain around extreme values. In cases…"