pith. sign in

arxiv: 2509.15105 · v3 · pith:QLKQSZYInew · submitted 2025-09-18 · 💻 cs.LG

Super-Linear: A Lightweight Pretrained Mixture of Linear Experts for Time Series Forecasting

Pith reviewed 2026-05-25 08:19 UTC · model grok-4.3

classification 💻 cs.LG
keywords time series forecastingmixture of expertslinear modelszero-shot forecastingspectral gatingpretrained modelscomputational efficiencyinterpretability
0
0 comments X

The pith

A mixture of frequency-specialized linear experts matches deep pretrained models in time series forecasting while using far less compute.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Super-Linear as a pretrained mixture-of-experts architecture built from simple linear layers rather than deep networks. Each expert is trained on time series resampled to a particular frequency band so that it specializes in patterns at that scale. A lightweight spectral gating network then chooses which experts to activate for any new input. If the approach holds, it indicates that explicit frequency decomposition plus linear experts can replace the heavy nonlinear feature extraction used in current large forecasting models. This would let accurate zero-shot forecasting run on modest hardware with clearer internal decisions and greater stability when sampling rates change.

Core claim

Super-Linear replaces deep architectures with a collection of linear experts, each trained on data resampled to match a distinct frequency regime, and combines them through a spectral gating mechanism that selects experts on the basis of input frequency content. When pretrained across multiple frequency regimes, the resulting model delivers strong zero-shot performance on standard benchmarks while delivering substantial gains in computational efficiency, robustness to changes in sampling rate, and interpretability.

What carries the argument

Frequency-specialized linear experts selected by a lightweight spectral gating mechanism.

If this is right

  • Forecasting systems can run on edge devices with limited memory and power.
  • Accuracy stays stable when input data arrive at irregular or changed sampling rates.
  • Each prediction can be traced to the specific frequency bands that contributed most.
  • Pretraining large forecasting models becomes feasible on smaller compute budgets.
  • The same linear-expert structure can be retrained quickly for new domains.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same frequency-expert pattern may transfer to anomaly detection or imputation tasks that also rely on periodic structure.
  • One could measure how much performance changes if a small number of nonlinear experts are added to handle strongly chaotic series.
  • The explicit separation into frequency bands supplies a natural route to theoretical error bounds based on Fourier analysis.
  • Models of this form could be used to study which frequency ranges carry the most predictive information across different application domains.

Load-bearing premise

Linear experts that each handle one frequency band, chosen by spectral gating, are enough to capture the structure needed for accurate forecasting on diverse real-world datasets.

What would settle it

A test collection of multivariate series with mixed frequencies on which Super-Linear's average error exceeds that of Chronos or Time-MoE by more than ten percent.

Figures

Figures reproduced from arXiv: 2509.15105 by Hedi Zisling, Liran Nochumsohn, Omri Azencot, Raz Marshanski.

Figure 1
Figure 1. Figure 1: Performance versus inference time trade-off across dif￾ferent prominent pretrained TSF models. generalize to entirely unseen datasets and domains without retraining or fine-tuning (Shi et al., 2024; Liu et al., 2024d;c; Das et al., 2024). Recent work has made significant progress toward this goal. Large pretrained models such as Timer-XL, TimeMoE, and TimesFM (Liu et al., 2024c; Shi et al., 2024; Das et al… view at source ↗
Figure 2
Figure 2. Figure 2: Left: Forecasting performance of linear models on 12 sine-wave datasets with varying frequencies and added random walk noise. Performance improves progressively with more experts. Right: Weight sensitivity to seasonal lags—training on datasets with different seasonality (e.g., Births vs. Electricity) leads to divergent weight structures, suboptimal when shared. to time series forecasting, beginning with th… view at source ↗
Figure 3
Figure 3. Figure 3: Super-Linear architecture overview. A frequency-aware gating router computes sparse scores from the input frequencies, dynamically selecting a subset of linear experts (1) including pre-trained frequency specialists, complementary modules, and heuristic naïve and mean baselines (2) whose predictions are combined to produce the final forecast (3). 4. Experiments 4.1. Experimental Details Pre-training Datase… view at source ↗
Figure 4
Figure 4. Figure 4: Super-Linear training framework. Data is resampled to enrich frequency diversity. Stage 1: Each expert is trained independently on a predefined frequency ωi. Stage 2: The router and complementary layers are trained with frozen experts to enable dynamic expert selection. 0.0 0.2 0.4 0.6 0.8 1.0 1.2 MASE GIFT-Eval Sundial 0.750 TimesFM-2.0 0.758 Chronos bolt-b 0.808 Chronos bolt-s 0.822 Super-Linear 0.857 Vi… view at source ↗
Figure 5
Figure 5. Figure 5: GIFT-Eval performance and parameter count of Super￾Linear compared to prominent foundation models in TSF (Creation Date: 01/08/2025). The MASE score represents the geometric mean MASE across datasets, normalized by the seasonal-naive. best scores in MSE and 16 best scores in mean absolute error (MAE), compared to Timer-XL’s 4 and 0 best scores. Remarkably, Super-Linear accomplishes this while being only 3%… view at source ↗
Figure 6
Figure 6. Figure 6: Dataset pre-training distribution for sampling rate or frequency. It is shown that other models rely heavily on hourly-sampled data. The frequencies for Super-Linear are given in App. 12 [PITH_FULL_IMAGE:figures/full_fig_p008_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Top-k expert distribution for different datasets using the Super-Linear pretrained model. ment, we assess the performance of a Super-Linear model pretrained with k = 12, comparing both inference-time ad￾justments (red curve) and retraining with different k values (blue curve). Results show dataset-dependent behavior, with a preference for the retrained setup for ETTh1, ETTm2, and Weather; however, adjustin… view at source ↗
Figure 8
Figure 8. Figure 8: Top-k expert selection comparison. Performance across varying top-k expert settings. Blue and red vertical lines indicate the k values used during pretraining and inference, respectively. • E⊥ bounds the (frequency)-approximation error: the signal energy outside the span of the N expert frequen￾cies. • The second term, which bounds the estimation error, penalises how far the learned gating G is from the op… view at source ↗
Figure 9
Figure 9. Figure 9: Complementary layer ablation for zero-shot evaluations across various datasets. The value 0 reflects Super-Linear with no complementary layers, whereas 12 represents Super-Linear with 12 complementary layers. A.4. Complementary Experts Ablation In this section, we evaluate the impact of the number of complementary layers used during pre-training across various datasets. The original pre-trained model, whos… view at source ↗
Figure 10
Figure 10. Figure 10: Super-Linear forecasting illustration with varying input lookbacks for the Electricity dataset. of Super-Linear’s frequency experts. It selects an appropriate resampling factor that shifts the input frequencies into a space the model can represent, while preserving the signal’s energy to avoid information loss. 17 [PITH_FULL_IMAGE:figures/full_fig_p017_10.png] view at source ↗
Figure 11
Figure 11. Figure 11: Super-Linear forecasting illustration with varying input lookbacks for the Weather dataset. 19 [PITH_FULL_IMAGE:figures/full_fig_p019_11.png] view at source ↗
Figure 12
Figure 12. Figure 12: The common frequencies utilized by Super-Linear and their corresponding sampling rates. Each frequency expert in Super￾Linear is learned with a frequency depicted in this figure, and each frequency is associated with one or more natural sampling rates. 24 [PITH_FULL_IMAGE:figures/full_fig_p024_12.png] view at source ↗
Figure 13
Figure 13. Figure 13: Visualization of the Super-Linear frequency and complementary expert weights. 28 [PITH_FULL_IMAGE:figures/full_fig_p028_13.png] view at source ↗
read the original abstract

Time series forecasting (TSF) is critical in domains like energy, finance, healthcare, and logistics, requiring models that generalize across diverse datasets. Large pre-trained models such as Chronos and Time-MoE show strong zero-shot (ZS) performance but suffer from high computational costs. In this work, we introduce Super-Linear, a lightweight and scalable mixture-of-experts (MoE) model for general forecasting. It replaces deep architectures with simple frequency-specialized linear experts, trained on resampled data across multiple frequency regimes. A lightweight spectral gating mechanism dynamically selects relevant experts, enabling efficient, accurate forecasting. Despite its simplicity, Super-Linear demonstrates strong performance across benchmarks, while substantially improving efficiency, robustness to sampling rates, and interpretability. The implementation of Super-Linear is available at: \href{https://github.com/azencot-group/SuperLinear}{https://github.com/azencot-group/SuperLinear}.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper introduces Super-Linear, a lightweight pretrained mixture-of-experts model for time series forecasting. It replaces deep nonlinear architectures with frequency-specialized linear experts trained on resampled data across multiple frequency regimes, combined with a lightweight spectral gating mechanism for dynamic expert selection. The central claim is that this simple construction achieves strong benchmark performance while substantially improving efficiency, robustness to sampling rates, and interpretability relative to large pretrained models such as Chronos and Time-MoE. The implementation is released on GitHub.

Significance. If the empirical results hold under rigorous verification, the work could meaningfully advance efficient time series forecasting by showing that a linear MoE with frequency specialization and spectral gating can compete with deep models on diverse real-world data. This would reduce computational barriers and improve interpretability in domains like energy and finance. The open-source code is a clear strength for reproducibility and follow-on research.

major comments (1)
  1. [§3] §3 (Method): The central claim that frequency-specialized linear experts plus spectral gating suffice to match or exceed deep pretrained models' generalization rests on the unverified assumption that the linear span can capture regime shifts and non-stationary nonlinearities without nonlinear feature extraction. No derivation, approximation bound, or analysis of when this holds is provided, leaving the load-bearing sufficiency argument unsupported beyond the empirical comparisons.
minor comments (1)
  1. [Abstract] Abstract: Key quantitative results (e.g., MAE or MSE deltas versus baselines, parameter counts, inference times) should be included to allow readers to assess the strength of the performance claims without immediately consulting the full experimental section.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the constructive comment on the methodological foundations of Super-Linear. We address the concern regarding the lack of theoretical support for the sufficiency of linear experts below.

read point-by-point responses
  1. Referee: [§3] §3 (Method): The central claim that frequency-specialized linear experts plus spectral gating suffice to match or exceed deep pretrained models' generalization rests on the unverified assumption that the linear span can capture regime shifts and non-stationary nonlinearities without nonlinear feature extraction. No derivation, approximation bound, or analysis of when this holds is provided, leaving the load-bearing sufficiency argument unsupported beyond the empirical comparisons.

    Authors: We agree that the manuscript provides no derivation, approximation bound, or formal analysis establishing when the linear span of frequency-specialized experts is sufficient to capture regime shifts and non-stationary nonlinearities. The central motivation for the architecture is empirical: frequency-domain resampling and spectral gating allow each linear expert to specialize on distinct periodic components, which our experiments show generalizes competitively with deep models across heterogeneous benchmarks. We have revised Section 3 to (i) state the assumption explicitly, (ii) add a short discussion of its scope and limitations, and (iii) clarify that all generalization claims rest on the reported empirical results rather than theoretical guarantees. A rigorous theoretical characterization is left for future work. revision: partial

Circularity Check

0 steps flagged

No circularity: empirical model with benchmark claims, no derivation chain

full rationale

The paper presents an architectural design (frequency-specialized linear experts + spectral gating) trained on resampled data and evaluated empirically on forecasting benchmarks. No mathematical derivation, first-principles result, or prediction is claimed that reduces by construction to fitted parameters, self-citations, or renamed inputs. Performance statements rest on external comparisons rather than self-referential equations. This is the expected non-finding for an applied ML methods paper without a load-bearing theoretical chain.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review yields no explicit free parameters, axioms, or invented entities; all modeling choices remain implicit.

pith-pipeline@v0.9.0 · 5701 in / 1071 out tokens · 28592 ms · 2026-05-25T08:19:46.191656+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. XCTFormer: Leveraging Cross-Channel and Cross-Time Dependencies for Enhanced Time-Series Analysis

    cs.LG 2026-05 unverdicted novelty 5.0

    XCTFormer is a channel-dependent transformer that uses token-to-token cross-relational attention and an optional compression plugin to capture cross-channel and cross-time dependencies, reporting SOTA imputation resul...

Reference graph

Works this paper leans on

60 extracted references · 60 canonical work pages · cited by 1 Pith paper · 6 internal anchors

  1. [1]

    write newline

    " write newline "" before.all 'output.state := FUNCTION n.dashify 't := "" t empty not t #1 #1 substring "-" = t #1 #2 substring "--" = not "--" * t #2 global.max substring 't := t #1 #1 substring "-" = "-" * t #2 global.max substring 't := while if t #1 #1 substring * t #2 global.max substring 't := if while FUNCTION format.date year duplicate empty "emp...

  2. [2]

    Gift-eval: A benchmark for general time series forecasting model evaluation

    Aksu, T., Woo, G., Liu, J., Liu, X., Liu, C., Savarese, S., Xiong, C., and Sahoo, D. Gift-eval: A benchmark for general time series forecasting model evaluation. arxiv preprint arxiv:2410.10393, 2024

  3. [3]

    Chronos: Learning the Language of Time Series

    Ansari, A. F., Stella, L., Turkmen, C., Zhang, X., Mercado, P., Shen, H., Shchur, O., Rangapuram, S. S., Arango, S. P., Kapoor, S., et al. Chronos: Learning the language of time series. arXiv preprint arXiv:2403.07815, 2024

  4. [4]

    Uci machine learning repository, 2007

    Asuncion, A., Newman, D., et al. Uci machine learning repository, 2007

  5. [5]

    A survey on mixture of experts

    Cai, W., Jiang, J., Wang, F., Tang, J., Kim, S., and Huang, J. A survey on mixture of experts. arXiv preprint arXiv:2407.06204, 2024

  6. [6]

    O., Pfister, T., Zheng, Y., Ye, W., and Liu, Y

    Cao, D., Jia, F., Arik, S. O., Pfister, T., Zheng, Y., Ye, W., and Liu, Y. Tempo: Prompt-based generative pre-trained transformer for time series forecasting. arXiv preprint arXiv:2310.04948, 2023

  7. [7]

    G., Oreshkin, B

    Challu, C., Olivares, K. G., Oreshkin, B. N., Ramirez, F. G., Canseco, M. M., and Dubrawski, A. N-HiTS: Neural Hierarchical Interpolation for Time Series Forecasting . In Proceedings of the AAAI Conference on Artificial Intelligence, 2023

  8. [8]

    J., Sun, J., and Liu, C

    Chen, M., Shen, L., Li, Z., Wang, X. J., Sun, J., and Liu, C. Visionts: Visual masked autoencoders are free-lunch zero-shot time series forecasters, 2024. URL https://arxiv.org/abs/2408.17253

  9. [9]

    A decoder-only foundation model for time-series forecasting

    Das, A., Kong, W., Sen, R., and Zhou, Y. A decoder-only foundation model for time-series forecasting. In Forty-first International Conference on Machine Learning, 2024

  10. [10]

    M., Reddy, C., and Kalagnanam, J

    Ekambaram, V., Jati, A., Dayama, P., Mukherjee, S., Nguyen, N., Gifford, W. M., Reddy, C., and Kalagnanam, J. Tiny time mixers (ttms): Fast pre-trained models for enhanced zero/few-shot forecasting of multivariate time series. Advances in Neural Information Processing Systems, 37: 0 74147--74181, 2024

  11. [11]

    I., Hyndman, R

    Godahewa, R., Bergmeir, C., Webb, G. I., Hyndman, R. J., and Montero-Manso, P. Monash time series forecasting archive. In Neural Information Processing Systems Track on Datasets and Benchmarks, 2021

  12. [12]

    Moment: A family of open time-series foundation models

    Goswami, M., Szafer, K., Choudhry, A., Cai, Y., Li, S., and Dubrawski, A. Moment: A family of open time-series foundation models. arXiv preprint arXiv:2402.03885, 2024

  13. [13]

    A., Jordan, M

    Jacobs, R. A., Jordan, M. I., Nowlan, S. J., and Hinton, G. E. Adaptive mixtures of local experts. Neural computation, 3 0 (1): 0 79--87, 1991

  14. [14]

    Modeling long-and short-term temporal patterns with deep neural networks

    Lai, G., Chang, W.-C., Yang, Y., and Liu, H. Modeling long-and short-term temporal patterns with deep neural networks. In The 41st international ACM SIGIR conference on research & development in information retrieval, pp.\ 95--104, 2018

  15. [15]

    Revisiting Long-term Time Series Forecasting: An Investigation on Linear Mapping

    Li, Z., Qi, S., Li, Y., and Xu, Z. Revisiting long-term time series forecasting: An investigation on linear mapping. arXiv preprint arXiv:2305.10721, 2023

  16. [16]

    Foundation models for time series analysis: A tutorial and survey

    Liang, Y., Wen, H., Nie, Y., Jiang, Y., Jin, M., Song, D., Pan, S., and Wen, Q. Foundation models for time series analysis: A tutorial and survey. In Proceedings of the 30th ACM SIGKDD conference on knowledge discovery and data mining, pp.\ 6555--6565, 2024

  17. [17]

    \"O ., Loeff, N., and Pfister, T

    Lim, B., Ar k, S. \"O ., Loeff, N., and Pfister, T. Temporal fusion transformers for interpretable multi-horizon time series forecasting. International Journal of Forecasting, 37 0 (4): 0 1748--1764, 2021

  18. [18]

    Cyclenet: enhancing time series forecasting through modeling periodic patterns

    Lin, S., Lin, W., Hu, X., Wu, W., Mo, R., and Zhong, H. Cyclenet: enhancing time series forecasting through modeling periodic patterns. Advances in Neural Information Processing Systems, 37: 0 106315--106345, 2024 a

  19. [19]

    Sparsetsf: Modeling long-term time series forecasting with 1k parameters

    Lin, S., Lin, W., Wu, W., Chen, H., and Yang, J. Sparsetsf: Modeling long-term time series forecasting with 1k parameters. arXiv preprint arXiv:2405.00946, 2024 b

  20. [20]

    Unitst: Effectively modeling inter-series and intra-series dependencies for multivariate time series forecasting

    Liu, J., Liu, C., Woo, G., Wang, Y., Hooi, B., Xiong, C., and Sahoo, D. Unitst: Effectively modeling inter-series and intra-series dependencies for multivariate time series forecasting. arXiv preprint arXiv:2406.04975, 2024 a

  21. [21]

    Scinet: Time series modeling and forecasting with sample convolution and interaction

    Liu, M., Zeng, A., Chen, M., Xu, Z., Lai, Q., Ma, L., and Xu, Q. Scinet: Time series modeling and forecasting with sample convolution and interaction. Advances in Neural Information Processing Systems, 35: 0 5816--5828, 2022

  22. [22]

    Unitime: A language-empowered unified model for cross-domain time series forecasting

    Liu, X., Hu, J., Li, Y., Diao, S., Liang, Y., Hooi, B., and Zimmermann, R. Unitime: A language-empowered unified model for cross-domain time series forecasting. In Proceedings of the ACM Web Conference 2024, pp.\ 4095--4106, 2024 b

  23. [23]

    iTransformer: Inverted Transformers Are Effective for Time Series Forecasting

    Liu, Y., Hu, T., Zhang, H., Wu, H., Wang, S., Ma, L., and Long, M. itransformer: Inverted transformers are effective for time series forecasting. arXiv preprint arXiv:2310.06625, 2023

  24. [24]

    Timer-xl: Long-context transformers for unified time series forecasting

    Liu, Y., Qin, G., Huang, X., Wang, J., and Long, M. Timer-xl: Long-context transformers for unified time series forecasting. arXiv preprint arXiv:2410.04803, 2024 c

  25. [25]

    Timer: Generative pre-trained transformers are large time series models

    Liu, Y., Zhang, H., Li, C., Huang, X., Wang, J., and Long, M. Timer: Generative pre-trained transformers are large time series models. arXiv preprint arXiv:2402.02368, 2024 d

  26. [26]

    Sundial: A Family of Highly Capable Time Series Foundation Models

    Liu, Y., Qin, G., Shi, Z., Chen, Z., Yang, C., Huang, X., Wang, J., and Long, M. Sundial: A family of highly capable time series foundation models. arXiv preprint arXiv:2502.00816, 2025

  27. [27]

    Freqmoe: Enhancing time series forecasting through frequency decomposition mixture of experts

    Liu, Z. Freqmoe: Enhancing time series forecasting through frequency decomposition mixture of experts. arXiv preprint arXiv:2501.15125, 2025

  28. [28]

    Mancuso, P., Piccialli, V., and Sudoso, A. M. A machine learning approach for forecasting hierarchical time series. Expert Systems with Applications, 182: 0 115102, 2021

  29. [29]

    Mixture-of-linear-experts for long-term time series forecasting

    Ni, R., Lin, Z., Wang, S., and Fanti, G. Mixture-of-linear-experts for long-term time series forecasting. In International Conference on Artificial Intelligence and Statistics, pp.\ 4672--4680. PMLR, 2024

  30. [30]

    H., Sinthong, P., and Kalagnanam, J

    Nie, Y., Nguyen, N. H., Sinthong, P., and Kalagnanam, J. A Time Series is Worth 64 Words: Long-term Forecasting with Transformers . In The Eleventh International Conference on Learning Representations, ICLR , 2023

  31. [31]

    A multi-task learning approach to linear multivariate forecasting

    Nochumsohn, L., Zisling, H., and Azencot, O. A multi-task learning approach to linear multivariate forecasting. In International Conference on Artificial Intelligence and Statistics. PMLR, 202t

  32. [32]

    A., Ott, E., Pomerance, A., Hunt, B., and Girvan, M

    Norton, D. A., Ott, E., Pomerance, A., Hunt, B., and Girvan, M. Tailored forecasting from short time series via meta-learning. arXiv preprint arXiv:2501.16325, 2025

  33. [33]

    H., Dayama, P., Sindhgatta, R., Mohapatra, P., et al

    Palaskar, S., Ekambaram, V., Jati, A., Gantayat, N., Saha, A., Nagar, S., Nguyen, N. H., Dayama, P., Sindhgatta, R., Mohapatra, P., et al. Automixer for improved multivariate time-series forecasting on business and it observability data. In Proceedings of the AAAI conference on artificial intelligence, pp.\ 22962--22968, 2024

  34. [34]

    PyTorch: An Imperative Style, High-Performance Deep Learning Library

    Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., et al. PyTorch: An Imperative Style, High-Performance Deep Learning Library . Advances in Neural Information Processing Systems, 32, 2019

  35. [35]

    Kedformer: Knowledge extraction seasonal trend decomposition for long-term sequence prediction

    Qin, Z., Wei, B., Gao, C., and Ni, J. Kedformer: Knowledge extraction seasonal trend decomposition for long-term sequence prediction. arXiv preprint arXiv:2412.05421, 2024

  36. [36]

    Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., Zhou, Y., Li, W., and Liu, P. J. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of machine learning research, 21 0 (140): 0 1--67, 2020

  37. [37]

    DeepAR : Probabilistic forecasting with autoregressive recurrent networks

    Salinas, D., Flunkert, V., Gasthaus, J., and Januschowski, T. DeepAR : Probabilistic forecasting with autoregressive recurrent networks. International journal of forecasting, 36 0 (3): 0 1181--1191, 2020

  38. [38]

    On the investigation of hidden periodicities with application to a supposed 26 day period of meteorological phenomena

    Schuster, A. On the investigation of hidden periodicities with application to a supposed 26 day period of meteorological phenomena. Terrestrial Magnetism, 3 0 (1): 0 13--41, 1898

  39. [39]

    and Ben-David, S

    Shalev-Shwartz, S. and Ben-David, S. Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, Cambridge, 1 edition, 2014

  40. [40]

    Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer

    Shazeer, N., Mirhoseini, A., Maziarz, K., Davis, A., Le, Q., Hinton, G., and Dean, J. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. arXiv preprint arXiv:1701.06538, 2017

  41. [41]

    Statistical characterization of business-critical workloads hosted in cloud datacenters

    Shen, S., Van Beek, V., and Iosup, A. Statistical characterization of business-critical workloads hosted in cloud datacenters. In 2015 15th IEEE/ACM international symposium on cluster, cloud and grid computing, pp.\ 465--474. IEEE, 2015

  42. [42]

    Time-moe: Billion-scale time series foundation models with mixture of experts

    Shi, X., Wang, S., Nie, Y., Li, D., Ye, Z., Wen, Q., and Jin, M. Time-moe: Billion-scale time series foundation models with mixture of experts. arXiv preprint arXiv:2409.16040, 2024

  43. [43]

    Shumway, R. H. and Stoffer, D. S. Time series analysis and its applications, volume 3. Springer, 2000

  44. [44]

    Taylor, S. J. and Letham, B. Forecasting at scale. The American Statistician, 72 0 (1): 0 37--45, 2018

  45. [45]

    Forecasting monthly and quarterly time series using stl decomposition

    Theodosiou, M. Forecasting monthly and quarterly time series using stl decomposition. International Journal of Forecasting, 27 0 (4): 0 1178--1195, 2011

  46. [46]

    and Darlow, L

    Toner, W. and Darlow, L. An analysis of linear time series forecasting models. arXiv preprint arXiv:2403.14587, 2024

  47. [47]

    N., Kaiser, ., and Polosukhin, I

    Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, ., and Polosukhin, I. Attention is all you need. Advances in neural information processing systems, 30, 2017

  48. [48]

    Wang, J., Jiang, J., Jiang, W., Han, C., and Zhao, W. X. Towards efficient and comprehensive urban spatial-temporal prediction: A unified library and performance benchmark. arXiv e-prints, pp.\ arXiv--2304, 2023

  49. [49]

    Towards a general time series forecasting model with unified representation and adaptive transfer

    Wang, Y., Qiu, Y., Chen, P., Zhao, K., Shu, Y., Rao, Z., Pan, L., Yang, B., and Guo, C. Towards a general time series forecasting model with unified representation and adaptive transfer. In Forty-second International Conference on Machine Learning

  50. [50]

    Woo, G., Liu, C., Sahoo, D., Kumar, A., and Hoi, S. C. H. CoST : Contrastive learning of disentangled seasonal-trend representations for time series forecasting. In The Tenth International Conference on Learning Representations, ICLR . OpenReview.net, 2022

  51. [51]

    Unified training of universal time series forecasting transformers

    Woo, G., Liu, C., Kumar, A., Xiong, C., Savarese, S., and Sahoo, D. Unified training of universal time series forecasting transformers. In Forty-first International Conference on Machine Learning, ICML . OpenReview.net, 2024

  52. [52]

    Autoformer: Decomposition Transformers with Auto-Correlation for Long-Term Series Forecasting

    Wu, H., Xu, J., Wang, J., and Long, M. Autoformer: Decomposition Transformers with Auto-Correlation for Long-Term Series Forecasting . Advances in Neural Information Processing Systems, 2021

  53. [53]

    TimesNet: Temporal 2D-Variation Modeling for General Time Series Analysis

    Wu, H., Hu, T., Liu, Y., Zhou, H., Wang, J., and Long, M. Timesnet: Temporal 2d-variation modeling for general time series analysis. arXiv preprint arXiv:2210.02186, 2022

  54. [54]

    Fits: Modeling time series with 10 k parameters

    Xu, Z., Zeng, A., and Xu, Q. Fits: Modeling time series with 10 k parameters. arXiv preprint arXiv:2307.03756, 2023

  55. [55]

    Time series prediction using mixtures of experts

    Zeevi, A., Meir, R., and Adler, R. Time series prediction using mixtures of experts. Advances in neural information processing systems, 9, 1996

  56. [56]

    Are transformers effective for time series forecasting? In Proceedings of the AAAI conference on artificial intelligence, pp.\ 11121--11128, 2023

    Zeng, A., Chen, M., Zhang, L., and Xu, Q. Are transformers effective for time series forecasting? In Proceedings of the AAAI conference on artificial intelligence, pp.\ 11121--11128, 2023

  57. [57]

    A comprehensive survey on pretrained foundation models: A history from bert to chatgpt

    Zhou, C., Li, Q., Li, C., Yu, J., Liu, Y., Wang, G., Zhang, K., Ji, C., Yan, Q., He, L., et al. A comprehensive survey on pretrained foundation models: A history from bert to chatgpt. International Journal of Machine Learning and Cybernetics, pp.\ 1--65, 2024

  58. [58]

    Informer: Beyond Efficient Transformer for Long Sequence Time-Series Forecasting

    Zhou, H., Zhang, S., Peng, J., Zhang, S., Li, J., Xiong, H., and Zhang, W. Informer: Beyond Efficient Transformer for Long Sequence Time-Series Forecasting . In Proceedings of the AAAI conference on artificial intelligence, 2021

  59. [59]

    FEDformer: Frequency Enhanced Decomposed Transformer for Long-term Series Forecasting

    Zhou, T., Ma, Z., Wen, Q., Wang, X., Sun, L., and Jin, R. FEDformer: Frequency Enhanced Decomposed Transformer for Long-term Series Forecasting . In International Conference on Machine Learning. PMLR, 2022

  60. [60]

    One fits all: Power general time series analysis by pretrained lm

    Zhou, T., Niu, P., Sun, L., Jin, R., et al. One fits all: Power general time series analysis by pretrained lm. Advances in neural information processing systems, 36: 0 43322--43355, 2023