Olivia: Harmonizing Time Series Foundation Models with Power Spectral Density

Alex Xing Wang; Jingru Fei; Kun Yi; Qingsong Wen; Wei Fan; Xiangxiang Zhu

arxiv: 2605.17340 · v2 · pith:UAXVANY3new · submitted 2026-05-17 · 💻 cs.LG

Olivia: Harmonizing Time Series Foundation Models with Power Spectral Density

Jingru Fei , Kun Yi , Alex Xing Wang , Qingsong Wen , Xiangxiang Zhu , Wei Fan This is my paper

Pith reviewed 2026-05-20 14:37 UTC · model grok-4.3

classification 💻 cs.LG

keywords time series foundation modelspower spectral densitydataset harmonizationHarmonizer moduleHarmonicAttentionforecastingpretrainingspectral domain

0 comments

The pith

Olivia harmonizes time series datasets through power spectral density matching to produce stronger transferable representations for foundation model pretraining.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that mismatches in temporal patterns across heterogeneous time series datasets limit the quality of pretraining for foundation models. By reformulating dataset harmonization as an implicit reshaping of normalized power spectral densities, a new Harmonizer module is introduced that aligns second-order temporal correlations without direct intractable optimization. This alignment enables a compact HarmonicAttention mechanism that performs self-attention through a small set of resonators in a low-dimensional space. The resulting model Olivia is then shown to deliver consistent gains over prior approaches when evaluated in zero-shot, few-shot, and full-shot forecasting regimes across multiple large benchmarks.

Core claim

Harmonizing datasets via PSDs in the spectral domain reduces mismatches and enhances pretraining effectiveness. The Harmonizer module reshapes spectral structures and implicitly harmonizes PSDs across datasets, which theoretically corresponds to a shared reparameterization of second-order temporal correlations. Token interactions with the Harmonizer can be efficiently mediated by a compact set of resonators, motivating HarmonicAttention that performs self-attention in a low-dimensional interaction space. Olivia, built on these mechanisms, achieves state-of-the-art performance on TSLib, GIFT-Eval, and six additional GluonTS datasets under zero-shot, few-shot, and full-shot forecasting.

What carries the argument

The Harmonizer module, which reshapes spectral structures to implicitly harmonize normalized power spectral densities and thereby reparameterize second-order temporal correlations across datasets.

If this is right

Harmonizer enables more effective pretraining on heterogeneous collections by aligning spectral properties without explicit pairwise optimization.
HarmonicAttention reduces token interaction complexity to a low-dimensional resonator space while preserving the benefits of the harmonized representations.
The same reparameterization of second-order correlations supports consistent gains in zero-shot, few-shot, and full-shot regimes on both TSLib and GIFT-Eval.
The approach generalizes across six additional GluonTS datasets without domain-specific retraining.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The resonator-based attention could be tested for transfer to non-time-series sequence tasks that also exhibit frequency structure.
If the spectral alignment proves robust, the same Harmonizer could be inserted into existing foundation models to improve their cross-domain performance with minimal added cost.
A direct comparison of learned embeddings before and after Harmonizer application on mixed-domain batches would quantify how much the second-order correlation reparameterization actually reduces distribution shift.

Load-bearing premise

The assumption that harmonizing datasets via PSDs in the spectral domain reduces mismatches and enhances pretraining effectiveness.

What would settle it

Training an otherwise identical foundation model without the Harmonizer on the TSLib benchmark and measuring whether zero-shot and few-shot forecasting metrics remain equal to or exceed those reported for Olivia would directly test the contribution of the PSD harmonization step.

Figures

Figures reproduced from arXiv: 2605.17340 by Alex Xing Wang, Jingru Fei, Kun Yi, Qingsong Wen, Wei Fan, Xiangxiang Zhu.

**Figure 1.** Figure 1: (a): Original and Harmonizer-processed time series from different domains. Raw time series from multiple domains (left) show distinct temporal patterns. After processing (right), patterns show more consistent across domains. (b): Jensen–Shannon (JS) divergence between dataset-level normalized power spectral density distributions across different domain pairs. JS divergence is bounded in [0, ln 2] (approxim… view at source ↗

**Figure 2.** Figure 2: The overall architecture of Olivia. Olivia adopts an encoder–decoder architecture centered on (i) the Harmonizer, which consists of an Aligner and a Restorer to perform bidirectional temporal reorganization across heterogeneous datasets, and (ii) the HarmonicFormer, a scalable backbone that utilizes HarmonicAttention for efficient temporal dependency modeling within a compressed subspace. Kullback–Leibler … view at source ↗

**Figure 3.** Figure 3: Full-shot results on the ETTh2, ETTm2, Weather and Traffic datasets. There ported results are averaged across all prediction lengths. Full details in Appendix G.2. such as Electricity and Traffic, and substantially outperforms the large-scale Time-MoE family, reducing MSE by an average of 26.3% and MAE by 14.6% on the reported datasets. Similar trends are observed on datasets provided by GluonTS (see [P… view at source ↗

**Figure 4.** Figure 4: Visualization of the second-order temporal correlation comparison between original data and those processed by Aligner. This figure illustrates the temporal correlation of raw time series from diverse pretraining datasets spanning multiple domains (Left), together with the corresponding correlation patterns after processing with the Aligner in the Harmonizer. Comparable structures are further observed on d… view at source ↗

**Figure 5.** Figure 5: Ablations of Harmonizer under the zero-shot scenario. 96 192 336 720 Prediction Length 0.380 0.390 0.400 0.410 0.420 0.430 0.440 0.450 0.460 MSE Value 0.382 0.404 0.415 0.441 0.389 0.414 0.426 + Harmonizer 0.450 original model (a) ETTh1 96 192 336 720 Prediction Length 0.160 0.180 0.200 0.220 0.240 MSE Value 0.161 0.172 0.190 0.232 0.169 0.183 0.199 + Harmonizer 0.239 original model (b) Electricity [PITH_… view at source ↗

**Figure 6.** Figure 6: Effectiveness of the proposed Harmonizer on SEMPO. Both the original SEMPO and the Harmonizer-enhanced SEMPO are trained from scratch and evaluated under the zero-shot setting. consistently leads to degraded performance across ETTh1, Electricity, and Weather. This suggests that the observed gains are not merely due to attention expressiveness, but rather stem from the structured low-dimensional interactio… view at source ↗

**Figure 7.** Figure 7: Overview of Olivia’s training and inference pipeline across the pretraining, tuning, and inference stages. Tuning stage. During the tuning stage, Olivia adapts the pretrained representations to task-oriented forecasting while retaining the PSD-consistent temporal organization learned during pretraining. As illustrated in [PITH_FULL_IMAGE:figures/full_fig_p015_7.png] view at source ↗

**Figure 8.** Figure 8: Visualization of temporal patterns with and without StandardScaler1 on sequences sampled from the same dataset in the Energy domain. Time series segments of length T = 512 are extracted starting from different time indices ti ∈ {10, 600, 1300}. The top row shows sequences after applying StandardScaler, while the bottom row shows the corresponding raw sequences without scaling. Although StandardScaler resca… view at source ↗

**Figure 9.** Figure 9: Visualization of temporal correlation. This figure illustrates the temporal correlation of raw time series from different domains in the pretraining datasets, as well as the corresponding correlation patterns after processing with the Aligner in the Harmonizer. F.2. Weight Visualization of Learnable Orthogonal Matrix Q [PITH_FULL_IMAGE:figures/full_fig_p020_9.png] view at source ↗

**Figure 10.** Figure 10: visualizes the learned orthogonal matrix Q, where the structured patterns can be observed beyond a trivial identitylike form. While the dominant diagonal reflects the orthogonality constraint, the presence of non-negligible off-diagonal components indicates that Q performs non-trivial temporal reorganization rather than simple time-wise preservation. To further examine how such structure is constructed, … view at source ↗

**Figure 11.** Figure 11: Visualization of learnable Householder Reflections {Hk} at different reflection indices k. The structured patterns across reflections illustrate how successive learnable Householder transformations contribute to the construction of the overall orthogonal matrix Q. To better reveal the learned transformation patterns, each reflection is shown relative to the identity matrix. G. Supplementary Results G.1. Z… view at source ↗

**Figure 12.** Figure 12: Visualization of prediction results on the ETTm2 dataset under the zero-shot setting, with an input length of 512 and a prediction horizon of 336. is crucial for effective few-shot adaptation. Relative to LLM-based models, including Time-LLM, GPT4TS, and S 2 IPLLM, Olivia delivers substantially more stable and competitive performance across all benchmarks. While LLM-based approaches benefit from strong l… view at source ↗

**Figure 13.** Figure 13: Visualization of prediction results on the Weather dataset under the zero-shot setting, with an input length of 512 and a prediction horizon of 336. A similar pattern is observed for SEMPO, where SEMPOB (T = 512, 6.5M parameters) often matches or outperforms its larger counterparts SEMPOE (T = 1024, 7.3M parameters) and SEMPOA (T = 1536, 9.9M parameters). Notably, despite using fewer parameters, Olivia va… view at source ↗

**Figure 14.** Figure 14: Scaling analysis under the zero-shot setting, considering both model size and the scale of pretraining datasets. The reported results are averaged across all prediction lengths. G.5. Parameter Sensitivity Analysis The number of learnable Householder reflections K [PITH_FULL_IMAGE:figures/full_fig_p025_14.png] view at source ↗

**Figure 15.** Figure 15: Parameter sensitivity analysis on the number K of learnable Householder reflections Hk used to construct the orthogonal matrix Q. The reported results are averaged across all prediction lengths [PITH_FULL_IMAGE:figures/full_fig_p028_15.png] view at source ↗

read the original abstract

Time series foundation models rely on large-scale pretraining over diverse datasets across domains, yet their heterogeneity in temporal patterns could hinder the effectiveness of training and learning transferable time series representations. Inspired a fundamental concept, normalized power spectral density (PSD) in signal processing, we assume harmonizing datasets via PSDs in the spectral domain could reduce mismatches and enhance pretraining. We then go beyond the direct intractable minimization optimization and innovatively reformulate it as a principled harmonization approach. Specifically, we propose Harmonizer, a module that reshapes spectral structures and implicitly harmonizing PSDs across datasets, which theoretically corresponds to a shared reparameterization of second-order temporal correlations. Our theoretical analysis further reveals token interactions with Harmonizer can be efficiently mediated by a compact set of resonators, motivating a HarmonicAttention design that performs self-attention in a low-dimensional interaction space. Then, we propose Olivia, a novel time series foundation model built upon these harmonization mechanisms. Extensive experiments on two large-scale benchmarks (TSLib and GIFT-Eval) and extra 6 datasets from GluonTS, demonstrate Olivia consistently achieves state-of-the-art performance under zero-shot, few-shot, and full-shot forecasting scenarios. Our code is available at https://github.com/TSTS13/Olivia.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces Olivia, a time series foundation model that addresses heterogeneity across pretraining datasets by proposing a Harmonizer module to reshape spectral structures and implicitly harmonize normalized power spectral densities (PSDs). This is presented as a principled reformulation corresponding to a shared reparameterization of second-order temporal correlations. The work further derives HarmonicAttention, in which token interactions are mediated by a compact set of resonators in a low-dimensional space. Extensive experiments claim state-of-the-art zero-shot, few-shot, and full-shot forecasting performance on the TSLib and GIFT-Eval benchmarks plus six additional GluonTS datasets.

Significance. If the central theoretical correspondence between PSD reshaping and second-order correlation reparameterization is rigorously derived and if controlled experiments demonstrate that the harmonization mechanism (rather than scale or the resonator attention alone) drives the reported gains, the approach could provide a signal-processing-inspired route to more robust transferable representations in time-series foundation models. The resonator-based efficiency argument and the explicit link to PSD harmonization would constitute a distinctive contribution.

major comments (2)

[Abstract and §3] Abstract and §3 (Harmonizer design): The claim that the Harmonizer 'implicitly harmonizing PSDs across datasets' and 'theoretically corresponds to a shared reparameterization of second-order temporal correlations' is load-bearing for attributing performance gains to the proposed mechanism. No derivation, explicit equations, or quantitative diagnostics (pre-/post-harmonization PSD divergence, Wasserstein distance on spectra, or controlled mismatch metrics) are supplied to verify that the reshaping actually reduces cross-dataset spectral mismatch.
[§5] §5 (Experiments): The manuscript reports consistent SOTA results across zero-/few-/full-shot regimes on TSLib, GIFT-Eval, and GluonTS. However, the absence of ablations that disable the Harmonizer while retaining HarmonicAttention and identical training scale leaves open the possibility that gains arise from the resonator attention or data scale rather than from PSD harmonization. A single controlled comparison isolating the harmonization component is required to support the central causal claim.

minor comments (2)

The GitHub repository link is welcome; the released code should include exact hyper-parameters, random seeds, and the precise preprocessing pipelines used for the six additional GluonTS datasets to ensure full reproducibility.
[§4] Notation for the resonator set and the low-dimensional interaction space in the HarmonicAttention derivation should be introduced with an explicit dimension-reduction equation (e.g., relating resonator count K to original sequence length) to clarify the claimed efficiency gain.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful review and constructive comments, which help clarify the presentation of our theoretical claims and experimental validation. We address each major point below and commit to revisions that strengthen the manuscript without altering its core contributions.

read point-by-point responses

Referee: [Abstract and §3] Abstract and §3 (Harmonizer design): The claim that the Harmonizer 'implicitly harmonizing PSDs across datasets' and 'theoretically corresponds to a shared reparameterization of second-order temporal correlations' is load-bearing for attributing performance gains to the proposed mechanism. No derivation, explicit equations, or quantitative diagnostics (pre-/post-harmonization PSD divergence, Wasserstein distance on spectra, or controlled mismatch metrics) are supplied to verify that the reshaping actually reduces cross-dataset spectral mismatch.

Authors: We appreciate the referee's focus on rigorous verification of the central claim. Section 3 derives the correspondence by noting that normalized PSD reshaping via the Harmonizer is equivalent to reparameterizing the autocorrelation function (by the Wiener-Khinchin theorem), which governs second-order temporal correlations; this is presented as a shared reparameterization across datasets. To make the argument fully explicit and address the concern, we will insert the complete step-by-step derivation with all intermediate equations in the revised §3. We will also add quantitative diagnostics, including pre- and post-harmonization PSD divergence (e.g., Wasserstein distance between spectral distributions) and cross-dataset mismatch metrics, to empirically confirm the reduction in spectral heterogeneity. revision: yes
Referee: [§5] §5 (Experiments): The manuscript reports consistent SOTA results across zero-/few-/full-shot regimes on TSLib, GIFT-Eval, and GluonTS. However, the absence of ablations that disable the Harmonizer while retaining HarmonicAttention and identical training scale leaves open the possibility that gains arise from the resonator attention or data scale rather than from PSD harmonization. A single controlled comparison isolating the harmonization component is required to support the central causal claim.

Authors: We agree that isolating the Harmonizer's contribution is necessary to support the causal attribution. The current experiments demonstrate overall gains but do not include the requested controlled ablation. In the revision we will add an ablation that removes the Harmonizer while retaining HarmonicAttention, identical model scale, and training regime, thereby directly comparing performance with and without the PSD-harmonization mechanism to confirm its role in the reported improvements. revision: yes

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The paper starts from an explicit assumption that PSD-based harmonization reduces dataset mismatches, then describes a reformulation of an intractable objective into the Harmonizer module whose action is defined to reshape spectral structures and thereby implicitly harmonize PSDs. This correspondence is presented as a theoretical consequence of the module's design rather than an independent derivation that reduces to the input data or a fitted parameter. The subsequent HarmonicAttention and Olivia model are motivated by this construction but do not claim to predict benchmark metrics by construction; reported SOTA results are empirical evaluations on TSLib, GIFT-Eval, and GluonTS. No self-citation chains, uniqueness theorems from prior author work, or renaming of known results appear as load-bearing steps in the provided abstract and motivation. The derivation remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 2 invented entities

The central claim rests on the unproven assumption that spectral-domain PSD harmonization improves transferable representations, plus the modeling choice that a compact resonator set suffices for token interactions. No explicit free parameters or invented physical entities are named in the abstract.

axioms (1)

domain assumption Harmonizing datasets via normalized PSDs in the spectral domain reduces mismatches and enhances pretraining effectiveness.
Directly stated as the motivating assumption in the abstract.

invented entities (2)

Harmonizer module no independent evidence
purpose: Reshapes spectral structures to implicitly harmonize PSDs across datasets.
Introduced as the core technical component corresponding to shared reparameterization of second-order temporal correlations.
HarmonicAttention with resonators no independent evidence
purpose: Performs self-attention in a low-dimensional interaction space mediated by a compact set of resonators.
Motivated by the theoretical analysis of token interactions under the Harmonizer.

pith-pipeline@v0.9.0 · 5766 in / 1461 out tokens · 38796 ms · 2026-05-20T14:37:10.052786+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

Harmonizer introduces learnable orthogonal temporal transformations that reorganize temporal dynamics, reshaping spectral structures and implicitly harmonizing PSDs across datasets. From a theoretical perspective, it corresponds to a shared reparameterization of second-order temporal correlations
IndisputableMonolith/Foundation/AlexanderDuality.lean alexander_duality_circle_linking unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

Proposition 1 ... shared orthogonal matrix Q ... block-diagonal moment matrix ΣX_D = diag(Λ_D, Φ_D)

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

29 extracted references · 29 canonical work pages · 9 internal anchors

[1]

arXiv preprint arXiv:2410.10393(2024)

Aksu, T., Woo, G., Liu, J., Liu, X., Liu, C., Savarese, S., Xiong, C., and Sahoo, D. Gift-eval: A benchmark for general time series forecasting model evaluation.arXiv preprint arXiv:2410.10393,

work page arXiv
[2]

URL http://jmlr.org/papers/v21/19-820. html. Ansari, A. F., Stella, L., Turkmen, C., Zhang, X., Mercado, P., Shen, H., Shchur, O., Rangapuram, S. S., Arango, S. P., Kapoor, S., et al. Chronos: Learning the language of time series.arXiv preprint arXiv:2403.07815,

work page internal anchor Pith review Pith/arXiv arXiv
[3]

Monash time series forecasting archive

Godahewa, R., Bergmeir, C., Webb, G. I., Hyndman, R. J., and Montero-Manso, P. Monash time series forecasting archive.arXiv preprint arXiv:2105.06643,

work page arXiv
[4]

Moment: A family of open time-series foundation models

Goswami, M., Szafer, K., Choudhry, A., Cai, Y ., Li, S., and Dubrawski, A. Moment: A family of open time-series foundation models.arXiv preprint arXiv:2402.03885,

work page arXiv
[5]

Sempo: Lightweight foundation models for time series forecasting.arXiv preprint arXiv:2510.19710,

He, H., Yi, K., Ma, Y ., Zhang, Q., Niu, Z., and Pang, G. Sempo: Lightweight foundation models for time series forecasting.arXiv preprint arXiv:2510.19710,

work page arXiv
[6]

Hestness, J., Narang, S., Ardalani, N., Diamos, G., Jun, H., Kianinejad, H., Patwary, M. M. A., Yang, Y ., and Zhou, Y . Deep learning scaling is predictable, empirically.arXiv preprint arXiv:1712.00409,

work page internal anchor Pith review Pith/arXiv arXiv
[7]

Scaling laws for downstream task performance of large language models

Isik, B., Ponomareva, N., Hazimeh, H., Paparas, D., Vassil- vitskii, S., and Koyejo, S. Scaling laws for downstream task performance of large language models. InICLR 2024 Workshop on Mathematical and Empirical Understanding of Foundation Models,

work page 2024
[8]

Time-LLM: Time Series Forecasting by Reprogramming Large Language Models

Jin, M., Wang, S., Ma, L., Chu, Z., Zhang, J. Y ., Shi, X., Chen, P.-Y ., Liang, Y ., Li, Y .-F., Pan, S., et al. Time-llm: Time series forecasting by reprogramming large language models.arXiv preprint arXiv:2310.01728,

work page internal anchor Pith review Pith/arXiv arXiv
[9]

Scaling Laws for Neural Language Models

Kaplan, J., McCandlish, S., Henighan, T., Brown, T. B., Chess, B., Child, R., Gray, S., Radford, A., Wu, J., and Amodei, D. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361,

work page internal anchor Pith review Pith/arXiv arXiv 2001
[10]

iTransformer: Inverted Transformers Are Effective for Time Series Forecasting

Liu, M., Zeng, A., Chen, M., Xu, Z., Lai, Q., Ma, L., and Xu, Q. Scinet: Time series modeling and forecasting with sample convolution and interaction.Advances in Neural Information Processing Systems, 35:5816–5828, 2022a. Liu, Y ., Wu, H., Wang, J., and Long, M. Non-stationary transformers: Exploring the stationarity in time series forecasting.Advances in...

work page internal anchor Pith review Pith/arXiv arXiv
[11]

Timer: Generative pre-trained transformers are large time series models, 2024

Liu, Y ., Zhang, H., Li, C., Huang, X., Wang, J., and Long, M. Timer: Generative pre-trained transformers are large time series models.arXiv preprint arXiv:2402.02368,

work page arXiv
[12]

Sundial: A Family of Highly Capable Time Series Foundation Models

Liu, Y ., Qin, G., Shi, Z., Chen, Z., Yang, C., Huang, X., Wang, J., and Long, M. Sundial: A family of highly capable time series foundation models.arXiv preprint arXiv:2502.00816,

work page internal anchor Pith review Pith/arXiv arXiv
[13]

A Time Series is Worth 64 Words: Long-term Forecasting with Transformers

Nie, Y . A time series is worth 64words: Long-term forecast- ing with transformers.arXiv preprint arXiv:2211.14730,

work page internal anchor Pith review Pith/arXiv arXiv
[14]

and Klabjan, D

Qian, X. and Klabjan, D. The impact of the mini-batch size on the variance of gradients in stochastic gradient descent. arXiv preprint arXiv:2004.13146,

work page arXiv 2004
[15]

Time-moe: Billion-scale time series foundation models with mixture of experts.arXiv preprint arXiv:2409.16040, 2024

Shi, X., Wang, S., Nie, Y ., Li, D., Ye, Z., Wen, Q., and Jin, M. Time-moe: Billion-scale time series founda- tion models with mixture of experts.arXiv preprint arXiv:2409.16040,

work page arXiv
[16]

Scaling laws in patchification: An im- age is worth 50,176 tokens and more.arXiv preprint arXiv:2502.03738, 2025a

Wang, F., Yu, Y ., Wei, G., Shao, W., Zhou, Y ., Yuille, A., and Xie, C. Scaling laws in patchification: An im- age is worth 50,176 tokens and more.arXiv preprint arXiv:2502.03738, 2025a. Wang, H., Li, J., Wu, H., Hovy, E., and Sun, Y . Pre-trained language models and their applications.Engineering, 25: 51–65,

work page arXiv
[17]

Deep Time Series Models: A Comprehensive Survey and Benchmark

Wang, Y ., Wu, H., Dong, J., Liu, Y ., Wang, C., Long, M., and Wang, J. Deep time series models: A comprehensive survey and benchmark.arXiv preprint arXiv:2407.13278,

work page internal anchor Pith review Pith/arXiv arXiv
[18]

TimesNet: Temporal 2D-Variation Modeling for General Time Series Analysis

Wu, H., Hu, T., Liu, Y ., Zhou, H., Wang, J., and Long, M. Timesnet: Temporal 2d-variation modeling for general time series analysis.arXiv preprint arXiv:2210.02186,

work page internal anchor Pith review Pith/arXiv arXiv
[19]

This proves the claimed entrywise error bound: [CZ]ℓ,ℓ′ −[C (r) Z ]ℓ,ℓ′ ≤ ∥Φ∥ 2 ∥Bℓ∥F ∥Bℓ′∥F . B. Details of Pretraining, Tuning, and Inference Stages Pretraining stage.As illustrated in Figure 7, Olivia is pretrained under a reconstruction objective, enabling it to learn domain-agnostic temporal representations from large-scale heterogeneous datasets. Gi...

work page 2025
[20]

of˜x(n) is computed as: S˜x(n) (ωk) = 1 T T−1X t=0 ˜x(n) t e−jωkt 2 ,(34) whereω k = 2πk T , k= 0,1, . . . ,⌊T /2⌋. Dataset-Level PSD Aggregation.The normalized dataset-level PSD for domain dataset D is defined as the average of individual periodograms over all sampled windows, normalized to form a probability distribution over frequencies: PD(ωk) = 1 N P...

work page 2002
[21]

between pairs of dataset-level normalized PSD distributions. Given two domain datasets Di and 16 Olivia: Harmonizing Time Series Foundation Models with Power Spectral Density Dj, let PDi (ωk) and PDj (ωk) denote their normalized dataset-level PSDs estimated as described in Appendix C.2, where ωk = 2πk/Tandk= 0,1, . . . ,⌊T /2⌋. The pairwise JS divergence ...

work page 1951
[22]

Without the Harmonizer, substantial JS divergences are observed across all domain pairs, indicating pronounced domain-level heterogeneity in spectral structure

Table 5 compares cross-domain Jensen–Shannon (JS) divergence of normalized PSD distributions before and after applying the Harmonizer, with and without RevIN. Without the Harmonizer, substantial JS divergences are observed across all domain pairs, indicating pronounced domain-level heterogeneity in spectral structure. Applying RevIN alone does not consist...

work page 2025
[23]

To ensure data quality and consistency, missing values are systematically handled using linear interpolation

is a curated collection of time series data assembled from a combination of publicly available online repositories and empirical data obtained from real-world machine operations. To ensure data quality and consistency, missing values are systematically handled using linear interpolation. All datasets follow a unified data storage format based on the Parqu...

work page 2024
[24]

All datasets are selected from the UTSD data repositories (Liu et al., 2024)

for evaluation, a large-scale 18 Olivia: Harmonizing Time Series Foundation Models with Power Spectral Density Table 7.List of pretraining datasets. All datasets are selected from the UTSD data repositories (Liu et al., 2024). Dataset attributes follow the original UTSD specifications. Dataset Domain Resolution Time Points File Size ADF. Forecast. Source ...

work page 2024
[25]

KDD Cup 2018 Nature Hourly 2.94M 12M -10.107 0.362 (Godahewa et al.,

work page 2018
[26]

and comprehensive benchmark for assessing general time series forecasting models, particularly in zero-shot settings. It comprises 23 datasets with over 144,000 time series and 177 million data points, covering seven application domains and 10 temporal frequencies, and supports multivariate inputs with forecasting horizons ranging from short- to long-term...

work page 2020
[27]

A similar pattern is observed for SEMPO, where SEMPOB (T= 512 , 6.5M parameters) often matches or outperforms its larger counterparts SEMPOE (T= 1024 , 7.3M parameters) and SEMPOA (T= 1536 , 9.9M parameters). Notably, despite using fewer parameters, Olivia variants achieve forecasting performance comparable to, and in many cases better than, SEMPO variant...

work page 2020
[28]

In these areas, larger datasets primarily improve coverage of semantic or visual variability

and computer vision (Wang et al., 2025a; Zhai et al., 2022). In these areas, larger datasets primarily improve coverage of semantic or visual variability. Compared with language and image data, time series data are generated by underlying dynamical systems and exhibit stronger temporal dependencies, domain-specific time scales, and structured spectral pro...

work page arXiv 2022
[29]

Consistent with the observations on ETTh1, Olivia achieves competitive forecasting accuracy on Weather while maintaining a compact model size (5.1M parameters). Although its inference time is higher than that of SEMPO, this overhead primarily stems from the Harmonizer module, where the orthogonal temporal transformation 28 Olivia: Harmonizing Time Series ...

work page arXiv

[1] [1]

arXiv preprint arXiv:2410.10393(2024)

Aksu, T., Woo, G., Liu, J., Liu, X., Liu, C., Savarese, S., Xiong, C., and Sahoo, D. Gift-eval: A benchmark for general time series forecasting model evaluation.arXiv preprint arXiv:2410.10393,

work page arXiv

[2] [2]

URL http://jmlr.org/papers/v21/19-820. html. Ansari, A. F., Stella, L., Turkmen, C., Zhang, X., Mercado, P., Shen, H., Shchur, O., Rangapuram, S. S., Arango, S. P., Kapoor, S., et al. Chronos: Learning the language of time series.arXiv preprint arXiv:2403.07815,

work page internal anchor Pith review Pith/arXiv arXiv

[3] [3]

Monash time series forecasting archive

Godahewa, R., Bergmeir, C., Webb, G. I., Hyndman, R. J., and Montero-Manso, P. Monash time series forecasting archive.arXiv preprint arXiv:2105.06643,

work page arXiv

[4] [4]

Moment: A family of open time-series foundation models

Goswami, M., Szafer, K., Choudhry, A., Cai, Y ., Li, S., and Dubrawski, A. Moment: A family of open time-series foundation models.arXiv preprint arXiv:2402.03885,

work page arXiv

[5] [5]

Sempo: Lightweight foundation models for time series forecasting.arXiv preprint arXiv:2510.19710,

He, H., Yi, K., Ma, Y ., Zhang, Q., Niu, Z., and Pang, G. Sempo: Lightweight foundation models for time series forecasting.arXiv preprint arXiv:2510.19710,

work page arXiv

[6] [6]

Hestness, J., Narang, S., Ardalani, N., Diamos, G., Jun, H., Kianinejad, H., Patwary, M. M. A., Yang, Y ., and Zhou, Y . Deep learning scaling is predictable, empirically.arXiv preprint arXiv:1712.00409,

work page internal anchor Pith review Pith/arXiv arXiv

[7] [7]

Scaling laws for downstream task performance of large language models

Isik, B., Ponomareva, N., Hazimeh, H., Paparas, D., Vassil- vitskii, S., and Koyejo, S. Scaling laws for downstream task performance of large language models. InICLR 2024 Workshop on Mathematical and Empirical Understanding of Foundation Models,

work page 2024

[8] [8]

Time-LLM: Time Series Forecasting by Reprogramming Large Language Models

Jin, M., Wang, S., Ma, L., Chu, Z., Zhang, J. Y ., Shi, X., Chen, P.-Y ., Liang, Y ., Li, Y .-F., Pan, S., et al. Time-llm: Time series forecasting by reprogramming large language models.arXiv preprint arXiv:2310.01728,

work page internal anchor Pith review Pith/arXiv arXiv

[9] [9]

Scaling Laws for Neural Language Models

Kaplan, J., McCandlish, S., Henighan, T., Brown, T. B., Chess, B., Child, R., Gray, S., Radford, A., Wu, J., and Amodei, D. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361,

work page internal anchor Pith review Pith/arXiv arXiv 2001

[10] [10]

iTransformer: Inverted Transformers Are Effective for Time Series Forecasting

Liu, M., Zeng, A., Chen, M., Xu, Z., Lai, Q., Ma, L., and Xu, Q. Scinet: Time series modeling and forecasting with sample convolution and interaction.Advances in Neural Information Processing Systems, 35:5816–5828, 2022a. Liu, Y ., Wu, H., Wang, J., and Long, M. Non-stationary transformers: Exploring the stationarity in time series forecasting.Advances in...

work page internal anchor Pith review Pith/arXiv arXiv

[11] [11]

Timer: Generative pre-trained transformers are large time series models, 2024

Liu, Y ., Zhang, H., Li, C., Huang, X., Wang, J., and Long, M. Timer: Generative pre-trained transformers are large time series models.arXiv preprint arXiv:2402.02368,

work page arXiv

[12] [12]

Sundial: A Family of Highly Capable Time Series Foundation Models

Liu, Y ., Qin, G., Shi, Z., Chen, Z., Yang, C., Huang, X., Wang, J., and Long, M. Sundial: A family of highly capable time series foundation models.arXiv preprint arXiv:2502.00816,

work page internal anchor Pith review Pith/arXiv arXiv

[13] [13]

A Time Series is Worth 64 Words: Long-term Forecasting with Transformers

Nie, Y . A time series is worth 64words: Long-term forecast- ing with transformers.arXiv preprint arXiv:2211.14730,

work page internal anchor Pith review Pith/arXiv arXiv

[14] [14]

and Klabjan, D

Qian, X. and Klabjan, D. The impact of the mini-batch size on the variance of gradients in stochastic gradient descent. arXiv preprint arXiv:2004.13146,

work page arXiv 2004

[15] [15]

Time-moe: Billion-scale time series foundation models with mixture of experts.arXiv preprint arXiv:2409.16040, 2024

Shi, X., Wang, S., Nie, Y ., Li, D., Ye, Z., Wen, Q., and Jin, M. Time-moe: Billion-scale time series founda- tion models with mixture of experts.arXiv preprint arXiv:2409.16040,

work page arXiv

[16] [16]

Scaling laws in patchification: An im- age is worth 50,176 tokens and more.arXiv preprint arXiv:2502.03738, 2025a

Wang, F., Yu, Y ., Wei, G., Shao, W., Zhou, Y ., Yuille, A., and Xie, C. Scaling laws in patchification: An im- age is worth 50,176 tokens and more.arXiv preprint arXiv:2502.03738, 2025a. Wang, H., Li, J., Wu, H., Hovy, E., and Sun, Y . Pre-trained language models and their applications.Engineering, 25: 51–65,

work page arXiv

[17] [17]

Deep Time Series Models: A Comprehensive Survey and Benchmark

Wang, Y ., Wu, H., Dong, J., Liu, Y ., Wang, C., Long, M., and Wang, J. Deep time series models: A comprehensive survey and benchmark.arXiv preprint arXiv:2407.13278,

work page internal anchor Pith review Pith/arXiv arXiv

[18] [18]

TimesNet: Temporal 2D-Variation Modeling for General Time Series Analysis

Wu, H., Hu, T., Liu, Y ., Zhou, H., Wang, J., and Long, M. Timesnet: Temporal 2d-variation modeling for general time series analysis.arXiv preprint arXiv:2210.02186,

work page internal anchor Pith review Pith/arXiv arXiv

[19] [19]

This proves the claimed entrywise error bound: [CZ]ℓ,ℓ′ −[C (r) Z ]ℓ,ℓ′ ≤ ∥Φ∥ 2 ∥Bℓ∥F ∥Bℓ′∥F . B. Details of Pretraining, Tuning, and Inference Stages Pretraining stage.As illustrated in Figure 7, Olivia is pretrained under a reconstruction objective, enabling it to learn domain-agnostic temporal representations from large-scale heterogeneous datasets. Gi...

work page 2025

[20] [20]

of˜x(n) is computed as: S˜x(n) (ωk) = 1 T T−1X t=0 ˜x(n) t e−jωkt 2 ,(34) whereω k = 2πk T , k= 0,1, . . . ,⌊T /2⌋. Dataset-Level PSD Aggregation.The normalized dataset-level PSD for domain dataset D is defined as the average of individual periodograms over all sampled windows, normalized to form a probability distribution over frequencies: PD(ωk) = 1 N P...

work page 2002

[21] [21]

between pairs of dataset-level normalized PSD distributions. Given two domain datasets Di and 16 Olivia: Harmonizing Time Series Foundation Models with Power Spectral Density Dj, let PDi (ωk) and PDj (ωk) denote their normalized dataset-level PSDs estimated as described in Appendix C.2, where ωk = 2πk/Tandk= 0,1, . . . ,⌊T /2⌋. The pairwise JS divergence ...

work page 1951

[22] [22]

Without the Harmonizer, substantial JS divergences are observed across all domain pairs, indicating pronounced domain-level heterogeneity in spectral structure

Table 5 compares cross-domain Jensen–Shannon (JS) divergence of normalized PSD distributions before and after applying the Harmonizer, with and without RevIN. Without the Harmonizer, substantial JS divergences are observed across all domain pairs, indicating pronounced domain-level heterogeneity in spectral structure. Applying RevIN alone does not consist...

work page 2025

[23] [23]

To ensure data quality and consistency, missing values are systematically handled using linear interpolation

is a curated collection of time series data assembled from a combination of publicly available online repositories and empirical data obtained from real-world machine operations. To ensure data quality and consistency, missing values are systematically handled using linear interpolation. All datasets follow a unified data storage format based on the Parqu...

work page 2024

[24] [24]

All datasets are selected from the UTSD data repositories (Liu et al., 2024)

for evaluation, a large-scale 18 Olivia: Harmonizing Time Series Foundation Models with Power Spectral Density Table 7.List of pretraining datasets. All datasets are selected from the UTSD data repositories (Liu et al., 2024). Dataset attributes follow the original UTSD specifications. Dataset Domain Resolution Time Points File Size ADF. Forecast. Source ...

work page 2024

[25] [25]

KDD Cup 2018 Nature Hourly 2.94M 12M -10.107 0.362 (Godahewa et al.,

work page 2018

[26] [26]

and comprehensive benchmark for assessing general time series forecasting models, particularly in zero-shot settings. It comprises 23 datasets with over 144,000 time series and 177 million data points, covering seven application domains and 10 temporal frequencies, and supports multivariate inputs with forecasting horizons ranging from short- to long-term...

work page 2020

[27] [27]

A similar pattern is observed for SEMPO, where SEMPOB (T= 512 , 6.5M parameters) often matches or outperforms its larger counterparts SEMPOE (T= 1024 , 7.3M parameters) and SEMPOA (T= 1536 , 9.9M parameters). Notably, despite using fewer parameters, Olivia variants achieve forecasting performance comparable to, and in many cases better than, SEMPO variant...

work page 2020

[28] [28]

In these areas, larger datasets primarily improve coverage of semantic or visual variability

and computer vision (Wang et al., 2025a; Zhai et al., 2022). In these areas, larger datasets primarily improve coverage of semantic or visual variability. Compared with language and image data, time series data are generated by underlying dynamical systems and exhibit stronger temporal dependencies, domain-specific time scales, and structured spectral pro...

work page arXiv 2022

[29] [29]

Consistent with the observations on ETTh1, Olivia achieves competitive forecasting accuracy on Weather while maintaining a compact model size (5.1M parameters). Although its inference time is higher than that of SEMPO, this overhead primarily stems from the Harmonizer module, where the orthogonal temporal transformation 28 Olivia: Harmonizing Time Series ...

work page arXiv