Olivia: Harmonizing Time Series Foundation Models with Power Spectral Density
Pith reviewed 2026-05-20 14:37 UTC · model grok-4.3
The pith
Olivia harmonizes time series datasets through power spectral density matching to produce stronger transferable representations for foundation model pretraining.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Harmonizing datasets via PSDs in the spectral domain reduces mismatches and enhances pretraining effectiveness. The Harmonizer module reshapes spectral structures and implicitly harmonizes PSDs across datasets, which theoretically corresponds to a shared reparameterization of second-order temporal correlations. Token interactions with the Harmonizer can be efficiently mediated by a compact set of resonators, motivating HarmonicAttention that performs self-attention in a low-dimensional interaction space. Olivia, built on these mechanisms, achieves state-of-the-art performance on TSLib, GIFT-Eval, and six additional GluonTS datasets under zero-shot, few-shot, and full-shot forecasting.
What carries the argument
The Harmonizer module, which reshapes spectral structures to implicitly harmonize normalized power spectral densities and thereby reparameterize second-order temporal correlations across datasets.
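The link between a PSD and second-order temporal correlations is the Wiener-Khinchin relation, which the simulated rebuttal on this page also invokes. The following is a small numerical check of the finite-length, circular form of that relation. It is not code from the paper; the function names and the random test signal are illustrative assumptions.

```python
import numpy as np

def periodogram(x):
    """Two-sided periodogram S[k] = |FFT(x)[k]|^2 / T."""
    T = len(x)
    return np.abs(np.fft.fft(x)) ** 2 / T

def circular_autocorrelation(x):
    """Circular autocorrelation r[tau] = (1/T) * sum_t x[t] * x[(t + tau) mod T]."""
    T = len(x)
    return np.array([np.dot(x, np.roll(x, -tau)) / T for tau in range(T)])

# Hypothetical zero-mean test signal (an assumption, not data from the paper).
rng = np.random.default_rng(0)
x = rng.standard_normal(64)
x = x - x.mean()

psd = periodogram(x)
acf_from_psd = np.fft.ifft(psd).real   # inverse DFT of the periodogram
acf_direct = circular_autocorrelation(x)

# The two estimates agree up to floating-point error.
print(np.max(np.abs(acf_from_psd - acf_direct)))
```

Because the periodogram and the circular autocorrelation are a DFT pair, any operation that reshapes one necessarily reshapes the other, which is the sense in which a spectral change can be read as a change to second-order correlations.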
If this is right
- Harmonizer enables more effective pretraining on heterogeneous collections by aligning spectral properties without explicit pairwise optimization.
- HarmonicAttention reduces token interaction complexity to a low-dimensional resonator space while preserving the benefits of the harmonized representations.
- The same reparameterization of second-order correlations supports consistent gains in zero-shot, few-shot, and full-shot regimes on both TSLib and GIFT-Eval.
- The approach generalizes across six additional GluonTS datasets without domain-specific retraining.
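To make the complexity argument behind the resonator idea concrete, here is a generic latent-bottleneck attention sketch in which N tokens interact only through K latent vectors, so the cost scales with N·K rather than N². This is not the paper's HarmonicAttention, whose exact form is not given on this page. The function name, the absence of learned projections, and the random inputs are all simplifying assumptions.

```python
import numpy as np

def softmax(z, axis=-1):
    """Numerically stable softmax along the given axis."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def latent_mediated_attention(tokens, latents):
    """tokens: (N, d); latents: (K, d) with K much smaller than N. Returns (N, d).

    Hypothetical stand-in for resonator-mediated attention: tokens never
    attend to each other directly, only through the K latent summaries.
    """
    d = tokens.shape[1]
    # Step 1: each latent reads from all N tokens, giving (K, N) weights.
    gather = softmax(latents @ tokens.T / np.sqrt(d), axis=-1)
    summary = gather @ tokens                      # (K, d)
    # Step 2: each token reads from the K summaries, giving (N, K) weights.
    scatter = softmax(tokens @ summary.T / np.sqrt(d), axis=-1)
    return scatter @ summary                       # (N, d)

rng = np.random.default_rng(0)
tokens = rng.standard_normal((128, 8))   # N = 128 tokens of width d = 8
latents = rng.standard_normal((4, 8))    # K = 4 latent vectors
out = latent_mediated_attention(tokens, latents)
print(out.shape)
```

Both matrix products involve an N-by-K or K-by-N weight matrix, so memory and compute grow linearly in the number of tokens for fixed K.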
Where Pith is reading between the lines
- The resonator-based attention could be tested for transfer to non-time-series sequence tasks that also exhibit frequency structure.
- If the spectral alignment proves robust, the same Harmonizer could be inserted into existing foundation models to improve their cross-domain performance with minimal added cost.
- A direct comparison of learned embeddings before and after Harmonizer application on mixed-domain batches would quantify how much the second-order correlation reparameterization actually reduces distribution shift.
Load-bearing premise
The assumption that harmonizing datasets via PSDs in the spectral domain reduces mismatches and enhances pretraining effectiveness.
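The appendix excerpt reproduced in the reference graph further down this page defines a per-window periodogram over frequencies ω_k = 2πk/T for k = 0, ..., ⌊T/2⌋, and a dataset-level PSD that averages those periodograms over sampled windows and normalizes the result into a probability distribution over frequencies. A minimal NumPy sketch of that estimate follows. It assumes the windows are already preprocessed, since the excerpt does not show that step, and the names `window_periodogram`, `dataset_psd`, and `eps` are illustrative rather than taken from the Olivia repository.

```python
import numpy as np

def window_periodogram(windows):
    """Periodogram |sum_t x_t exp(-j w_k t)|^2 / T for k = 0..floor(T/2).

    windows: array of shape (N, T), one window per row.
    """
    T = windows.shape[-1]
    return np.abs(np.fft.rfft(windows, axis=-1)) ** 2 / T

def dataset_psd(windows, eps=1e-12):
    """Average the per-window periodograms, then normalize to sum to one."""
    mean_spectrum = window_periodogram(windows).mean(axis=0)
    return mean_spectrum / (mean_spectrum.sum() + eps)  # eps guards all-zero input

# Hypothetical dataset: three phase-shifted sinusoids at frequency bin 5.
t = np.arange(64)
windows = np.stack([np.sin(2 * np.pi * 5 * t / 64 + phase) for phase in (0.0, 0.7, 1.9)])
p = dataset_psd(windows)
print(p.shape, int(np.argmax(p)))
```

Normalizing to a probability distribution discards overall signal power, so two datasets that differ only in amplitude end up with the same normalized PSD.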
What would settle it
Train an otherwise identical foundation model without the Harmonizer and evaluate it on the TSLib benchmark. If its zero-shot and few-shot forecasting results match or beat those reported for Olivia, the PSD harmonization step is not what drives the gains; if they are clearly worse, the step is contributing as claimed.
Original abstract
Time series foundation models rely on large-scale pretraining over diverse datasets across domains, yet the heterogeneity of temporal patterns across these datasets could hinder effective training and the learning of transferable time series representations. Inspired by a fundamental concept in signal processing, the normalized power spectral density (PSD), we assume that harmonizing datasets via PSDs in the spectral domain could reduce mismatches and enhance pretraining. We then go beyond the direct, intractable minimization and reformulate it as a principled harmonization approach. Specifically, we propose Harmonizer, a module that reshapes spectral structures and implicitly harmonizes PSDs across datasets, which theoretically corresponds to a shared reparameterization of second-order temporal correlations. Our theoretical analysis further reveals that token interactions with Harmonizer can be efficiently mediated by a compact set of resonators, motivating a HarmonicAttention design that performs self-attention in a low-dimensional interaction space. We then propose Olivia, a novel time series foundation model built upon these harmonization mechanisms. Extensive experiments on two large-scale benchmarks (TSLib and GIFT-Eval) and six additional datasets from GluonTS demonstrate that Olivia consistently achieves state-of-the-art performance under zero-shot, few-shot, and full-shot forecasting scenarios. Our code is available at https://github.com/TSTS13/Olivia.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Olivia, a time series foundation model that addresses heterogeneity across pretraining datasets by proposing a Harmonizer module to reshape spectral structures and implicitly harmonize normalized power spectral densities (PSDs). This is presented as a principled reformulation corresponding to a shared reparameterization of second-order temporal correlations. The work further derives HarmonicAttention, in which token interactions are mediated by a compact set of resonators in a low-dimensional space. Extensive experiments claim state-of-the-art zero-shot, few-shot, and full-shot forecasting performance on the TSLib and GIFT-Eval benchmarks plus six additional GluonTS datasets.
Significance. If the central theoretical correspondence between PSD reshaping and second-order correlation reparameterization is rigorously derived and if controlled experiments demonstrate that the harmonization mechanism (rather than scale or the resonator attention alone) drives the reported gains, the approach could provide a signal-processing-inspired route to more robust transferable representations in time-series foundation models. The resonator-based efficiency argument and the explicit link to PSD harmonization would constitute a distinctive contribution.
major comments (2)
- [Abstract and §3] Abstract and §3 (Harmonizer design): The claim that the Harmonizer 'implicitly harmonizing PSDs across datasets' and 'theoretically corresponds to a shared reparameterization of second-order temporal correlations' is load-bearing for attributing performance gains to the proposed mechanism. No derivation, explicit equations, or quantitative diagnostics (pre-/post-harmonization PSD divergence, Wasserstein distance on spectra, or controlled mismatch metrics) are supplied to verify that the reshaping actually reduces cross-dataset spectral mismatch.
- [§5] §5 (Experiments): The manuscript reports consistent SOTA results across zero-/few-/full-shot regimes on TSLib, GIFT-Eval, and GluonTS. However, the absence of ablations that disable the Harmonizer while retaining HarmonicAttention and identical training scale leaves open the possibility that gains arise from the resonator attention or data scale rather than from PSD harmonization. A single controlled comparison isolating the harmonization component is required to support the central causal claim.
minor comments (2)
- The GitHub repository link is welcome; the released code should include exact hyper-parameters, random seeds, and the precise preprocessing pipelines used for the six additional GluonTS datasets to ensure full reproducibility.
- [§4] Notation for the resonator set and the low-dimensional interaction space in the HarmonicAttention derivation should be introduced with an explicit dimension-reduction equation (e.g., relating resonator count K to original sequence length) to clarify the claimed efficiency gain.
Simulated Author's Rebuttal
We thank the referee for the careful review and constructive comments, which help clarify the presentation of our theoretical claims and experimental validation. We address each major point below and commit to revisions that strengthen the manuscript without altering its core contributions.
- Referee: [Abstract and §3] Abstract and §3 (Harmonizer design): The claim that the Harmonizer 'implicitly harmonizing PSDs across datasets' and 'theoretically corresponds to a shared reparameterization of second-order temporal correlations' is load-bearing for attributing performance gains to the proposed mechanism. No derivation, explicit equations, or quantitative diagnostics (pre-/post-harmonization PSD divergence, Wasserstein distance on spectra, or controlled mismatch metrics) are supplied to verify that the reshaping actually reduces cross-dataset spectral mismatch.
  Authors: We appreciate the referee's focus on rigorous verification of the central claim. Section 3 derives the correspondence by noting that normalized PSD reshaping via the Harmonizer is equivalent to reparameterizing the autocorrelation function (by the Wiener-Khinchin theorem), which governs second-order temporal correlations; this is presented as a shared reparameterization across datasets. To make the argument fully explicit and address the concern, we will insert the complete step-by-step derivation with all intermediate equations in the revised §3. We will also add quantitative diagnostics, including pre- and post-harmonization PSD divergence (e.g., Wasserstein distance between spectral distributions) and cross-dataset mismatch metrics, to empirically confirm the reduction in spectral heterogeneity.
  Revision: yes
- Referee: [§5] §5 (Experiments): The manuscript reports consistent SOTA results across zero-/few-/full-shot regimes on TSLib, GIFT-Eval, and GluonTS. However, the absence of ablations that disable the Harmonizer while retaining HarmonicAttention and identical training scale leaves open the possibility that gains arise from the resonator attention or data scale rather than from PSD harmonization. A single controlled comparison isolating the harmonization component is required to support the central causal claim.
  Authors: We agree that isolating the Harmonizer's contribution is necessary to support the causal attribution. The current experiments demonstrate overall gains but do not include the requested controlled ablation. In the revision we will add an ablation that removes the Harmonizer while retaining HarmonicAttention, identical model scale, and training regime, thereby directly comparing performance with and without the PSD-harmonization mechanism to confirm its role in the reported improvements.
  Revision: yes
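The diagnostics discussed in the exchange above compare normalized PSD distributions across datasets, and the appendix excerpt in the reference graph mentions a pairwise Jensen-Shannon (JS) divergence for the same purpose. A minimal sketch of both a JS divergence and a 1-Wasserstein distance on a shared, equally spaced frequency grid is given below. The natural logarithm, the `eps` smoothing, the unit grid spacing, and the function names are assumptions, since the excerpt is truncated before the exact definition.

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) with natural log; eps avoids log(0) and division by zero."""
    return float(np.sum(p * np.log((p + eps) / (q + eps))))

def js_divergence(p, q):
    """Jensen-Shannon divergence: symmetric and bounded above by ln(2)."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    m = 0.5 * (p + q)
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

def wasserstein_1(p, q, spacing=1.0):
    """1-Wasserstein distance between distributions on the same uniform grid.

    In one dimension this is the summed absolute difference of the CDFs
    multiplied by the grid spacing.
    """
    cdf_gap = np.cumsum(np.asarray(p, dtype=float) - np.asarray(q, dtype=float))
    return float(spacing * np.sum(np.abs(cdf_gap)))

# Hypothetical normalized PSDs over four frequency bins.
psd_a = np.array([0.7, 0.2, 0.1, 0.0])
psd_b = np.array([0.1, 0.2, 0.3, 0.4])
print(js_divergence(psd_a, psd_b), wasserstein_1(psd_a, psd_b))
```

The two measures answer different questions: JS divergence ignores how far apart two spectral peaks are, while the Wasserstein distance grows with the frequency gap between them.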
Circularity Check
No significant circularity in derivation chain
The paper starts from an explicit assumption that PSD-based harmonization reduces dataset mismatches, then describes a reformulation of an intractable objective into the Harmonizer module whose action is defined to reshape spectral structures and thereby implicitly harmonize PSDs. This correspondence is presented as a theoretical consequence of the module's design rather than an independent derivation that reduces to the input data or a fitted parameter. The subsequent HarmonicAttention and Olivia model are motivated by this construction but do not claim to predict benchmark metrics by construction; reported SOTA results are empirical evaluations on TSLib, GIFT-Eval, and GluonTS. No self-citation chains, uniqueness theorems from prior author work, or renaming of known results appear as load-bearing steps in the provided abstract and motivation. The derivation remains self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Harmonizing datasets via normalized PSDs in the spectral domain reduces mismatches and enhances pretraining effectiveness.
invented entities (2)
- Harmonizer module: no independent evidence
- HarmonicAttention with resonators: no independent evidence
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, theorem washburn_uniqueness_aczel
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Passage: Harmonizer introduces learnable orthogonal temporal transformations that reorganize temporal dynamics, reshaping spectral structures and implicitly harmonizing PSDs across datasets. From a theoretical perspective, it corresponds to a shared reparameterization of second-order temporal correlations
- IndisputableMonolith/Foundation/AlexanderDuality.lean, theorem alexander_duality_circle_linking
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Passage: Proposition 1 ... shared orthogonal matrix Q ... block-diagonal moment matrix ΣX_D = diag(Λ_D, Φ_D)
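The quoted passages describe orthogonal temporal transformations acting on a second-order moment matrix. The sketch below illustrates only the generic linear-algebra fact involved: applying an orthogonal Q along the time axis maps the moment matrix Σ to QΣQᵀ, which rearranges correlations while preserving the eigenvalues and the trace. The random Q stands in for a learned one, the data are synthetic, and nothing here reproduces the paper's Proposition 1.

```python
import numpy as np

rng = np.random.default_rng(0)
T, N = 16, 500

# Hypothetical batch of N windows of length T (random walks, one per row).
X = rng.standard_normal((N, T)).cumsum(axis=1)

# Random orthogonal matrix from a QR factorization; a stand-in for a learned Q.
Q, _ = np.linalg.qr(rng.standard_normal((T, T)))

sigma = X.T @ X / N        # second-order moment matrix, shape (T, T)
Z = X @ Q.T                # apply z = Q x to every window
sigma_z = Z.T @ Z / N      # equals Q @ sigma @ Q.T

# One particular orthogonal choice, the transposed eigenvector matrix,
# makes the transformed moment matrix diagonal.
eigvals, eigvecs = np.linalg.eigh(sigma)
Q_diag = eigvecs.T
sigma_diag = Q_diag @ sigma @ Q_diag.T

print(np.trace(sigma), np.trace(sigma_z))
```

Because the trace is unchanged, the transformation redistributes second-order structure across coordinates without adding or removing total energy.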
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
- [1] Aksu, T., Woo, G., Liu, J., Liu, X., Liu, C., Savarese, S., Xiong, C., and Sahoo, D. Gift-eval: A benchmark for general time series forecasting model evaluation. arXiv preprint arXiv:2410.10393, 2024.
- [2] URL http://jmlr.org/papers/v21/19-820.html. Ansari, A. F., Stella, L., Turkmen, C., Zhang, X., Mercado, P., Shen, H., Shchur, O., Rangapuram, S. S., Arango, S. P., Kapoor, S., et al. Chronos: Learning the language of time series. arXiv preprint arXiv:2403.07815.
- [3] Godahewa, R., Bergmeir, C., Webb, G. I., Hyndman, R. J., and Montero-Manso, P. Monash time series forecasting archive. arXiv preprint arXiv:2105.06643.
- [4] Goswami, M., Szafer, K., Choudhry, A., Cai, Y., Li, S., and Dubrawski, A. Moment: A family of open time-series foundation models. arXiv preprint arXiv:2402.03885.
- [5] He, H., Yi, K., Ma, Y., Zhang, Q., Niu, Z., and Pang, G. Sempo: Lightweight foundation models for time series forecasting. arXiv preprint arXiv:2510.19710.
- [6] Hestness, J., Narang, S., Ardalani, N., Diamos, G., Jun, H., Kianinejad, H., Patwary, M. M. A., Yang, Y., and Zhou, Y. Deep learning scaling is predictable, empirically. arXiv preprint arXiv:1712.00409.
- [7] Isik, B., Ponomareva, N., Hazimeh, H., Paparas, D., Vassilvitskii, S., and Koyejo, S. Scaling laws for downstream task performance of large language models. In ICLR 2024 Workshop on Mathematical and Empirical Understanding of Foundation Models, 2024.
- [8] Jin, M., Wang, S., Ma, L., Chu, Z., Zhang, J. Y., Shi, X., Chen, P.-Y., Liang, Y., Li, Y.-F., Pan, S., et al. Time-llm: Time series forecasting by reprogramming large language models. arXiv preprint arXiv:2310.01728.
- [9] Kaplan, J., McCandlish, S., Henighan, T., Brown, T. B., Chess, B., Child, R., Gray, S., Radford, A., Wu, J., and Amodei, D. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361.
- [10] iTransformer: Inverted Transformers Are Effective for Time Series Forecasting
  Liu, M., Zeng, A., Chen, M., Xu, Z., Lai, Q., Ma, L., and Xu, Q. Scinet: Time series modeling and forecasting with sample convolution and interaction. Advances in Neural Information Processing Systems, 35:5816–5828, 2022a. Liu, Y., Wu, H., Wang, J., and Long, M. Non-stationary transformers: Exploring the stationarity in time series forecasting. Advances in...
- [11] Liu, Y., Zhang, H., Li, C., Huang, X., Wang, J., and Long, M. Timer: Generative pre-trained transformers are large time series models. arXiv preprint arXiv:2402.02368, 2024.
- [12] Liu, Y., Qin, G., Shi, Z., Chen, Z., Yang, C., Huang, X., Wang, J., and Long, M. Sundial: A family of highly capable time series foundation models. arXiv preprint arXiv:2502.00816.
- [13] Nie, Y. A time series is worth 64 words: Long-term forecasting with transformers. arXiv preprint arXiv:2211.14730.
- [14] Qian, X. and Klabjan, D. The impact of the mini-batch size on the variance of gradients in stochastic gradient descent. arXiv preprint arXiv:2004.13146.
- [15] Shi, X., Wang, S., Nie, Y., Li, D., Ye, Z., Wen, Q., and Jin, M. Time-moe: Billion-scale time series foundation models with mixture of experts. arXiv preprint arXiv:2409.16040.
- [16] Wang, F., Yu, Y., Wei, G., Shao, W., Zhou, Y., Yuille, A., and Xie, C. Scaling laws in patchification: An image is worth 50,176 tokens and more. arXiv preprint arXiv:2502.03738, 2025a. Wang, H., Li, J., Wu, H., Hovy, E., and Sun, Y. Pre-trained language models and their applications. Engineering, 25:51–65.
- [17] Wang, Y., Wu, H., Dong, J., Liu, Y., Wang, C., Long, M., and Wang, J. Deep time series models: A comprehensive survey and benchmark. arXiv preprint arXiv:2407.13278.
- [18] Wu, H., Hu, T., Liu, Y., Zhou, H., Wang, J., and Long, M. Timesnet: Temporal 2d-variation modeling for general time series analysis. arXiv preprint arXiv:2210.02186.
- [19] This proves the claimed entrywise error bound: $\left|[C_Z]_{\ell,\ell'} - [C_Z^{(r)}]_{\ell,\ell'}\right| \le \|\Phi\|_2 \, \|B_\ell\|_F \, \|B_{\ell'}\|_F$.
  B. Details of Pretraining, Tuning, and Inference Stages
  Pretraining stage. As illustrated in Figure 7, Olivia is pretrained under a reconstruction objective, enabling it to learn domain-agnostic temporal representations from large-scale heterogeneous datasets. Gi...
- [20] of $\tilde{x}^{(n)}$ is computed as:
  $$S_{\tilde{x}^{(n)}}(\omega_k) = \frac{1}{T}\left|\sum_{t=0}^{T-1} \tilde{x}^{(n)}_t \, e^{-j\omega_k t}\right|^2, \qquad (34)$$
  where $\omega_k = 2\pi k / T$, $k = 0, 1, \ldots, \lfloor T/2 \rfloor$.
  Dataset-Level PSD Aggregation. The normalized dataset-level PSD for domain dataset $D$ is defined as the average of individual periodograms over all sampled windows, normalized to form a probability distribution over frequencies: $P_D(\omega_k) = \frac{1}{N}\sum$ ...
- [21] between pairs of dataset-level normalized PSD distributions. Given two domain datasets $D_i$ and $D_j$, let $P_{D_i}(\omega_k)$ and $P_{D_j}(\omega_k)$ denote their normalized dataset-level PSDs estimated as described in Appendix C.2, where $\omega_k = 2\pi k / T$ and $k = 0, 1, \ldots, \lfloor T/2 \rfloor$. The pairwise JS divergence ...
- [22] Table 5 compares cross-domain Jensen–Shannon (JS) divergence of normalized PSD distributions before and after applying the Harmonizer, with and without RevIN. Without the Harmonizer, substantial JS divergences are observed across all domain pairs, indicating pronounced domain-level heterogeneity in spectral structure. Applying RevIN alone does not consist...
- [23] is a curated collection of time series data assembled from a combination of publicly available online repositories and empirical data obtained from real-world machine operations. To ensure data quality and consistency, missing values are systematically handled using linear interpolation. All datasets follow a unified data storage format based on the Parqu...
- [24] for evaluation, a large-scale
  Table 7. List of pretraining datasets. All datasets are selected from the UTSD data repositories (Liu et al., 2024). Dataset attributes follow the original UTSD specifications.
  Dataset | Domain | Resolution | Time Points | File Size | ADF. | Forecast. | Source ...
- [25] KDD Cup 2018 | Nature | Hourly | 2.94M | 12M | -10.107 | 0.362 | (Godahewa et al.,
- [26] and comprehensive benchmark for assessing general time series forecasting models, particularly in zero-shot settings. It comprises 23 datasets with over 144,000 time series and 177 million data points, covering seven application domains and 10 temporal frequencies, and supports multivariate inputs with forecasting horizons ranging from short- to long-term...
- [27] A similar pattern is observed for SEMPO, where SEMPOB (T = 512, 6.5M parameters) often matches or outperforms its larger counterparts SEMPOE (T = 1024, 7.3M parameters) and SEMPOA (T = 1536, 9.9M parameters). Notably, despite using fewer parameters, Olivia variants achieve forecasting performance comparable to, and in many cases better than, SEMPO variant...
- [28] and computer vision (Wang et al., 2025a; Zhai et al., 2022). In these areas, larger datasets primarily improve coverage of semantic or visual variability. Compared with language and image data, time series data are generated by underlying dynamical systems and exhibit stronger temporal dependencies, domain-specific time scales, and structured spectral pro...
- [29] Consistent with the observations on ETTh1, Olivia achieves competitive forecasting accuracy on Weather while maintaining a compact model size (5.1M parameters). Although its inference time is higher than that of SEMPO, this overhead primarily stems from the Harmonizer module, where the orthogonal temporal transformation ...