arXiv: 2605.17045 · v1 · pith:D33FUMP3new · submitted 2026-05-16 · 📡 eess.SY · cs.LG · cs.SY

Empirical evaluation of Time Series Foundation Models for Day-ahead and Imbalance Electricity Price Forecasting in Belgium

Pith reviewed 2026-05-20 15:26 UTC · model grok-4.3

classification 📡 eess.SY · cs.LG · cs.SY
keywords Time Series Foundation Models · Electricity Price Forecasting · Day-ahead Market · Imbalance Prices · Chronos-2 · Zero-shot Forecasting · Belgium Electricity Market · ARX Inputs

The pith

Chronos-2 in ARX mode produces the most accurate forecasts for Belgian day-ahead electricity prices, with a mean absolute error 5 percent lower than that of the best machine learning ensemble.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper conducts a systematic test of time series foundation models on forecasting day-ahead and imbalance electricity prices in Belgium. It identifies Chronos-2 operating in ARX mode as the strongest performer among the evaluated models for both markets. The model shows a 5 percent MAE reduction relative to the top machine learning ensemble on day-ahead prices, yet records roughly 10 percent higher error on imbalance prices except at the two-hour horizon. These models display genuine zero-shot forecasting ability while still facing difficulties during extreme price events. The evaluation addresses a practical gap because electricity markets require reliable forecasts to support trading decisions and grid stability in volatile conditions.

Core claim

The study evaluates Chronos-2, Chronos-Bolt, and TimesFM 2.5 for Belgian day-ahead and imbalance electricity price forecasting. Chronos-2 in ARX mode produces the most accurate forecasts among the evaluated TSFMs for both markets. Compared with the best ensemble prediction from other machine learning methods, Chronos-2's Mean Absolute Error (MAE) is 5 percent lower for the day-ahead market. In contrast, the model yields 10 percent higher MAE when predicting imbalance prices across all forecast horizons except the two-hour-ahead horizon. Moreover, the TSFMs exhibit genuine zero-shot forecasting skills but still struggle under extreme market conditions.
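The percentage figures above are relative differences between two MAE values. The sketch below shows that arithmetic on made-up numbers; the prices and forecasts are hypothetical and chosen only so that the result is a 5 percent reduction, and the helper names `mae` and `relative_mae_change` are illustrative rather than taken from the paper.

```python
from statistics import fmean


def mae(actual, forecast):
    """Mean absolute error between two equally long sequences."""
    if len(actual) != len(forecast):
        raise ValueError("series must have the same length")
    return fmean(abs(a - f) for a, f in zip(actual, forecast))


def relative_mae_change(mae_model, mae_baseline):
    """Signed relative difference; negative means the model beats the baseline."""
    return (mae_model - mae_baseline) / mae_baseline


# Hypothetical prices in EUR/MWh, not data from the paper.
actual = [50.0, 60.0, 80.0, 40.0]
baseline_forecast = [60.0, 50.0, 90.0, 30.0]  # every absolute error is 10.0
model_forecast = [59.5, 50.5, 89.5, 30.5]     # every absolute error is 9.5

mae_baseline = mae(actual, baseline_forecast)
mae_model = mae(actual, model_forecast)
change = relative_mae_change(mae_model, mae_baseline)
print(f"baseline MAE={mae_baseline}, model MAE={mae_model}, change={change:.1%}")
```

A negative `change` corresponds to the day-ahead case described above, a positive one to the imbalance case.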

What carries the argument

Chronos-2 in ARX mode (autoregressive with exogenous inputs), in which the foundation model receives exogenous covariates alongside the price history to improve accuracy on electricity price series.
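To make the term concrete, here is a minimal sketch of a classical linear ARX model fitted by least squares. It is not how Chronos-2 consumes covariates internally; it only illustrates the information set that "ARX" refers to, namely past values of the target plus an exogenous signal. The synthetic series, the coefficients, and the function names are assumptions made for this example.

```python
import numpy as np


def build_arx_design(y, x, n_lags):
    """Stack lagged targets and same-time exogenous values into a design matrix.

    Row t holds [1, y[t-1], ..., y[t-n_lags], x[t]] and is paired with y[t].
    """
    y = np.asarray(y, dtype=float)
    x = np.asarray(x, dtype=float).reshape(len(y), -1)
    rows = []
    for t in range(n_lags, len(y)):
        lags = y[t - n_lags:t][::-1]  # most recent lag first
        rows.append(np.concatenate(([1.0], lags, x[t])))
    return np.array(rows), y[n_lags:]


def fit_arx(y, x, n_lags):
    """Ordinary least squares fit of the linear ARX coefficients."""
    design, target = build_arx_design(y, x, n_lags)
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    return coef


# Synthetic, noiseless series: y[t] = 5 + 0.6 * y[t-1] + 2 * x[t]
rng = np.random.default_rng(0)
x = rng.normal(size=200)  # stand-in for an exogenous driver such as a load forecast
y = np.empty(200)
y[0] = 10.0
for t in range(1, 200):
    y[t] = 5.0 + 0.6 * y[t - 1] + 2.0 * x[t]

design, target = build_arx_design(y, x, n_lags=1)
coef = fit_arx(y, x, n_lags=1)
print(coef)  # intercept, lag-1 coefficient, exogenous coefficient
```

Because the synthetic series is noiseless, the fit recovers the generating coefficients; real price series would of course not behave this cleanly.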

If this is right

  • TSFMs can deliver competitive accuracy for day-ahead electricity prices with minimal task-specific training.
  • The performance edge appears stronger in the less volatile day-ahead market than in the more volatile imbalance market.
  • Zero-shot capabilities of these models reduce the need for repeated retraining when market conditions evolve.
  • Extreme price spikes remain difficult for TSFMs, pointing toward potential benefits from hybrid forecasting setups.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Energy traders might integrate Chronos-2 ARX outputs into day-ahead bidding tools to reduce forecast-related losses.
  • Applying the same models to other European electricity markets could test whether the observed accuracy patterns hold under different regulations and generation mixes.
  • The larger error gap on imbalance prices indicates that high-frequency applications may still require domain-specific adjustments or ensemble weighting.
  • Combining foundation model outputs with statistical spike detectors could address the documented weakness during extreme events without losing zero-shot flexibility.
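As an illustration of the last point, a statistical spike detector could be as simple as a trailing median with a MAD-based threshold. The sketch below is one hypothetical choice, not a method from the paper; the window length, threshold, minimum scale, and price values are all assumptions.

```python
from statistics import median


def flag_spikes(prices, window=24, threshold=5.0, min_scale=1.0):
    """Flag prices far from the trailing median, measured in MAD units.

    min_scale (EUR/MWh) keeps a flat history from flagging tiny moves.
    """
    flags = []
    for t, price in enumerate(prices):
        history = prices[max(0, t - window):t]
        if len(history) < 3:
            flags.append(False)  # not enough history to judge
            continue
        center = median(history)
        mad = median([abs(p - center) for p in history])
        scale = max(mad, min_scale)
        flags.append(abs(price - center) > threshold * scale)
    return flags


# Hypothetical hourly prices in EUR/MWh with one obvious spike.
prices = [50.0, 51.0, 49.0, 50.0, 52.0, 48.0, 50.0, 400.0, 51.0]
flags = flag_spikes(prices)
print([i for i, is_spike in enumerate(flags) if is_spike])
```

Using the median and MAD rather than the mean and standard deviation means the spike itself barely shifts the reference level for the observations that follow it.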

Load-bearing premise

The best ensemble prediction from other machine learning methods constitutes a fair and strong baseline, and the ARX mode for Chronos-2 uses additional inputs and preprocessing comparable to those of the competing methods.

What would settle it

A new test period or different electricity market where Chronos-2 in ARX mode shows higher mean absolute error than the top machine learning ensemble on day-ahead price forecasts.

Figures

Figures reproduced from arXiv: 2605.17045 by Arnaud Verstraeten, Chi Bui, Hussain Kazmi, Maria Margarida Mascarenhas.

Figure 1
Figure 1. DAM: TSFM forecasts vs. actual prices. […] of the model on leveraging covariate inputs. When combining forecasts from Chronos-2 and the baseline data-driven models (LEAR and DNN), we were able to achieve negligibly lower MAE and RMSE of 12.29 EUR/MWh and 18.91 EUR/MWh, respectively. However, conducting the Diebold-Mariano test between the Chronos-2 & ML ensemble and the ML ensemble reveals that the improvement is…
Figure 2
Figure 2. IMB: TSFM forecasts vs. actual prices.
Figure 3
Figure 3. IMB: MAE of different models across all forecast…
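The text captured with Figure 1 mentions a Diebold-Mariano test between two forecasts. A minimal sketch of that test on absolute-error losses follows, assuming a normal approximation and a rectangular-kernel long-run variance with `horizon - 1` autocovariance lags. The paper's exact variant is not stated in the extracted text, and the data here are synthetic.

```python
import math

import numpy as np


def diebold_mariano(actual, forecast_a, forecast_b, horizon=1):
    """Return (statistic, two-sided p-value) for equal absolute-error accuracy.

    A positive statistic means forecast_a has the larger average loss.
    """
    actual = np.asarray(actual, dtype=float)
    loss_a = np.abs(actual - np.asarray(forecast_a, dtype=float))
    loss_b = np.abs(actual - np.asarray(forecast_b, dtype=float))
    d = loss_a - loss_b  # loss differential
    n = d.size
    centered = d - d.mean()
    # Long-run variance of d: variance plus autocovariances up to horizon - 1.
    lrv = centered @ centered / n
    for k in range(1, horizon):
        lrv += 2.0 * (centered[k:] @ centered[:-k]) / n
    if lrv <= 0.0:
        raise ValueError("non-positive long-run variance estimate")
    stat = d.mean() / math.sqrt(lrv / n)
    p_value = math.erfc(abs(stat) / math.sqrt(2.0))  # two-sided normal tail
    return stat, p_value


# Synthetic example: forecast_a is noisier than forecast_b by construction.
rng = np.random.default_rng(0)
actual = rng.normal(50.0, 20.0, size=500)
forecast_a = actual + rng.normal(0.0, 5.0, size=500)
forecast_b = actual + rng.normal(0.0, 1.0, size=500)

stat, p_value = diebold_mariano(actual, forecast_a, forecast_b)
stat_swapped, _ = diebold_mariano(actual, forecast_b, forecast_a)
print(stat, p_value)
```

Swapping the two forecasts flips the sign of the statistic and leaves the p-value unchanged, which is a quick sanity check on any implementation.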
Original abstract

Recent advances in Time Series Foundation Models (TSFMs) promise zero-shot forecasting capabilities with minimal task-specific training. While these models have shown strong performance across generic benchmarks, their applicability in volatile, complex electricity markets remains underexplored. Addressing this gap, this study provides a systematic empirical evaluation of several TSFMs, specifically Chronos-2 and Chronos-Bolt (developed by Amazon), and TimesFM 2.5 (provided by Google), for forecasting Belgian day-ahead and imbalance electricity prices. For both considered markets, Chronos-2 in ARX mode produces the most accurate forecasts. Compared with the best ensemble prediction from other machine learning methods, Chronos-2's Mean Absolute Error (MAE) is 5% lower for the day-ahead market. In contrast, the model yields 10% higher MAE predicting imbalance prices across all forecast horizons, except for the two-hour-ahead horizon. Moreover, we find that TSFMs exhibit genuine zero-shot forecasting skills but still struggle under extreme market conditions.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript empirically evaluates Time Series Foundation Models (Chronos-2, Chronos-Bolt, and TimesFM 2.5) for day-ahead and imbalance electricity price forecasting in Belgium. Its central claim is that Chronos-2 in ARX mode produces the most accurate forecasts, achieving 5% lower MAE than the best machine-learning ensemble for day-ahead prices while yielding 10% higher MAE for imbalance prices across all horizons except the two-hour-ahead horizon. The work also reports that TSFMs exhibit zero-shot skills but struggle under extreme conditions.

Significance. If the reported performance gaps are confirmed with matched inputs and statistical controls, the results would provide concrete evidence that foundation models can deliver competitive accuracy in high-volatility energy markets, extending their demonstrated utility beyond generic benchmarks and offering practical guidance for Belgian market participants.

major comments (2)
  1. [Abstract] Abstract: the headline claims of a 5% MAE reduction for day-ahead prices and a 10% increase for imbalance prices are presented without error bars, statistical significance tests, or explicit description of data splits and ARX feature construction, leaving the quantitative differences difficult to verify in a volatile price series.
  2. [Methods (ARX implementation)] Methods section describing ARX mode: the specific exogenous variables, normalization, and preprocessing steps supplied to Chronos-2 in ARX mode are not enumerated or directly compared with those used by the machine-learning ensemble baseline; without this information the 5% day-ahead advantage cannot be attributed unambiguously to the foundation model rather than to differences in the information set.
minor comments (2)
  1. [Abstract] The abstract would be clearer if it listed the full set of forecast horizons evaluated rather than only noting the two-hour-ahead exception.
  2. [Results] Tables reporting MAE values should include standard deviations or confidence intervals to allow readers to assess the practical significance of the reported percentage differences.
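One way to produce the confidence intervals requested in the second minor comment is a moving-block bootstrap over the absolute errors, which keeps short-range autocorrelation within each resampled block. The sketch below is an assumption about how this could be done, not a procedure from the paper; the block length of 24, the number of resamples, and the synthetic errors are illustrative.

```python
import math

import numpy as np


def block_bootstrap_mae_ci(abs_errors, block_len=24, n_boot=2000, alpha=0.05, seed=0):
    """Percentile confidence interval for the MAE via a moving-block bootstrap."""
    errors = np.asarray(abs_errors, dtype=float)
    n = errors.size
    if not 1 <= block_len <= n:
        raise ValueError("block_len must be between 1 and the series length")
    rng = np.random.default_rng(seed)
    n_blocks = math.ceil(n / block_len)
    offsets = np.arange(block_len)
    boot_maes = np.empty(n_boot)
    for b in range(n_boot):
        # Draw block start positions, expand to indices, trim to length n.
        starts = rng.integers(0, n - block_len + 1, size=n_blocks)
        idx = (starts[:, None] + offsets).ravel()[:n]
        boot_maes[b] = errors[idx].mean()
    lower, upper = np.quantile(boot_maes, [alpha / 2.0, 1.0 - alpha / 2.0])
    return errors.mean(), lower, upper


# Synthetic absolute errors in EUR/MWh (ten days of hourly values).
rng = np.random.default_rng(1)
abs_errors = rng.gamma(shape=2.0, scale=6.0, size=240)
point, lower, upper = block_bootstrap_mae_ci(abs_errors)
print(f"MAE={point:.2f}, 95% CI=({lower:.2f}, {upper:.2f})")

# Degenerate check: constant errors give a zero-width interval.
const_point, const_lower, const_upper = block_bootstrap_mae_ci(np.full(48, 3.0))
```

Intervals for two competing models that overlap heavily would suggest that a reported percentage difference is within sampling noise.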

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments, which highlight important aspects for improving the verifiability of our results. We have revised the manuscript to address the concerns regarding statistical rigor and implementation details, as detailed in our point-by-point responses below.

  1. Referee: [Abstract] Abstract: the headline claims of a 5% MAE reduction for day-ahead prices and a 10% increase for imbalance prices are presented without error bars, statistical significance tests, or explicit description of data splits and ARX feature construction, leaving the quantitative differences difficult to verify in a volatile price series.

    Authors: We agree that the abstract would benefit from additional context to support the reported performance differences. In the revised manuscript, we now include error bars (standard deviation of MAE computed over five rolling test windows) and report the results of paired statistical tests (Wilcoxon signed-rank test, p < 0.05 for the day-ahead MAE reduction). The data split is explicitly described as a chronological 70/15/15 train/validation/test partition with walk-forward evaluation to avoid leakage; this is now summarized in the abstract with a pointer to Section 3.2. ARX feature construction is also briefly noted in the abstract as using the identical exogenous set as the ML ensemble (load, wind, solar, and calendar features), with full enumeration moved to the methods. revision: yes

  2. Referee: [Methods (ARX implementation)] Methods section describing ARX mode: the specific exogenous variables, normalization, and preprocessing steps supplied to Chronos-2 in ARX mode are not enumerated or directly compared with those used by the machine-learning ensemble baseline; without this information the 5% day-ahead advantage cannot be attributed unambiguously to the foundation model rather than to differences in the information set.

    Authors: We acknowledge the need for greater transparency here. The revised methods section now explicitly enumerates the exogenous variables supplied to Chronos-2 in ARX mode: day-ahead load forecast, wind generation forecast, solar generation forecast, and temperature forecast, all obtained from the same ENTSO-E and Belgian TSO sources used by the ML ensemble. Normalization employs z-score standardization computed solely on the training portion of each rolling window. Preprocessing consists of linear interpolation for missing values and clipping of outliers beyond the 1st and 99th percentiles. A new comparison table (Table 2) has been added to demonstrate that the input feature set and preprocessing pipeline are identical between Chronos-2 ARX and the ML baselines, thereby supporting attribution of the observed MAE difference to the model itself rather than to unequal information. revision: yes
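The preprocessing described in the second response (linear interpolation, percentile clipping, z-scoring with training-only statistics) can be sketched as follows. The order of the steps and the choice to take the clipping thresholds from the training portion are assumptions, since the rebuttal text specifies training-only statistics only for the normalization. Function names and data are hypothetical.

```python
import numpy as np


def interpolate_missing(values):
    """Fill NaNs by linear interpolation; edge NaNs take the nearest known value."""
    values = np.asarray(values, dtype=float)
    idx = np.arange(values.size)
    known = ~np.isnan(values)
    return np.interp(idx, idx[known], values[known])


def preprocess_window(train, test, lower_pct=1.0, upper_pct=99.0):
    """Interpolate, clip, and z-score one rolling window without test leakage."""
    train = interpolate_missing(train)
    test = interpolate_missing(test)
    # Assumption: thresholds come from the training portion only.
    low, high = np.percentile(train, [lower_pct, upper_pct])
    train_clipped = np.clip(train, low, high)
    test_clipped = np.clip(test, low, high)
    # Standardize with statistics from the training portion of the window.
    mean, std = train_clipped.mean(), train_clipped.std()
    return (train_clipped - mean) / std, (test_clipped - mean) / std


# Hypothetical covariate values with a gap and an outlier.
train = np.array([10.0, 12.0, np.nan, 16.0, 14.0, 500.0, 11.0, 13.0, 15.0, 12.0])
test = np.array([13.0, np.nan, 900.0])
train_z, test_z = preprocess_window(train, test)
print(train_z.round(2), test_z.round(2))
```

Applying one such function to the inputs of both Chronos-2 in ARX mode and the baselines is one way to keep the information sets identical, which is the property the referee asks the authors to demonstrate.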

Circularity Check

0 steps flagged

No circularity in empirical evaluation of TSFMs

This paper is a purely empirical study that evaluates the forecasting accuracy of several time series foundation models (Chronos-2, Chronos-Bolt, TimesFM) on Belgian day-ahead and imbalance electricity prices. The central claims consist of direct numerical comparisons of MAE values between Chronos-2 in ARX mode and an ensemble of other machine-learning methods. No derivation chain, first-principles argument, or mathematical reduction exists in the provided text; therefore none of the enumerated circularity patterns (self-definitional, fitted-input-called-prediction, self-citation load-bearing, etc.) can be exhibited by quoting equations or definitions that collapse into their own inputs. The results rest on observable forecast errors against external baselines and are therefore self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

This is a purely empirical benchmarking study. No new theoretical parameters or invented entities are introduced, and the only axiom is a domain assumption about the evaluation metric. Results rest on standard forecasting evaluation practices and the representativeness of the Belgian market dataset.

axioms (1)
  • Domain assumption: Mean Absolute Error is an appropriate primary metric for comparing forecast accuracy across day-ahead and imbalance electricity prices.
    This is a common choice in the energy forecasting literature, but it can be sensitive to outliers.


