Evaluation of medium range machine learning models for sub-seasonal prediction

Catherine de Burgh-Day; Chen Li; Debra Hudson; Griffith Young; Harrison Cook; Li Shi; Robin Wedd

arXiv: 2606.25417 · v1 · pith:YLWBLDLAnew · submitted 2026-06-24 · physics.ao-ph

Pith reviewed 2026-06-25 20:10 UTC · model grok-4.3

classification: physics.ao-ph

keywords: machine learning weather models · sub-seasonal prediction · GraphCast · FourCastNetV2 · Madden-Julian Oscillation · Southern Annular Mode · hindcast evaluation · ensemble comparison

The pith

Machine learning models built for medium-range forecasts show skill on sub-seasonal timescales comparable to physics-based ensembles.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper evaluates two machine learning atmosphere models, GraphCast and FourCastNetV2, on their ability to make sub-seasonal predictions and to represent the Madden-Julian Oscillation and Southern Annular Mode. It applies a dual evaluation over a 38-year hindcast that overlaps the models' training data and a shorter 2.5-year independent hindcast. The ML models are scored against the Bureau of Meteorology's ACCESS-S2 physics-based ensemble and a more recent coupled model. The central result is that the ML models reach skill levels matching the physics ensemble mean at shorter sub-seasonal leads and individual ensemble members at longer leads. A reader would care because sub-seasonal forecasts support decisions in agriculture, energy, and risk management, so cheaper ML alternatives could widen access if the skill holds.

Core claim

Across the two evaluation periods, both ML models have surprisingly good skill for sub-seasonal timescales, given they were designed for forecasting on medium range timescales. In general, the ML models are as skilful as the physical model ensemble mean at shorter lead times and comparable to the physical model ensemble members at longer lead times.

What carries the argument

Dual hindcast evaluation that pairs a long overlapping period with a short independent period to compare ML deterministic and probabilistic skill against physics ensemble means and members.
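
The dual evaluation scores the same metric over a long and a short period. A minimal sketch of that idea follows, assuming a centred anomaly correlation against a fixed climatology. The paper's exact metric definition is not given in this summary, and the function name, the synthetic data, and the 30-case tail are all hypothetical stand-ins.

```python
import numpy as np

def anomaly_correlation(forecast, observed, climatology):
    """Centred anomaly correlation of two series against a shared climatology."""
    f_anom = np.asarray(forecast, dtype=float) - climatology
    o_anom = np.asarray(observed, dtype=float) - climatology
    # Remove the sample-mean anomaly so the score is a true correlation
    f_anom = f_anom - f_anom.mean()
    o_anom = o_anom - o_anom.mean()
    denom = np.sqrt(np.sum(f_anom**2) * np.sum(o_anom**2))
    return float(np.sum(f_anom * o_anom) / denom)

# Synthetic stand-ins (hypothetical): "truth" plus an independent forecast error
rng = np.random.default_rng(42)
climatology = 15.0
observed = climatology + rng.normal(0.0, 2.0, size=200)
forecast = observed + rng.normal(0.0, 1.0, size=200)

# Score the full sample and a short held-out tail separately
acc_long = anomaly_correlation(forecast, observed, climatology)
acc_short = anomaly_correlation(forecast[-30:], observed[-30:], climatology)
print(f"long-period ACC: {acc_long:.2f}, short-period ACC: {acc_short:.2f}")
```

With a short tail the score moves around more from sample to sample, which is the trade-off the dual design accepts in exchange for independence from the training data.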

If this is right

The ML models capture the MJO and SAM with skill levels similar to the physics models.
The overlapping-plus-independent period method provides a workable compromise for skill assessment when fully independent data are scarce.
At longer leads the ML models behave like individual ensemble members rather than the ensemble mean.
Operational sub-seasonal systems could add ML components at selected leads without immediate loss of overall skill.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the result holds on additional independent data, ML models could lower the cost of running large sub-seasonal ensembles.
The finding suggests medium-range ML training may transfer to longer leads without retraining from scratch.
The same models could be tested for seasonal prediction or additional climate drivers as a direct next step.
Hybrid ML-physics systems for sub-seasonal forecasting become a natural follow-on question.

Load-bearing premise

That skill measured partly on periods overlapping the models' training data still indicates real forecasting ability rather than memorization of seen cases.

What would settle it

A fresh multi-year independent hindcast set in which the ML models produce skill scores clearly below the physics ensemble mean at all sub-seasonal lead times would falsify the central claim.

Figures

Figures reproduced from arXiv: 2606.25417 by Catherine de Burgh-Day, Chen Li, Debra Hudson, Griffith Young, Harrison Cook, Li Shi, Robin Wedd.

**Figure 2.** The upper-row figures illustrate the anomaly correlation skills of weekly temperature at 1.5m at lead time week 1 across the 38-year hindcast period for: a single ACCESS-S2 ensemble member (a), GraphCast (b), and FourCastNetV2 (c). The bottom-row figures (d–f) represent the same variables as the upper row but at lead time week 3. These correlation skills are computed based on weekly anomalies relative to e…

**Figure 3.** As in Figure 2 except for precipitation.

**Figure 4.** The phase diagram (top row) for two MJO events, beginning from the 1st of March 2012 (left column) and …

**Figure 5.** MJO anomaly correlation (left) and RMSE (right) skill for all months in all years from 1981 to 2018 for …

**Figure 6.** Average MJO amplitude with lead time for all months in all years from 1981 to 2018 for ACCESS-S2 ensemble members (dotted blue), GraphCast (dashed yellow) and FourCastNetV2 (dashed red). The solid blue line shows the average of the amplitudes of the ACCESS-S2 ensemble members, not the amplitude of the ACCESS-S2 ensemble mean. The black line is the average MJO amplitude for ERA5 for all months in all years …

**Figure 7.** Weekly-average 850 hPa annual mean wind biases relative to ERA5 across the 38-year hindcast period for (a) GraphCast, (b) FourCastNetV2, (c) one ACCESS-S2 ensemble member, and (d) the ACCESS-S2 ensemble mean, at lead week 3. The biases are calculated by computing weekly means for each lead time, followed by annual means across model runs, and then subtracting from the equivalent from ERA5, before averaging…

**Figure 8.** Left column: Phase diagrams for all MJO events in the 38 …

**Figure 9.** Rainfall composite maps for MJO events in NDJFM across the 38 …

**Figure 10.** SAM patterns, here defined as the first EOF of the weekly mean …

**Figure 11.** Daily SAM index anomaly correlation (left) and RMSE (right) skill for all months in all years from 1981 to …

**Figure 12.** Globally averaged correlation skills of (a) 10m u …

**Figure 13.** The upper-row figures illustrate the anomaly correlation skills of weekly temperature at 1.5m at lead time week 1 across the 2.5-year hindcast period for: a single GC5 ensemble member (a), GraphCast (b), and FourCastNetV2 (c). The bottom-row figures (d–f) represent the same variables as the upper row but at lead time week 3. All anomalies are defined relative to the ERA5 climatology for the period 1981-20…

**Figure 14.** As in Figure 13 except for precipitation.

**Figure 15.** The phase diagram (top row) for two MJO events, beginning from the 1st of March 2023 (left column) …

**Figure 16.** MJO anomaly correlation (left) and RMSE (right) skill for all months in all years from Jan 2022 to Jun 2024 …

**Figure 17.** Weekly-average 850 hPa annual mean wind biases relative to ERA5 across the 2.5-year hindcast period for (a) GraphCast, (b) FourCastNetV2, (c) one GC5 ensemble member, and (d) the GC5 ensemble mean, at lead week 3. The biases are calculated by computing weekly means along lead time, followed by annual means across model runs, and then subtracting from the equivalent from ERA5, before averaging the differen…

**Figure 18.** Left Column: Phase diagrams for all MJO events in the 2.5 …

**Figure 19.** Daily SAM index correlation (left) and RMSE (right) skill for all months in all years from Jan 2022 to Jun …
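
The Figure 6 caption distinguishes the average of the member amplitudes from the amplitude of the ensemble mean. A small sketch of why the two differ is given here, assuming the MJO amplitude is the length of a two-component index vector (named `rmm1` and `rmm2` below). That definition, the nine-member ensemble, and the synthetic values are assumptions for illustration, not details taken from the paper.

```python
import numpy as np

def mean_of_member_amplitudes(rmm1, rmm2):
    """Average the per-member amplitudes (the quantity the caption describes)."""
    return float(np.mean(np.hypot(rmm1, rmm2)))

def amplitude_of_ensemble_mean(rmm1, rmm2):
    """Amplitude of the ensemble-mean index pair."""
    return float(np.hypot(np.mean(rmm1), np.mean(rmm2)))

# Hypothetical ensemble: every member has amplitude 1.5 but a different phase
rng = np.random.default_rng(0)
n_members = 9
phase = rng.uniform(0.0, 2.0 * np.pi, size=n_members)
rmm1 = 1.5 * np.cos(phase)
rmm2 = 1.5 * np.sin(phase)

avg_amp = mean_of_member_amplitudes(rmm1, rmm2)
mean_amp = amplitude_of_ensemble_mean(rmm1, rmm2)
print(f"mean of amplitudes: {avg_amp:.2f}, amplitude of mean: {mean_amp:.2f}")
```

Averaging members whose phases disagree cancels part of the signal, so the amplitude of the ensemble mean can never exceed the mean of the amplitudes. Comparing single deterministic ML runs against the latter is therefore the like-for-like choice.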

Original abstract

The performance of two machine learning (ML) atmosphere models - GraphCast and FourCastNetV2 - is evaluated in the context of sub-seasonal prediction, including their ability to represent key climate drivers of variability, namely the Madden-Julian Oscillation and the Southern Annular Mode. Model skill is assessed over both a 38-year hindcast period and a 2.5-year hindcast period. The longer period overlaps with the training windows of the ML models but provides a larger sample for robust evaluation, while the shorter period is independent of the ML model training period. This dual evaluation illustrates a compromise approach to the problem of insufficient independent data for evaluation of the models for sub-seasonal prediction. The ML models are compared against the Bureau of Meteorology's physics-based seasonal prediction system, ACCESS-S2, for the 38-year period, and a more recent physics-based coupled model for the shorter hindcast period. Across the two evaluation periods, both ML models have surprisingly good skill for sub-seasonal timescales, given they were designed for forecasting on medium range timescales. In general, the ML models are as skilful as the physical model ensemble mean at shorter lead times and comparable to the physical model ensemble members at longer lead times.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

ML models show sub-seasonal skill comparable to physics ensembles, but overlapping hindcast data raises leakage issues that need checking.

The main thing to know is that this paper finds GraphCast and FourCastNetV2 have skill at sub-seasonal prediction that is comparable to the ACCESS-S2 ensemble, but the 38-year evaluation period overlaps the models' training data, which undercuts how much we can trust the result.

What the paper does is take two established ML weather models and test them on longer leads than they were built for, while also looking at how well they capture the MJO and Southern Annular Mode. They compare against the Bureau of Meteorology's physics-based system over two different hindcast lengths. The longer one gives more samples but overlaps training, and the shorter 2.5-year one is independent but small. This dual approach is a practical way to handle the lack of truly independent data for these models.

It does a decent job of being upfront about the compromise. The abstract notes the ML models were designed for medium-range forecasts yet show good performance at sub-seasonal scales, matching the ensemble mean at shorter leads and individual members at longer ones.

The soft spot is the data leakage risk in the main evaluation period. That leaves only the short independent window as a clean test, and it may not hold enough cases to support firm conclusions about skill at 2-6 week leads, especially given the natural variability of phenomena like the MJO. The concern cannot be resolved from the abstract alone, since it gives no breakdown of results by period. If the independent period shows much lower skill, the "surprisingly good" claim weakens.

This paper is for people working on sub-seasonal forecasting and those exploring ML for operational use. A reader interested in practical comparisons between ML and physics models would get value from the setup, even if the numbers need a closer look.

It deserves peer review because the question is timely and the evaluation strategy is described clearly enough for referees to assess the methods and ask for more details on the independent period results.

Referee Report

2 major / 1 minor

Summary. The manuscript evaluates GraphCast and FourCastNetV2 for sub-seasonal prediction skill, including their representation of the MJO and SAM. It employs a dual hindcast strategy: a 38-year period overlapping the ML training windows and a 2.5-year independent period. The ML models are compared to ACCESS-S2 (38-year) and a more recent coupled physics model (2.5-year). The central claim is that both ML models exhibit surprisingly good sub-seasonal skill, performing comparably to the physical ensemble mean at shorter leads and to individual ensemble members at longer leads.

Significance. If the skill comparisons hold after addressing overlap and sample-size concerns, the result would indicate that medium-range ML models can usefully extend to sub-seasonal regimes without retraining, which is relevant for operational forecasting systems seeking to augment or replace physics-based ensembles.

major comments (2)

[Abstract] The headline claim that the ML models show 'surprisingly good skill' and are 'as skilful as the physical model ensemble mean at shorter lead times' rests on the 38-year hindcast, yet this period overlaps the GraphCast/FourCastNetV2 training windows. Without explicit verification that the reported metrics (e.g., anomaly correlation or RMSE at 2–6 week leads) remain unchanged when restricted to the independent 2.5-year window, the comparison risks reflecting training-data leakage rather than generalization to sub-seasonal regimes.
[Abstract] Abstract and methods (dual-evaluation description): The 2.5-year independent period supplies only ~130 weekly cases. At sub-seasonal leads this yields limited degrees of freedom for distinguishing skill from sampling variability, especially when comparing against ensemble members whose spread is itself a source of uncertainty. No power analysis or bootstrap confidence intervals on the skill differences are mentioned, weakening support for the 'comparable to ensemble members at longer lead times' statement.
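
To give a rough sense of what ~130 cases means for a correlation score, the sketch below applies the standard Fisher z-transform interval. It assumes the cases are independent, which weekly hindcast starts probably are not, so the real uncertainty is likely larger. The correlation of 0.5 and the comparison sample of 1000 are arbitrary illustrative values, not results from the paper.

```python
import math

def fisher_interval(r, n_cases, z_crit=1.96):
    """Approximate 95% interval for a correlation via the Fisher z-transform."""
    z = math.atanh(r)
    se = 1.0 / math.sqrt(n_cases - 3)  # standard error in z-space, independent cases
    return math.tanh(z - z_crit * se), math.tanh(z + z_crit * se)

r_hypothetical = 0.5  # illustrative skill value only
short_lo, short_hi = fisher_interval(r_hypothetical, 130)
large_lo, large_hi = fisher_interval(r_hypothetical, 1000)

print(f"n=130:  [{short_lo:.2f}, {short_hi:.2f}]")
print(f"n=1000: [{large_lo:.2f}, {large_hi:.2f}]")
```

The interval at 130 cases is several times wider than at 1000, which is the practical content of the referee's point about limited degrees of freedom.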

minor comments (1)

[Abstract] The abstract refers to 'a more recent physics-based coupled model' for the shorter period without naming it; it should be named in the abstract or the first paragraph of the introduction.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed comments. We address each major comment below and describe the revisions we will make to strengthen the statistical rigor and clarity of the evaluation.

Referee: [Abstract] The headline claim that the ML models show 'surprisingly good skill' and are 'as skilful as the physical model ensemble mean at shorter lead times' rests on the 38-year hindcast, yet this period overlaps the GraphCast/FourCastNetV2 training windows. Without explicit verification that the reported metrics (e.g., anomaly correlation or RMSE at 2–6 week leads) remain unchanged when restricted to the independent 2.5-year window, the comparison risks reflecting training-data leakage rather than generalization to sub-seasonal regimes.

Authors: We agree that the overlap between the 38-year hindcast and the ML training windows is a legitimate concern for interpreting generalization. The manuscript already introduces the dual-evaluation design specifically to address this, stating that the 2.5-year period is independent and that results hold 'across the two evaluation periods.' However, the referee is correct that the abstract does not explicitly demonstrate that the headline skill comparisons are reproduced in the independent window. We will therefore revise the abstract to qualify the claims more precisely and add a new results subsection (with accompanying figure) that reports anomaly correlation and RMSE for the 2–6 week leads restricted to the 2.5-year independent period, allowing direct comparison with the 38-year results.

revision: yes

Referee: [Abstract] Abstract and methods (dual-evaluation description): The 2.5-year independent period supplies only ~130 weekly cases. At sub-seasonal leads this yields limited degrees of freedom for distinguishing skill from sampling variability, especially when comparing against ensemble members whose spread is itself a source of uncertainty. No power analysis or bootstrap confidence intervals on the skill differences are mentioned, weakening support for the 'comparable to ensemble members at longer lead times' statement.

Authors: We concur that ~130 cases provide limited statistical power at sub-seasonal leads and that ensemble spread adds further uncertainty to the comparisons. The manuscript presents the 2.5-year period as an independent check despite its small size, but we did not quantify sampling uncertainty. In revision we will add bootstrap confidence intervals (resampling over the 130 cases) to all skill-score differences shown for the 2.5-year period. We will also expand the discussion to note the additional uncertainty arising from ensemble spread. A formal a-priori power analysis is difficult without pre-specified effect sizes, but the bootstrap intervals will allow readers to assess the robustness of the 'comparable to ensemble members' statements.

revision: yes
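
The bootstrap described in this response could look roughly like the sketch below. It assumes a paired resampling of forecast cases with replacement and a percentile interval on the difference in correlation skill. The function names, the synthetic data, and the noise levels are hypothetical, and plain case resampling ignores serial correlation between neighbouring start dates, so a block bootstrap may be the safer choice in practice.

```python
import numpy as np

def correlation(a, b):
    """Pearson correlation between two 1-D arrays."""
    return float(np.corrcoef(a, b)[0, 1])

def bootstrap_skill_difference(obs, fc_a, fc_b, n_boot=2000, seed=0):
    """Percentile 95% interval for corr(fc_a, obs) - corr(fc_b, obs)."""
    obs, fc_a, fc_b = (np.asarray(x, dtype=float) for x in (obs, fc_a, fc_b))
    rng = np.random.default_rng(seed)
    n_cases = len(obs)
    diffs = np.empty(n_boot)
    for i in range(n_boot):
        # Resample whole cases so both forecasts share the same verification set
        idx = rng.integers(0, n_cases, size=n_cases)
        diffs[i] = correlation(fc_a[idx], obs[idx]) - correlation(fc_b[idx], obs[idx])
    lo, hi = np.percentile(diffs, [2.5, 97.5])
    return float(lo), float(hi)

# Synthetic stand-ins (hypothetical): 130 cases, one sharp and one noisy forecast
rng = np.random.default_rng(1)
obs = rng.normal(size=130)
fc_sharp = obs + rng.normal(scale=0.3, size=130)
fc_noisy = obs + rng.normal(scale=2.0, size=130)

lo, hi = bootstrap_skill_difference(obs, fc_sharp, fc_noisy)
print(f"95% interval on skill difference: [{lo:.2f}, {hi:.2f}]")
```

If the interval excludes zero, the skill difference is unlikely to be a sampling artefact under these assumptions; an interval straddling zero is what a "comparable to ensemble members" statement would need to show.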

Circularity Check

0 steps flagged

No circularity: purely empirical model evaluation against external benchmarks

The paper performs direct skill comparisons of GraphCast and FourCastNetV2 against ACCESS-S2 and another physics model over two hindcast periods. No derivations, equations, fitted parameters, or self-citation chains appear in the central claims. The dual-period evaluation is presented as an explicit compromise for data limitations rather than a derivation that reduces to its inputs. All reported skill metrics are computed from independent verification data against external physical-model references, satisfying the self-contained criterion.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The paper is an empirical evaluation study with no mathematical derivations, free parameters, axioms, or invented entities; all content rests on standard hindcast data and existing model outputs.
