AI-informed model-analogs for understanding subseasonal-to-seasonal jet stream and North American temperature predictability

Elizabeth A. Barnes; Jacob B. Landsberg; Matthew Newman

arxiv: 2506.14022 · v3 · submitted 2025-06-16 · ⚛️ physics.ao-ph · cs.LG

AI-informed model-analogs for understanding subseasonal-to-seasonal jet stream and North American temperature predictability

Jacob B. Landsberg , Matthew Newman , Elizabeth A. Barnes This is my paper

Pith reviewed 2026-05-19 08:53 UTC · model grok-4.3

classification ⚛️ physics.ao-ph cs.LG

keywords subseasonal-to-seasonal forecastinganalog methodsneural networksjet streamtemperature predictabilityNorth Americainterpretable AI

0 comments

The pith

AI-learned weights improve subseasonal forecasts of North American temperatures and jet stream winds over standard analog methods.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether a neural network can learn to select better analogs from past weather states for predicting conditions weeks to months ahead. It applies this to three tasks: classifying summer temperatures in Southern California, regressing Midwest U.S. summer temperatures, and classifying North Atlantic winter winds. The AI-weighted analogs beat conventional analog selection as well as climatology and persistence baselines on both climate-model output and reanalysis data, for both average skill and extreme-event prediction. The learned weights also let the authors inspect which past patterns carry predictability. A sympathetic reader would care because subseasonal forecasts affect agriculture, disaster preparedness, and health planning, yet skill remains low at these lead times.

Core claim

Training a neural network to output a mask of weights that optimizes analog selection produces higher deterministic and probabilistic forecast skill than traditional analog methods or simple baselines for subseasonal-to-seasonal temperature and wind predictions; the resulting ensembles also represent extremes and forecast uncertainty more accurately, and the masks themselves highlight sources of predictability.

What carries the argument

A neural network mask of weights that reweights past states to choose the most relevant analogs for a given forecast target.

If this is right

Analog ensembles built with the learned weights produce more accurate predictions of temperature extremes.
The same ensembles give a better representation of forecast uncertainty than traditional methods.
Inspecting the weight masks identifies physical sources of subseasonal-to-seasonal predictability.
The performance advantage appears for both classification and regression tasks on climate model and reanalysis data alike.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same weight-learning approach could be tested on other variables such as precipitation or on different regions to map predictability sources more broadly.
If the masks remain stable across decades, the method could serve as a diagnostic tool for how predictability changes under altered climate conditions.
Hybrid systems that combine these interpretable analogs with dynamical model output might further raise subseasonal skill.

Load-bearing premise

The weight mask learned on one set of training data will select useful analogs when applied to new time periods or different datasets without capturing spurious correlations.

What would settle it

Applying the learned masks to an independent future period or a different climate model and finding no gain in skill over standard analogs or baselines would falsify the claim of improved performance.

read the original abstract

Subseasonal-to-seasonal forecasting is crucial for public health, disaster preparedness, and agriculture, and yet it remains a particularly challenging timescale to predict. We explore the use of an interpretable AI-informed model analog forecasting approach, previously employed on longer timescales, to improve S2S predictions. Using an artificial neural network, we learn a mask of weights to optimize analog selection and showcase its versatility across three varied prediction tasks: 1) classification of Week 3-4 Southern California summer temperatures; 2) regional regression of Month 1 midwestern U.S. summer temperatures; and 3) classification of Month 1-2 North Atlantic wintertime upper atmospheric winds. The AI-informed analogs outperform traditional analog forecasting approaches, as well as climatology and persistence baselines, for deterministic and probabilistic skill metrics on both climate model and reanalysis data. We find the analog ensembles built using the AI-informed approach also produce better predictions of temperature extremes and improve representation of forecast uncertainty. Finally, by using an interpretable-AI framework, we analyze the learned masks of weights to better understand S2S sources of predictability.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

This applies an existing AI analog weighting trick to three S2S tasks but the abstract gives no numbers or validation details to judge whether the gains are real.

read the letter

The paper takes a neural-net mask for weighting analogs, previously used on longer timescales, and tests it on week 3-4 Southern California temperature classification, month 1 Midwest temperature regression, and month 1-2 North Atlantic jet classification. It reports better deterministic and probabilistic skill than plain analogs, climatology, and persistence on both climate model output and reanalysis, plus some improvement on extremes and uncertainty estimates. The interpretability step that looks at the learned masks is a reasonable addition for understanding sources of predictability.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces an interpretable AI-informed model-analog method for subseasonal-to-seasonal (S2S) forecasting. An artificial neural network learns a mask of weights to optimize analog selection from climate model or reanalysis data. This is demonstrated on three tasks: Week 3-4 Southern California summer temperature classification, Month 1 Midwest U.S. summer temperature regression, and Month 1-2 North Atlantic winter jet stream classification. The central claim is that the AI-informed analogs outperform traditional analog methods as well as climatology and persistence baselines on deterministic and probabilistic skill metrics, while also improving extreme event prediction and uncertainty representation; the learned masks are then interpreted to identify sources of S2S predictability.

Significance. If the performance gains are shown to arise from genuine predictability rather than overfitting, the approach would offer a useful bridge between data-driven analog forecasting and physical insight into S2S sources. The multi-task design and dual use of model and reanalysis data are strengths. The work builds on prior model-analog literature but would benefit from stronger quantitative grounding to establish its added value for the S2S community.

major comments (2)

[§3] §3 (Methods, neural-network mask training): The manuscript does not describe whether the weight mask is trained on time periods that are strictly non-overlapping with the verification windows for the three tasks, nor whether year-block or ensemble-member cross-validation is employed. Given the strong serial correlation and low-frequency variability typical of S2S fields, this detail is load-bearing for the claim that the mask generalizes and captures real predictability rather than spurious correlations.
[§4] §4 (Results, skill-score comparisons): The abstract and results state outperformance on deterministic and probabilistic metrics without supplying the actual numerical values, confidence intervals, or statistical significance tests relative to the traditional analog, climatology, and persistence baselines. This absence prevents quantitative assessment of the magnitude and robustness of the reported improvements for the Week 3-4 classification and Month 1 regression tasks.

minor comments (2)

[Figure 3] Figure 3 caption: The color scale for the learned weight masks should explicitly state the normalization range and whether positive/negative values correspond to enhanced or suppressed analog contributions.
[§2.3] §2.3: The precise form of the analog distance metric after multiplication by the learned mask is not written as an equation; adding this would improve reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and insightful comments, which have helped us improve the clarity and rigor of our manuscript. We address each major comment in detail below, providing additional methodological information and quantitative results where needed. Revisions have been made to strengthen the presentation of our AI-informed model-analog approach for S2S forecasting.

read point-by-point responses

Referee: §3 (Methods, neural-network mask training): The manuscript does not describe whether the weight mask is trained on time periods that are strictly non-overlapping with the verification windows for the three tasks, nor whether year-block or ensemble-member cross-validation is employed. Given the strong serial correlation and low-frequency variability typical of S2S fields, this detail is load-bearing for the claim that the mask generalizes and captures real predictability rather than spurious correlations.

Authors: We thank the referee for emphasizing this crucial aspect of experimental design. The original manuscript outlined the neural network architecture and loss function in Section 3 but did not explicitly state the temporal separation protocol. In the revised version, we have added a dedicated paragraph in the Methods section clarifying that mask training uses strictly non-overlapping historical periods (e.g., training on 1980–2010 data for verification on 2011–2020 windows) and employs year-block cross-validation to account for serial correlation and low-frequency variability. For the climate model ensemble, we further apply leave-one-ensemble-member-out validation. These choices ensure the learned weights reflect genuine predictability sources rather than spurious correlations, and we have included a brief justification referencing standard practices in S2S literature. revision: yes
Referee: §4 (Results, skill-score comparisons): The abstract and results state outperformance on deterministic and probabilistic metrics without supplying the actual numerical values, confidence intervals, or statistical significance tests relative to the traditional analog, climatology, and persistence baselines. This absence prevents quantitative assessment of the magnitude and robustness of the reported improvements for the Week 3-4 classification and Month 1 regression tasks.

Authors: We agree that explicit numerical values, confidence intervals, and significance tests would strengthen the quantitative assessment. In the revised manuscript, we have inserted a new table (Table 2) in Section 4 reporting specific skill scores—including accuracy and Brier scores for classification tasks, RMSE for regression—along with 95% bootstrap confidence intervals and p-values from paired statistical tests (e.g., Wilcoxon signed-rank for non-parametric comparison) against the traditional analog, climatology, and persistence baselines. The main text now references these values directly for the Week 3-4 and Month 1 tasks. Due to abstract length limits, we retained the qualitative statement of outperformance but added a sentence directing readers to the new table for quantitative details and robustness checks. revision: partial

Circularity Check

0 steps flagged

No significant circularity; derivation relies on empirical NN optimization and out-of-sample metrics

full rationale

The paper trains an artificial neural network on historical data to produce a weight mask that reweights analog selection for three S2S tasks, then evaluates deterministic and probabilistic skill on separate test periods in both climate model output and reanalysis. No equation reduces the reported skill gains to a fitted parameter by construction, no self-citation supplies a uniqueness theorem that forces the method, and the central claim (outperformance versus traditional analogs, climatology, and persistence) is presented as an empirical result rather than a definitional identity. The approach is therefore self-contained against external benchmarks once proper temporal hold-out is assumed.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

The approach rests on the domain assumption that historical analogs remain informative at S2S lead times and that the neural network can learn non-spurious weights; the learned mask constitutes a fitted component whose robustness is not detailed in the abstract.

free parameters (1)

neural network mask weights
Learned parameters that determine the importance of different analog features for the three prediction tasks.

axioms (1)

domain assumption Historical climate states serve as useful analogs for future states at subseasonal-to-seasonal timescales
Core premise of the analog forecasting framework applied here.

pith-pipeline@v0.9.0 · 5736 in / 1184 out tokens · 47219 ms · 2026-05-19T08:53:48.385248+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

Using an artificial neural network, we learn a mask of weights to optimize analog selection... The MSE between these two weighted maps is passed through a single linear scaling layer... Loss is computed as the MSE between the predicted difference of the targets and the true difference of the targets.
IndisputableMonolith/Foundation/ArithmeticFromLogic.lean LogicNat_induction unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

We employ a 7-day sliding window... tercile classification... ensemble agreement... discard plots

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.