
arXiv: 2507.18937 · v3 · submitted 2025-07-25 · ⚛️ physics.ao-ph · cs.AI · cs.LG · stat.ML

CNN-based Surface Temperature Forecasts with Ensemble Numerical Weather Prediction

Pith reviewed 2026-05-19 03:46 UTC · model grok-4.3

classification: ⚛️ physics.ao-ph · cs.AI · cs.LG · stat.ML
keywords: convolutional neural network, ensemble forecasting, numerical weather prediction, surface temperature, spatial downscaling, bias correction, medium-range forecast

The pith

Applying a CNN to each member of a low-resolution ensemble downscales surface temperature forecasts to 5 km resolution and improves both their deterministic and probabilistic skill.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that a convolutional neural network can be trained to correct biases in, and spatially downscale, each individual member of a 40 km ensemble numerical weather prediction run. This member-wise correction produces 5 km surface temperature forecasts out to 5.5 days (132 h) that are more accurate than the raw low-resolution output and, taken together as an ensemble, more reliable probabilistically. Unlike simple ensemble averaging, which reduces error by smoothing, the resulting high-resolution ensemble maintains spatial detail while reducing noise, offering a lower-cost alternative to running full high-resolution models. Operational centers with constrained computing power could adopt the approach by post-processing their existing ensemble output.

Core claim

CNN-based post-processing applied separately to each of the 51 ensemble members reduces systematic errors and performs spatial downscaling from 40 km to 5 km. This yields improved deterministic accuracy together with better probabilistic reliability and spread-skill ratio, through a mechanism that differs from the error reduction by smoothing that ensemble averaging provides.

What carries the argument

Member-wise CNN post-processing that performs bias correction and spatial downscaling on individual ensemble members before recombining them into a high-resolution ensemble forecast.
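The member-wise structure can be sketched as follows. This is a minimal illustration, not the paper's method: the trained CNN is replaced by a hypothetical placeholder (nearest-neighbour upsampling plus a constant bias field) that only shares the CNN's input and output shapes, and the names `postprocess_member`, `postprocess_ensemble`, and the synthetic temperatures are assumptions introduced here.

```python
import numpy as np

def postprocess_member(coarse, bias, factor=8):
    """Placeholder for the trained CNN (hypothetical, NOT the paper's network).

    Upsamples one coarse member by `factor` in each direction (40 km -> 5 km
    corresponds to a factor of 8) and subtracts an assumed bias field.
    """
    fine = np.kron(coarse, np.ones((factor, factor)))  # nearest-neighbour upsampling
    return fine - bias

def postprocess_ensemble(ensemble, bias, factor=8):
    """Correct every member independently, then restack into a fine ensemble."""
    return np.stack([postprocess_member(m, bias, factor) for m in ensemble])

rng = np.random.default_rng(0)
n_members, ny, nx = 51, 4, 5                      # 51 members on a tiny coarse grid
coarse_ens = 280.0 + rng.normal(0.0, 1.5, size=(n_members, ny, nx))  # synthetic, in K
bias = np.full((ny * 8, nx * 8), 0.7)             # assumed systematic warm bias, in K

fine_ens = postprocess_ensemble(coarse_ens, bias)
print(fine_ens.shape)  # (51, 32, 40)
```

Because each member is corrected on its own, the output is still a 51-member ensemble, so spread and probabilistic scores can be computed on the fine grid rather than on a single averaged field.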

If this is right

  • Deterministic forecast accuracy improves through bias correction and downscaling on each member.
  • Probabilistic reliability and spread-skill ratio improve in a manner distinct from the error reduction of ensemble averaging.
  • Forecast information is maintained at levels comparable to other high-resolution forecasts rather than being smoothed away.
  • The method supplies a practical, scalable route to better medium-range temperature predictions for centers with limited computational resources.
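For readers unfamiliar with the spread-skill ratio mentioned above, here is a small sketch using one common definition: the root-mean ensemble variance divided by the RMSE of the ensemble mean. The paper's exact definition is not given on this page, so the formula, the function name `spread_skill_ratio`, and the synthetic data are all assumptions.

```python
import numpy as np

def spread_skill_ratio(ensemble, truth):
    """Spread-skill ratio under one common definition (assumed, not from the paper).

    ensemble: array of shape (members, cases); truth: array of shape (cases,).
    A value near 1 indicates spread consistent with the ensemble-mean error;
    a value below 1 indicates an underdispersive ensemble.
    """
    spread = np.sqrt(np.mean(np.var(ensemble, axis=0, ddof=1)))
    rmse = np.sqrt(np.mean((ensemble.mean(axis=0) - truth) ** 2))
    return spread / rmse

rng = np.random.default_rng(1)
n_members, n_cases = 51, 20000
center = rng.normal(0.0, 3.0, size=n_cases)             # synthetic predictable signal
truth = center + rng.normal(0.0, 1.0, size=n_cases)     # truth scatters with std 1

# Members drawn with the same scatter as the truth: ratio close to 1.
calibrated = center + rng.normal(0.0, 1.0, size=(n_members, n_cases))
# Members drawn with half the scatter: ratio close to 0.5.
underdispersive = center + rng.normal(0.0, 0.5, size=(n_members, n_cases))

ssr_cal = spread_skill_ratio(calibrated, truth)
ssr_under = spread_skill_ratio(underdispersive, truth)
print(round(ssr_cal, 2), round(ssr_under, 2))
```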

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same member-wise correction could be tested on other surface variables or vertical levels to check whether the CNN learns general error patterns.
  • Periodic retraining on recent model output would be needed if the underlying NWP model physics or resolution changes.
  • The approach might combine with existing high-resolution limited-area models to blend global ensemble information with local detail.

Load-bearing premise

The CNN trained on historical low-resolution forecasts and verifying analyses will generalize to future independent forecasts without significant degradation from changing model versions or unrepresented error regimes.

What would settle it

Running the trained CNN on forecasts from a new version of the underlying NWP model and finding no skill gain or outright degradation relative to the uncorrected low-resolution ensemble.

Figures

Figures reproduced from arXiv: 2507.18937 by Takuya Inoue, Takuya Kawabata (Meteorological Research Institute, Tsukuba, Japan).

Figure 1: Schematic of the CNN (reprinted from our previous work
Original abstract

Due to limited computational resources, medium-range temperature forecasts typically rely on low-resolution numerical weather prediction (NWP) models, which are prone to systematic and random errors. We propose a method that integrates a convolutional neural network (CNN) with an ensemble of low-resolution NWP models (40-km horizontal resolution) to produce high-resolution (5-km) surface temperature forecasts with lead times extending up to 5.5 days (132 h). First, CNN-based post-processing (bias correction and spatial downscaling) is applied to individual ensemble members to reduce systematic errors and perform downscaling, which improves the deterministic forecast accuracy. Second, this member-wise correction is applied to all 51 ensemble members to construct a new high-resolution ensemble forecasting system with an improved probabilistic reliability and spread-skill ratio that differs from the simple error reduction mechanism of ensemble averaging. Whereas averaging reduces forecast errors by smoothing spatial fields, our member-wise CNN correction reduces error from noise while maintaining forecast information at a level comparable to that of other high-resolution forecasts. Experimental results indicate that the proposed method provides a practical and scalable solution for improving medium-range temperature forecasts, which is particularly valuable for use in operational centers with limited computational resources.
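The contrast the abstract draws between averaging and member-wise correction can be illustrated with a textbook idealization that is not taken from the paper. Assume the truth $y$ and the $N$ members $x_i$ are independent draws with a common mean $\mu$ and variance $\sigma^2$ (a perfectly calibrated, unbiased ensemble). Then:

```latex
\begin{aligned}
\bar{x} &= \frac{1}{N}\sum_{i=1}^{N} x_i, \\
\mathbb{E}\big[(x_i - y)^2\big] &= 2\sigma^2, \\
\mathbb{E}\big[(\bar{x} - y)^2\big] &= \Big(1 + \frac{1}{N}\Big)\sigma^2, \\
\operatorname{Var}(\bar{x}) &= \frac{\sigma^2}{N}.
\end{aligned}
```

Under these assumptions, averaging nearly halves the mean squared error for $N = 51$, but the averaged field carries only $1/N$ of the variance of a single realistic state. That loss of variance is the smoothing the abstract refers to; a member-wise correction instead aims to lower each member's error while keeping every corrected member a plausible high-resolution field.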

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes integrating a convolutional neural network (CNN) with a 51-member low-resolution (40 km) ensemble NWP system to generate high-resolution (5 km) surface temperature forecasts out to 132 h lead time. Member-wise CNN post-processing performs bias correction and spatial downscaling on each ensemble member; the corrected members are then used to form a new ensemble whose deterministic accuracy, probabilistic reliability, and spread-skill ratio are claimed to improve on the raw ensemble through a mechanism distinct from the smoothing of simple ensemble averaging, while retaining forecast information at a level comparable to other high-resolution forecasts. The method is presented as a computationally lightweight, scalable solution for operational centers lacking resources for high-resolution NWP.

Significance. If the reported gains in accuracy and reliability hold on truly independent future forecasts, the approach would offer a practical route to improved medium-range temperature guidance without requiring additional high-resolution model integrations. The distinction drawn between error reduction via member-wise correction and error reduction via the smoothing of averaging is conceptually useful for the ensemble post-processing literature.

major comments (2)
  1. [Abstract] The central claims of improved deterministic accuracy, probabilistic reliability, and spread-skill ratio are asserted without any quantitative metrics, verification periods, baseline comparisons, or uncertainty estimates. Because these numbers are load-bearing for the practical-and-scalable-solution conclusion, their absence prevents assessment of effect size or statistical significance.
  2. [Results / Discussion] The claim that the CNN learns invariant physical relationships rather than transient model-specific biases rests on the untested assumption that training and test periods are separated by model upgrades or regime shifts. No cross-validation across different model versions, seasons, or climate states is described, directly undermining the operational generalization argument.
minor comments (2)
  1. [Methods] The CNN architecture (number of layers, filter sizes, activation functions) should be stated explicitly in the Methods section rather than left to supplementary material.
  2. [Figures] Figure captions should include the exact verification period, number of cases, and baseline models used for each panel to allow immediate comparison with the text claims.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on our manuscript. We address each major comment in turn and indicate where revisions will be made.

Point-by-point responses
  1. Referee: [Abstract] The central claims of improved deterministic accuracy, probabilistic reliability, and spread-skill ratio are asserted without any quantitative metrics, verification periods, baseline comparisons, or uncertainty estimates. Because these numbers are load-bearing for the practical-and-scalable-solution conclusion, their absence prevents assessment of effect size or statistical significance.

    Authors: We agree that the abstract would be strengthened by the inclusion of key quantitative results. The body of the manuscript reports specific verification metrics (including RMSE reductions, CRPS improvements, reliability diagram scores, and spread-skill ratios) over a multi-month independent test period, with comparisons to the raw 40 km ensemble and other high-resolution references. In the revised manuscript we will add concise quantitative statements and the verification period to the abstract so that the magnitude of the reported gains is immediately apparent.
    revision: yes

  2. Referee: [Results / Discussion] The claim that the CNN learns invariant physical relationships rather than transient model-specific biases rests on the untested assumption that training and test periods are separated by model upgrades or regime shifts. No cross-validation across different model versions, seasons, or climate states is described, directly undermining the operational generalization argument.

    Authors: The training and test periods in our experiments are temporally disjoint, with the test window occurring after the training data to emulate operational use. However, we did not conduct explicit cross-validation across ECMWF model cycles or additional climate regimes. We will revise the Methods and Discussion sections to state the exact dates of the training and test periods, note any known model changes within that interval, and explicitly acknowledge the limitation on broader generalization. If space permits, we will also add a brief sensitivity experiment using an alternate seasonal split.
    revision: partial
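Since the first response leans on CRPS, a short sketch of the standard ensemble CRPS estimator may help: the mean absolute error of the members minus half the mean absolute difference between member pairs. Whether the paper uses exactly this estimator is not stated on this page; the function name `crps_ensemble` and the toy values are assumptions.

```python
import numpy as np

def crps_ensemble(members, obs):
    """CRPS of one ensemble forecast against a scalar observation (lower is better).

    Uses the common estimator mean|x_i - y| - 0.5 * mean|x_i - x_j|, where the
    second mean runs over all ordered member pairs (including i == j).
    """
    members = np.asarray(members, dtype=float)
    term_obs = np.mean(np.abs(members - obs))
    term_pairs = np.mean(np.abs(members[:, None] - members[None, :]))
    return term_obs - 0.5 * term_pairs

# With a single member the score reduces to the absolute error.
print(crps_ensemble([2.0], 5.0))  # 3.0

# Removing a bias of 2 from every member lowers the score for the same spread.
crps_biased = crps_ensemble([2.0, 4.0], 1.0)
crps_centered = crps_ensemble([0.0, 2.0], 1.0)
```

Because the score rewards both accuracy and appropriate spread, it is a natural companion to RMSE when comparing a member-wise corrected ensemble against the raw one.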

Circularity Check

0 steps flagged

No circularity: empirical ML post-processing on external forecast-analysis pairs

Full rationale

The paper trains a CNN on historical low-resolution ensemble forecasts paired with verifying analyses to perform bias correction and downscaling, then evaluates the trained model on separate test periods. This is a standard supervised learning pipeline whose outputs on new inputs are not equivalent to the training data by construction. No equations define a target metric in terms of itself, no fitted parameters are relabeled as independent predictions, and no load-bearing claims rest on self-citations or author-specific uniqueness theorems. The central results (improved deterministic accuracy, probabilistic reliability, spread-skill ratio) are obtained by direct comparison against independent verification data and therefore remain falsifiable outside the fitted values.
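The separation between training and test periods that this rationale relies on can be sketched as a simple temporal split. The dates below are hypothetical placeholders, not the paper's actual periods, and `temporal_split` is a name introduced here for illustration.

```python
from datetime import date, timedelta

def temporal_split(init_dates, test_start):
    """Split forecast initialization dates so that all test cases follow training."""
    train = [d for d in init_dates if d < test_start]
    test = [d for d in init_dates if d >= test_start]
    return train, test

# Hypothetical two-year archive of daily initializations (NOT the paper's dates).
dates = [date(2020, 1, 1) + timedelta(days=i) for i in range(730)]
train, test = temporal_split(dates, date(2021, 7, 1))

print(len(train), len(test), max(train) < min(test))
```

A stricter variant would also drop training cases whose verification time (up to 132 h after initialization) falls inside the test window, so that no verifying analysis is shared between the two sets.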

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The claim depends on the CNN learning stable error patterns from historical data and on the low-resolution ensemble containing spatially coherent information that survives downscaling.

free parameters (1)
  • CNN network weights
    The convolutional neural network parameters are optimized on historical forecast-analysis pairs to minimize post-processing error.
axioms (1)
  • domain assumption: Error characteristics of the 40-km NWP model are sufficiently stationary and spatially structured to be learned and corrected by a CNN trained on past cases.
    This premise is required for the member-wise correction step to improve future forecasts.



