PhysMetrics.Weather: An Evaluation Framework for Physical Consistency in ML Weather Models
Pith reviewed 2026-06-27 13:51 UTC · model grok-4.3
The pith
PhysMetrics.Weather supplies conservation, spectral, and dynamical metrics to test physical consistency in ML weather models.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper claims that PhysMetrics.Weather quantifies the physical realism of machine learning weather prediction (MLWP) models through three families of scores: conservation metrics that verify preservation of physical quantities, spectral metrics that assess how energy is distributed across wavenumbers, and dynamical metrics that check consistency with expected time evolution. These scores are meant to guide physics-informed architecture development and to evaluate operational reliability beyond standard error metrics.
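As a minimal sketch of what a conservation score could look like, the Python below computes a cosine-latitude-weighted global mean and its relative drift across forecast steps. It assumes global-mean surface pressure as a proxy for total atmospheric mass; the paper's own metric definitions are not reproduced on this page, and the function names and toy grid are hypothetical. Plain lists keep the sketch dependency-free; a real implementation would likely use an array library.

```python
import math

def area_weighted_mean(field, lats_deg):
    """Global mean of a lat-lon field, weighting each row by cos(latitude)."""
    weights = [math.cos(math.radians(lat)) for lat in lats_deg]
    total = 0.0
    norm = 0.0
    for row, w in zip(field, weights):
        total += w * sum(row)
        norm += w * len(row)
    return total / norm

def relative_drift(forecast_steps, lats_deg):
    """Relative change of the global mean at each lead time versus step 0."""
    means = [area_weighted_mean(f, lats_deg) for f in forecast_steps]
    return [(m - means[0]) / means[0] for m in means]

# Toy data (invented): a 3x4 surface-pressure grid in Pa that gains 0.1% per step.
lats = [-60.0, 0.0, 60.0]
step0 = [[101325.0] * 4 for _ in lats]
steps = [[[v * (1.0 + 0.001 * k) for v in row] for row in step0] for k in range(3)]
drift = relative_drift(steps, lats)
```

A forecast that conserves the quantity would keep every entry of `drift` near zero; here the drift grows with lead time by construction.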
What carries the argument
The PhysMetrics.Weather framework, which scores model forecasts using three metric families—conservation, spectral, and dynamical—to quantify physical realism.
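To illustrate the spectral family, here is a hedged sketch that measures how much of a field's variance sits at high zonal wavenumbers along one latitude circle. It uses a naive discrete Fourier transform and a hypothetical cutoff `k_min`; the paper may well use spherical harmonics or a different summary statistic, so treat the names and the cutoff as assumptions.

```python
import cmath
import math

def zonal_power_spectrum(values):
    """Power per zonal wavenumber k = 0..n//2 for one latitude circle (naive DFT)."""
    n = len(values)
    power = []
    for k in range(n // 2 + 1):
        coeff = sum(v * cmath.exp(-2j * math.pi * k * i / n)
                    for i, v in enumerate(values)) / n
        power.append(abs(coeff) ** 2)
    return power

def high_wavenumber_fraction(values, k_min):
    """Share of non-mean power held at wavenumbers >= k_min."""
    power = zonal_power_spectrum(values)[1:]  # drop k = 0 (the zonal mean)
    total = sum(power)
    return sum(power[k_min - 1:]) / total if total > 0 else 0.0

# Toy signal (invented): a large-scale wave (k = 2) plus a weaker small-scale wave (k = 10).
n = 32
sharp = [math.sin(2 * math.pi * 2 * i / n) + 0.5 * math.sin(2 * math.pi * 10 * i / n)
         for i in range(n)]
# A periodic three-point running mean stands in for an over-smoothed forecast.
smooth = [(sharp[i - 1] + sharp[i] + sharp[(i + 1) % n]) / 3 for i in range(n)]

frac_sharp = high_wavenumber_fraction(sharp, k_min=8)
frac_smooth = high_wavenumber_fraction(smooth, k_min=8)
```

The smoothed field keeps most of its large-scale power but loses nearly all of its small-scale power, so `frac_smooth` falls well below `frac_sharp`. RMSE alone would not expose this kind of loss of fine-scale energy.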
If this is right
- Models can be compared and selected using physical consistency scores in addition to accuracy metrics such as RMSE.
- Architecture choices for new MLWP models can be guided by iterative feedback from the conservation, spectral, and dynamical scores.
- Operational deployment of ML weather forecasts can incorporate pass/fail thresholds from the framework to reduce risk of unphysical outputs.
- Standardized reporting of these metrics enables direct comparison of physical fidelity across different MLWP approaches.
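The pass/fail idea in the list above can be sketched as a simple gate over per-family scores. Everything here is hypothetical: the `Check` structure, the "lower is better" convention, and the numeric scores and thresholds are invented for illustration and are not taken from the paper or its repository.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Check:
    family: str       # "conservation", "spectral", or "dynamical"
    score: float      # lower is better in this sketch
    threshold: float  # hypothetical operational limit

def evaluate(checks):
    """Return (overall_pass, list of failing family names)."""
    failures = [c.family for c in checks if c.score > c.threshold]
    return (not failures, failures)

# Invented example values, for illustration only.
checks = [
    Check("conservation", score=0.0004, threshold=0.001),
    Check("spectral", score=0.35, threshold=0.20),
    Check("dynamical", score=0.05, threshold=0.10),
]
passed, failing = evaluate(checks)
```

Reporting the failing families rather than a single boolean keeps the result usable for model comparison as well as for a deployment gate.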
Where Pith is reading between the lines
- The metrics could be folded into training objectives so that physical consistency becomes an optimization target rather than only a post-hoc check.
- The same three-family structure might transfer to other Earth-system or fluid-dynamics ML applications where realism beyond statistical fit matters.
- Community extensions of the open framework could add further physical constraints, such as thermodynamic or radiative balance tests.
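The first point in the list above, turning a consistency check into an optimization target, might look like the following sketch. It adds a weighted conservation penalty to a mean-squared-error term. The unweighted domain mean, the `weight` value, and all function names are assumptions made for illustration; this is not a loss function described in the paper.

```python
def mean(values):
    return sum(values) / len(values)

def mse(pred, target):
    return mean([(p - t) ** 2 for p, t in zip(pred, target)])

def conservation_penalty(pred, previous_state):
    """Squared change in the (unweighted) domain mean between consecutive states."""
    return (mean(pred) - mean(previous_state)) ** 2

def physics_informed_loss(pred, target, previous_state, weight):
    """Accuracy term plus a weighted penalty for breaking conservation."""
    return mse(pred, target) + weight * conservation_penalty(pred, previous_state)

previous = [1.0, 2.0, 3.0, 4.0]
target = [2.0, 1.0, 4.0, 3.0]         # same mean as `previous`
biased = [v + 0.5 for v in target]    # uniform offset breaks conservation

loss_exact = physics_informed_loss(target, target, previous, weight=10.0)
loss_biased = physics_informed_loss(biased, target, previous, weight=10.0)
```

With `weight=10.0`, the uniform offset costs 0.25 through the accuracy term and a further 2.5 through the penalty, so the conservation error dominates the loss.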
Load-bearing premise
That the three metric families together provide a sufficient and reliable test of whether an MLWP model is physically consistent enough for operational use.
What would settle it
An MLWP model that passes all three metric families at high levels yet produces forecasts that violate independent physical constraints in controlled real-world test cases would falsify the claim that the framework suffices for operational evaluation.
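One independent constraint that such a test could use is hydrostatic consistency between geopotential and temperature, checked here with the hypsometric equation. This is only a sketch: whether the framework already includes a check like this is not stated on this page. The function names are hypothetical, moisture is ignored (dry gas constant, temperature in place of virtual temperature), and the layer values are invented.

```python
import math

R_DRY = 287.05  # J kg^-1 K^-1, gas constant for dry air

def hypsometric_thickness(t_mean_k, p_lower_hpa, p_upper_hpa):
    """Geopotential thickness (m^2 s^-2) implied by hydrostatic balance for a layer."""
    return R_DRY * t_mean_k * math.log(p_lower_hpa / p_upper_hpa)

def hydrostatic_residual(phi_lower, phi_upper, t_mean_k, p_lower_hpa, p_upper_hpa):
    """Relative mismatch between forecast thickness and the hydrostatic value."""
    expected = hypsometric_thickness(t_mean_k, p_lower_hpa, p_upper_hpa)
    return abs((phi_upper - phi_lower) - expected) / expected

# Toy 1000-500 hPa layer with a layer-mean temperature of 260 K (invented values).
expected = hypsometric_thickness(260.0, 1000.0, 500.0)
residual_ok = hydrostatic_residual(1000.0, 1000.0 + expected, 260.0, 1000.0, 500.0)
residual_bad = hydrostatic_residual(1000.0, 1000.0 + 1.05 * expected, 260.0, 1000.0, 500.0)
```

A forecast whose geopotential and temperature fields disagree, as in the second case with its 5% thickness error, would be flagged even if each field had a low RMSE on its own.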
Original abstract
Machine learning weather prediction (MLWP) models have achieved impressive forecasting performance at a small fraction of the computational costs required for traditional physics-based methods. However, they are primarily (1) data-driven and (2) evaluated using pixel-wise error metrics (e.g., RMSE), so there are no guarantees that their forecasts are consistent with known physical laws. We introduce PhysMetrics.Weather, an evaluation framework that assesses the physical realism of MLWP models across three types of metrics: conservation, spectral, and dynamical. By quantifying physical realism, this tool guides the development of physics-informed architectures and helps evaluate whether MLWP models are reliable for operational use. Our framework is available on GitHub at https://github.com/Emmakast/PhysMetrics.Weather.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces PhysMetrics.Weather, an evaluation framework for machine learning weather prediction (MLWP) models. It defines three families of metrics—conservation, spectral, and dynamical—to quantify physical realism beyond standard pixel-wise error measures such as RMSE, with the goal of guiding physics-informed model development and assessing operational reliability. The associated code is released on GitHub.
Significance. A well-validated, open-source framework that systematically measures physical consistency in MLWP forecasts would be a useful contribution to the rapidly growing ML weather modeling literature. The absence of any reported implementation details, metric definitions, or validation experiments in the manuscript, however, prevents assessment of whether the proposed metrics actually deliver on that promise.
Major comments (1)
- [Abstract] The central claim that the three metric families together 'provide a sufficient and reliable test' of physical consistency for operational use (abstract) is not supported by any derivation, pseudocode, or empirical validation within the manuscript. Without concrete definitions or tests against known physical violations, it is impossible to judge whether the framework is load-bearing for its stated purpose.
Simulated Author's Rebuttal
We thank the referee for their review and for identifying the need for stronger evidentiary support in the manuscript. We address the major comment below and commit to revisions that add the requested concrete elements.
Point-by-point responses
- Referee: [Abstract] The central claim that the three metric families together 'provide a sufficient and reliable test' of physical consistency for operational use (abstract) is not supported by any derivation, pseudocode, or empirical validation within the manuscript. Without concrete definitions or tests against known physical violations, it is impossible to judge whether the framework is load-bearing for its stated purpose.
  Authors: We agree that the abstract phrasing risks implying sufficiency without accompanying detail, and that the current manuscript text (which focuses on high-level motivation and metric families) does not include explicit derivations, pseudocode, or validation experiments. This is a valid observation. We will revise the abstract to remove any suggestion of a complete or sufficient test and will add a new section that (1) provides formal definitions and pseudocode for each metric family, (2) reports implementation details, and (3) includes empirical validation on controlled cases with known physical violations (e.g., injected mass non-conservation or spectral artifacts). The GitHub repository already contains these components; the revision will surface them directly in the paper.
  Revision: yes
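The kind of controlled validation promised in the rebuttal could be sketched as follows: inject a mass violation of known size into a reference field, confirm that a conservation check recovers it, and confirm that a mass-preserving perturbation is not flagged. The helper names, the `TOLERANCE` value, and the toy field are hypothetical; this is not the authors' validation code.

```python
import math

def global_mean(field, lats_deg):
    """Cosine-latitude-weighted mean of a lat-lon field."""
    weights = [math.cos(math.radians(lat)) for lat in lats_deg]
    num = sum(w * sum(row) for row, w in zip(field, weights))
    den = sum(w * len(row) for row, w in zip(field, weights))
    return num / den

def mass_violation(before, after, lats_deg):
    """Relative change in the area-weighted mean between two states."""
    base = global_mean(before, lats_deg)
    return abs(global_mean(after, lats_deg) - base) / base

lats = [-45.0, 0.0, 45.0]
truth = [[1000.0, 1010.0, 990.0, 1000.0] for _ in lats]

# Control: move mass between two cells on the same latitude, so the total is unchanged.
control = [row[:] for row in truth]
control[1][0] += 5.0
control[1][1] -= 5.0

# Injected violation: scale every cell by 1%, a known non-conservation of 0.01.
leaky = [[v * 1.01 for v in row] for row in truth]

TOLERANCE = 1e-6  # hypothetical detection threshold
control_flagged = mass_violation(truth, control, lats) > TOLERANCE
leak_flagged = mass_violation(truth, leaky, lats) > TOLERANCE
```

Pairing each injected violation with a control that changes the field without breaking the constraint helps separate true detections from sensitivity to any perturbation at all.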
Circularity Check
No significant circularity in framework definition
Full rationale
The paper defines PhysMetrics.Weather as an evaluation framework consisting of conservation, spectral, and dynamical metrics to assess physical realism in ML weather prediction models. No derivation chain, fitted parameters, predictions, or load-bearing self-citations are present; the contribution is the introduction of the tool itself rather than any claim that reduces to its own inputs by construction. The work is self-contained as a definitional contribution with no equations or uniqueness theorems invoked that could create circularity.