Does Aurora Encode Atmospheric Structure? Latent Regime Analysis and Attribution
Pith reviewed 2026-06-26 01:31 UTC · model grok-4.3
The pith
Aurora's latent space organizes around seasonal cycles and attends to three-dimensional vertical atmospheric features.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Aurora's latent space is primarily organized by seasonal cycles, extreme storm events do not form a linearly separable cluster, LRP indicates attention to 3D vertical structure of the Great Storm of 1987, and masking relevant regions degrades forecasts 3.31 times more than random masking. These findings suggest that Aurora learns meteorological coherence and vertical structure without explicit instruction.
What carries the argument
Spatially pooled principal component analysis paired with layer-wise relevance propagation to map and attribute importance in the model's latent representations of atmospheric data.
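As a rough illustration of the first half of this machinery, the sketch below pools a latent tensor over its spatial axes and runs PCA on the resulting per-time-step vectors. The pooling operator (a plain spatial mean), the `(time, lat, lon, channels)` layout, the function name `spatially_pooled_pca`, and the synthetic data are all assumptions made for the example; the page does not specify the paper's actual pooling operation or number of retained components.

```python
import numpy as np

def spatially_pooled_pca(latents, n_components=2):
    """PCA on spatially pooled latents of shape (time, lat, lon, channels)."""
    # Assumption: pooling is a mean over the two spatial axes.
    pooled = latents.mean(axis=(1, 2))           # (time, channels)
    centered = pooled - pooled.mean(axis=0)      # remove the per-channel mean
    # PCA via SVD of the centered data matrix.
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    scores = u[:, :n_components] * s[:n_components]
    explained = (s ** 2) / np.sum(s ** 2)
    return scores, explained[:n_components]

# Synthetic stand-in for latents: one annual cycle pattern plus noise.
rng = np.random.default_rng(0)
days = np.arange(730)
season = np.sin(2 * np.pi * days / 365.25)
pattern = rng.normal(size=8)
latents = (season[:, None, None, None] * pattern
           + 0.1 * rng.normal(size=(730, 4, 6, 8)))

scores, explained = spatially_pooled_pca(latents)
seasonal_corr = abs(np.corrcoef(scores[:, 0], season)[0, 1])
print("explained variance ratio:", explained)
print("|corr(PC1, annual cycle)|:", seasonal_corr)
```

On real data the `latents` array would come from the model's intermediate activations; here the seasonal signal is built in by construction, so the example only demonstrates the mechanics.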
If this is right
- The model captures seasonal atmospheric cycles as a primary organizing principle in its internal space.
- Extreme weather events are not treated as a separate category in the learned representations.
- Attention to vertical layering in the atmosphere contributes measurably to prediction performance.
- Targeted masking based on relevance scores produces a much larger drop in skill than uniform random masking.
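The masking comparison in the last point can be sketched as a ratio of forecast errors. The example below is a toy under stated assumptions: the "forecast" is a random linear map, the relevance map is a simple weight-times-input magnitude rather than LRP, masked inputs are set to zero, and `degradation_ratio` is a hypothetical helper name. None of these choices are confirmed by the paper.

```python
import numpy as np

def degradation_ratio(forecast, x, relevance, k, n_random=200, seed=0):
    """Error from masking the k most relevant inputs, divided by the
    mean error from masking k randomly chosen inputs."""
    rng = np.random.default_rng(seed)
    baseline = forecast(x)

    def rmse_after_masking(idx):
        masked = x.copy()
        masked[idx] = 0.0  # assumption: masked inputs are set to zero
        return np.sqrt(np.mean((forecast(masked) - baseline) ** 2))

    top_k = np.argsort(relevance)[-k:]
    relevant_err = rmse_after_masking(top_k)
    random_errs = [
        rmse_after_masking(rng.choice(x.size, size=k, replace=False))
        for _ in range(n_random)
    ]
    return relevant_err / np.mean(random_errs)

# Toy linear "forecast" in which the first 10 inputs carry most of the signal.
rng = np.random.default_rng(1)
weights = rng.normal(size=(16, 100))
weights[:, :10] *= 10.0
x = rng.normal(size=100)
forecast = lambda v: weights @ v
relevance = np.abs(weights * x).sum(axis=0)  # stand-in for an LRP map

ratio = degradation_ratio(forecast, x, relevance, k=10)
print(f"relevant / random degradation ratio: {ratio:.2f}")
```

Keeping the random masks the same size as the relevance-based mask is what makes the ratio interpretable; a ratio near 1 would mean the relevance map carries no more information than chance.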
Where Pith is reading between the lines
- The same analysis methods could be applied to other foundation models to check whether they also encode vertical atmospheric structure.
- If the seasonal organization holds across different training datasets, it would suggest the model has extracted a general physical regularity rather than dataset-specific patterns.
- One could test whether altering the vertical resolution of input data changes the relevance maps in a predictable way.
Load-bearing premise
The patterns identified by the analysis methods reflect the model's actual learned understanding of atmospheric physics instead of being shaped mainly by data preparation steps or the specific events chosen for study.
What would settle it
Finding that masking the regions flagged by layer-wise relevance propagation degrades forecast accuracy no more than masking random regions of equal size, or that the leading principal components of the latent space fail to align with seasonal variations across multiple years.
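The second test, whether the leading principal component tracks the seasons, could be quantified along the following lines. This is a minimal sketch, assuming pooled features of shape `(time, channels)` and daily sampling; it regresses PC1 on annual sine and cosine harmonics and compares the resulting R² against a null built by shuffling the time labels. The function names and the synthetic data are illustrative and do not come from the paper.

```python
import numpy as np

def pc1_seasonal_r2(features, day_index):
    """R^2 of a regression of PC1 on annual sine/cosine harmonics."""
    centered = features - features.mean(axis=0)
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    pc1 = u[:, 0] * s[0]
    phase = 2 * np.pi * np.asarray(day_index, dtype=float) / 365.25
    design = np.column_stack([np.ones_like(phase), np.sin(phase), np.cos(phase)])
    coef, *_ = np.linalg.lstsq(design, pc1, rcond=None)
    residual = pc1 - design @ coef
    return 1.0 - residual.var() / pc1.var()

def shuffled_null(features, day_index, n_perm=200, seed=0):
    """Null distribution of R^2 when time labels are randomly permuted."""
    rng = np.random.default_rng(seed)
    return np.array([
        pc1_seasonal_r2(features, rng.permutation(day_index))
        for _ in range(n_perm)
    ])

# Synthetic three-year record with a built-in annual cycle.
rng = np.random.default_rng(0)
days = np.arange(3 * 365)
pattern = rng.normal(size=8)
features = (np.sin(2 * np.pi * days / 365.25)[:, None] * pattern
            + 0.1 * rng.normal(size=(days.size, 8)))

observed = pc1_seasonal_r2(features, days)
null = shuffled_null(features, days)
print(f"observed R^2: {observed:.3f}, largest shuffled R^2: {null.max():.3f}")
```

The same two functions can be applied unchanged to pooled raw input fields, which gives a direct comparison between what the inputs already contain and what the latent space adds.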
Original abstract
ML foundation models are able to emulate atmospheric dynamics accurately and efficiently but operate as opaque "black boxes". We investigate the internal representations of the Aurora model using spatially pooled PCA and layer-wise relevance propagation (LRP). We find evidence that Aurora's latent space is primarily organized by seasonal cycles, whereas extreme storm events do not form a linearly separable cluster. LRP indicates that the model attends to features consistent with the 3D vertical structure of the Great Storm of 1987. Perturbation tests show masking relevant regions degrades forecasts $3.31\times$ more than random masking. These findings suggest that Aurora learns meteorological coherence and vertical structure without explicit instruction.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper examines the latent representations of the Aurora atmospheric foundation model using spatially pooled principal component analysis (PCA) and layer-wise relevance propagation (LRP). It reports that the latent space is primarily structured by seasonal cycles, that extreme storm events do not form linearly separable clusters, that LRP attributes relevance to 3D vertical structures in events like the Great Storm of 1987, and that masking regions identified as relevant by LRP degrades forecast performance by a factor of 3.31 compared to random masking.
Significance. If the empirical findings hold under rigorous controls, this work would contribute to interpretability of ML foundation models for weather by showing evidence of unsupervised encoding of seasonal cycles and vertical atmospheric coherence. The perturbation-based validation of LRP attributions provides a concrete test of the claims.
major comments (3)
- [Abstract] Abstract: The 3.31× forecast degradation claim from the masking experiment is presented without error bars, number of trials, mask-size controls, or statistical tests; this quantitative result is load-bearing for the attribution validation but cannot be assessed for robustness or artifact from event selection.
- [PCA analysis] PCA analysis section: The claim that spatially pooled PCA isolates model-encoded seasonal structure lacks a control comparison of PCA on raw input fields (or shuffled-season data) to rule out that the observed axes simply recover input distribution statistics rather than learned representations.
- [LRP attribution] LRP attribution section: Attribution to 3D vertical structure is reported without analysis of sensitivity to LRP rule choice, layer-specific stabilization parameters, or input scaling; given known instabilities in LRP, this is required to establish that the maps reflect Aurora's learned coherence rather than method artifacts.
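The sensitivity concern in the third comment can be made concrete with a single linear layer. The sketch below implements the standard LRP-ε rule, $R_j = \sum_k \frac{a_j w_{jk}}{z_k + \epsilon\,\mathrm{sign}(z_k)} R_k$ with $z_k = \sum_j a_j w_{jk} + b_k$, and compares relevance maps across several values of ε. The value ε = 0.25 is the one the paper's appendix reports for convolutional and linear layers; the layer sizes, the random weights, and the function name `lrp_epsilon_linear` are assumptions for illustration, and this is not the paper's composite rule set.

```python
import numpy as np

def lrp_epsilon_linear(a, w, b, relevance_out, eps):
    """Propagate relevance through z = a @ w + b with the LRP-epsilon rule."""
    z = a @ w + b                                    # pre-activations z_k
    stabilizer = eps * np.where(z >= 0, 1.0, -1.0)   # push z away from zero
    s = relevance_out / (z + stabilizer)
    return a * (w @ s)                               # R_j = a_j * sum_k w_jk * s_k

rng = np.random.default_rng(0)
a = rng.normal(size=32)                # input activations
w = rng.normal(size=(32, 4))           # layer weights
b = np.zeros(4)                        # zero bias so relevance is conserved as eps -> 0
relevance_out = rng.uniform(0.5, 1.5, size=4)

reference = lrp_epsilon_linear(a, w, b, relevance_out, eps=1e-9)
damped = lrp_epsilon_linear(a, w, b, relevance_out, eps=0.25)
for eps in (1e-9, 0.25, 1.0):
    r = lrp_epsilon_linear(a, w, b, relevance_out, eps)
    cosine = r @ reference / (np.linalg.norm(r) * np.linalg.norm(reference))
    print(f"eps={eps:g}  total relevance={r.sum():.4f}  cosine vs reference={cosine:.4f}")
```

Larger ε absorbs more relevance and can reweight contributions from outputs with small pre-activations, so reporting how the maps change across ε is a cheap first check on method dependence.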
minor comments (2)
- [Abstract] Abstract and methods: The spatial pooling operation, number of retained components, and exact dataset (reanalysis product, time range, variables) are not specified, hindering reproducibility.
- [References] The manuscript would benefit from explicit citation of standard LRP references (Bach et al. 2015) and prior interpretability studies on atmospheric ML models.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive report. We address each major comment below, indicating revisions where appropriate to improve robustness and clarity.
Point-by-point responses
-
Referee: [Abstract] Abstract: The 3.31× forecast degradation claim from the masking experiment is presented without error bars, number of trials, mask-size controls, or statistical tests; this quantitative result is load-bearing for the attribution validation but cannot be assessed for robustness or artifact from event selection.
Authors: We agree that the masking result requires statistical support. In revision we will report the number of trials, error bars across trials, explicit mask-size controls, and a statistical test comparing relevant vs. random masking. These details will be added to both the abstract and the results section. revision: yes
-
Referee: [PCA analysis] PCA analysis section: The claim that spatially pooled PCA isolates model-encoded seasonal structure lacks a control comparison of PCA on raw input fields (or shuffled-season data) to rule out that the observed axes simply recover input distribution statistics rather than learned representations.
Authors: The suggested control is appropriate. We will add PCA on raw input fields and on season-shuffled data to the revised PCA section, allowing direct comparison that isolates the contribution of the learned latent representations. revision: yes
-
Referee: [LRP attribution] LRP attribution section: Attribution to 3D vertical structure is reported without analysis of sensitivity to LRP rule choice, layer-specific stabilization parameters, or input scaling; given known instabilities in LRP, this is required to establish that the maps reflect Aurora's learned coherence rather than method artifacts.
Authors: We acknowledge known LRP instabilities. We will add a limited sensitivity check across the two primary LRP rules employed and document the stabilization parameters and input scaling used. The existing perturbation validation already provides an independent test of attribution quality; the added analysis will further address method dependence. revision: partial
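The statistical support promised in the first response could take the form of a bootstrap confidence interval on the degradation ratio. The sketch below is one hedged possibility, assuming each trial yields a paired error under relevance-based masking and under random masking of equal size. The function name `bootstrap_ratio_ci` and all numbers are synthetic placeholders and are not results from the paper.

```python
import numpy as np

def bootstrap_ratio_ci(relevant_errs, random_errs, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for mean(relevant_errs) / mean(random_errs)."""
    rng = np.random.default_rng(seed)
    relevant_errs = np.asarray(relevant_errs, dtype=float)
    random_errs = np.asarray(random_errs, dtype=float)
    n = relevant_errs.size
    ratios = np.empty(n_boot)
    for i in range(n_boot):
        idx = rng.integers(0, n, size=n)  # resample trials, keeping pairs together
        ratios[i] = relevant_errs[idx].mean() / random_errs[idx].mean()
    point = relevant_errs.mean() / random_errs.mean()
    lo, hi = np.quantile(ratios, [alpha / 2, 1 - alpha / 2])
    return point, lo, hi

# Synthetic per-trial errors (placeholders, not the paper's data).
rng = np.random.default_rng(42)
random_errs = rng.gamma(shape=5.0, scale=0.2, size=30)
relevant_errs = 3.0 * random_errs * rng.lognormal(mean=0.0, sigma=0.2, size=30)

point, lo, hi = bootstrap_ratio_ci(relevant_errs, random_errs)
print(f"ratio = {point:.2f}, 95% CI = [{lo:.2f}, {hi:.2f}]")
```

An interval that excludes 1 would support the claim that relevance-guided masking hurts more than random masking; the number of trials and the mask size should be reported alongside it.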
Circularity Check
No circularity: empirical analysis with no derivations or self-referential fits
Full rationale
The paper consists entirely of empirical investigation of an existing model's latent space via standard tools (spatially pooled PCA and LRP) plus a perturbation test. No equations, parameter fitting presented as prediction, uniqueness theorems, or ansatzes appear. Central claims rest on direct application of these methods to Aurora outputs and observable degradation ratios, remaining independent of any self-citation chain or definitional loop. This is the normal case of a self-contained empirical study.
Reference graph
Works this paper leans on
-
[1]
URL https://arxiv.org/abs/2106.13200. Sebastian Bach, Alexander Binder, Grégoire Montavon, Frederick Klauschen, Klaus-Robert Müller, and Wojciech Samek. On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation. PLOS ONE, 10(7):e0130140,
-
[2]
doi: 10.1371/journal.pone.0130140. URL https://doi.org/10.1371/journal.pone.0130140. Kaifeng Bi, Lingxi Xie, Hengheng Zhang, Xin Chen, Xiaotao Gu, and Qi Tian. Pangu-weather: A 3d high-resolution model for fast and accurate global weather forecast,
-
[3]
URL https://arxiv.org/abs/2211.02556. Alexander Binder, Grégoire Montavon, Sebastian Bach, Klaus-Robert Müller, and Wojciech Samek. Layer-wise relevance propagation for neural networks with local renormalization layers,
-
[4]
URL https://arxiv.org/abs/1604.00825. Cristian Bodnar, Wessel P. Bruinsma, Ana Lucic, Megan Stanley, Anna Vaughan, Johannes Brandstetter, Patrick Garvan, Maik Riechert, Jonathan A. Weyn, Haiyu Dong, Jayesh K. Gupta, Kit Thambiratnam, Alexander T. Archibald, Chun-Chieh Wu, Elizabeth Heider, Max Welling, Richard E. Turner, and Paris Perdikaris. A foundatio...
-
[5]
URL https://arxiv.org/abs/2405.13063. Walid Bousselham, Angie Boggust, Sofian Chaybouti, Hendrik Strobelt, and Hilde Kuehne. Legrad: An explainability method for vision transformers via feature formation sensitivity,
-
[6]
URL https://arxiv.org/abs/2404.03214. Trenton Bricken, Adly Templeton, Joshua Batson, Brian Chen, Adam Jermyn, Tom Conerly, Nick Turner, Cem Anil, Carson Denison, Amanda Askell, Robert Lasenby, Yuhuai Wu, Shauna Kravec, Nicholas Schiefer, Tim Maxwell, Nicholas Joseph, Zac Hatfield-Dodds, Alex Tamkin, Karina Nguyen, Brayden McLean, Josiah E Burke, Tristan ...
-
[7]
URL https://transformer-circuits.pub/2023/monosemantic-features/index.html. Tianqi Chen, Bing Xu, Chiyuan Zhang, and Carlos Guestrin. Training deep nets with sublinear memory cost,
-
[8]
URL https://arxiv.org/abs/1604.06174. Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale,
-
[9]
URL https://arxiv.org/abs/2010.11929. Imme Ebert-Uphoff and Kyle A. Hilburn. Evaluation, tuning and interpretation of neural networks for meteorological applications,
-
[10]
URL https://arxiv.org/abs/2005.03126. Carl Eckart and Gale Young. The approximation of one matrix by another of lower rank. Psychometrika, 1(3):211–218,
-
[11]
doi: 10.1007/BF02288367. ECMWF. IFS Documentation CY49R1. Technical report, ECMWF,
-
[12]
URL https://doi.org/10.21957/956d60ad81. ECMWF. Section 2.1.2.4 HRES - High Resolution Forecasts. ECMWF Forecast User Guide,
-
[13]
URL https://confluence.ecmwf.int/display/FUG/Section+2.1.2.4+HRES+-+High+Resolution+Forecasts. Published as a workshop paper at SciForDL 2nd edition. H. Hersbach, B. Bell, P. Berrisford, G. Biavati, A. Horányi, J. Muñoz Sabater, J. Nicolas, C. Peubey, R. Radu, I. Rozum, D. Schepers, A. Simmons, C. Soci, D. Dee, and J.-N. Thépaut. ERA5 hourly data on s...
-
[14]
Copernicus Climate Data Store, accessed 2026-02-15, doi:10.24381/cds.adbb2d47
URL https://doi.org/10.24381/cds.adbb2d47. Andrew Jaegle, Felix Gimeno, Andrew Brock, Andrew Zisserman, Oriol Vinyals, and Joao Carreira. Perceiver: General perception with iterative attention,
-
[15]
URL https://arxiv.org/abs/2103.03206. Remi Lam, Alvaro Sanchez-Gonzalez, Matthew Willson, Peter Wirnsberger, Meire Fortunato, Ferran Alet, Suman Ravuri, Timo Ewalds, Zach Eaton-Rosen, Weihua Hu, Alexander Merose, Stephan Hoyer, George Holland, Oriol Vinyals, Jacklynn Stott, Alexander Pritzel, Shakir Mohamed, and Peter Battaglia. Graphcast: Learning sk...
-
[16]
URL https://arxiv.org/abs/2212.12794. Ze Liu, Han Hu, Yutong Lin, Zhuliang Yao, Zhenda Xie, Yixuan Wei, Jia Ning, Yue Cao, Zheng Zhang, Li Dong, Furu Wei, and Baining Guo. Swin transformer v2: Scaling up capacity and resolution,
-
[17]
URL https://arxiv.org/abs/2111.09883. Scott Lundberg and Su-In Lee. A unified approach to interpreting model predictions,
-
[18]
URL https://arxiv.org/abs/1705.07874. Wenjie Luo, Yujia Li, Raquel Urtasun, and Richard Zemel. Understanding the effective receptive field in deep convolutional neural networks. In Advances in Neural Information Processing Systems,
-
[20]
URL https://arxiv.org/abs/2512.24440. Alexander Modell, Patrick Rubin-Delanchy, and Nick Whiteley. The origins of representation manifolds in large language models,
-
[21]
URL https://arxiv.org/abs/2505.18235. Jaideep Pathak, Shashank Subramanian, Peter Harrington, Sanjeev Raja, Ashesh Chattopadhyay, Morteza Mardani, Thorsten Kurth, David Hall, Zongyi Li, Kamyar Azizzadenesheli, Pedram Hassanzadeh, Karthik Kashinath, and Animashree Anandkumar. Fourcastnet: A global data-driven high-resolution weather model using adaptive fo...
-
[22]
URL https://arxiv.org/abs/2202.11214. Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, and Yuxiong He. Zero: Memory optimizations toward training trillion parameter models,
-
[23]
URL https://arxiv.org/abs/1910.02054. Benjamin Richards and Pushpa Kumar Balan. Physical consistency of Aurora's encoder: A quantitative study,
-
[24]
URL https://arxiv.org/abs/2511.07787. Royal Meteorological Society. The Beaufort Wind Scale, n.d. URL https://www.rmets.org/metmatters/beaufort-wind-scale. Wojciech Samek, Alexander Binder, Grégoire Montavon, Sebastian Bach, and Klaus-Robert Müller. Evaluating the visualization of what a deep neural network has learned,
-
[25]
URL https://arxiv.org/abs/1509.06321. Enrico Scoccimarro, Alessio Bellucci, and Daniele Peano. CMCC CMCC-CM2-VHR4 model output prepared for CMIP6 HighResMIP hist-1950,
-
[26]
doi: 10.1029/2019MS002002. Ruyi Yang, Jingyu Hu, Zihao Li, Jianli Mu, Tingzhao Yu, Jiangjiang Xia, Xuhong Li, Aritra Dasgupta, and Haoyi Xiong. Interpretable machine learning for weather and climate prediction: A survey,
-
[27]
URL https://arxiv.org/abs/2403.18864. A THE AURORA FOUNDATION MODEL ARCHITECTURE Aurora (Bodnar et al.,
-
[28]
and ZeRO-based optimizations (Rajbhandari et al., 2020). A.3 PRE-TRAINING OBJECTIVE The model was pre-trained on a massive corpus of heterogeneous data, including ERA5 reanalysis (Hersbach et al.,
-
[29]
and HRES operational forecasts (ECMWF, 2024), for approximately 150k steps using a Mean Absolute Error (MAE) objective. This pre-training forces the model to learn a compressed, physically consistent representation of atmospheric dynamics, which we probe in this study via the fixed weights of the AuroraSmallPretrained checkpoint. B EXTENDED METHODOLOGY AND IM...
-
[30]
to define a composite rule set that handles the heterogeneous layers of the Swin Transformer V2 U-Net. The relevance propagation rules $R_j = \sum_k \frac{z_{jk}}{\sum_j z_{jk}} R_k$ are parametrized as follows: • Convolutional and Linear Layers: We apply the LRP-$\epsilon$ rule with a stabilizer term $\epsilon = 0.25$ to dampen noise and prevent numerical instability when activations approach zero: Rj ...