
arxiv: 2606.26361 · v1 · pith:IHEX6AF2new · submitted 2026-06-24 · 💻 cs.LG · physics.ao-ph

Does Aurora Encode Atmospheric Structure? Latent Regime Analysis and Attribution

Pith reviewed 2026-06-26 01:31 UTC · model grok-4.3

classification 💻 cs.LG physics.ao-ph
keywords Aurora model · latent space analysis · atmospheric modeling · explainable AI · layer-wise relevance propagation · principal component analysis · weather forecasting · vertical structure

The pith

Aurora's latent space organizes around seasonal cycles and attends to three-dimensional vertical atmospheric features.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper applies spatially pooled principal component analysis and layer-wise relevance propagation to open up the internal representations of the Aurora weather model. It finds that the dominant organization in the latent space follows seasonal cycles, while extreme storm events do not appear as distinct linear clusters. Relevance propagation highlights that the model focuses on vertical structure in the atmosphere during the Great Storm of 1987, and tests confirm that masking those highlighted regions harms forecast skill far more than masking random areas. This work shows that the model acquires meteorological coherence and vertical awareness through training alone.

Core claim

Aurora's latent space is primarily organized by seasonal cycles, extreme storm events do not form a linearly separable cluster, LRP indicates attention to 3D vertical structure of the Great Storm of 1987, and masking relevant regions degrades forecasts 3.31 times more than random masking. These findings suggest that Aurora learns meteorological coherence and vertical structure without explicit instruction.

What carries the argument

Spatially pooled principal component analysis paired with layer-wise relevance propagation to map and attribute importance in the model's latent representations of atmospheric data.
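
As a rough illustration of the first half of that machinery, the sketch below averages latent tensors over their spatial axes and runs PCA on the pooled vectors through an SVD. The tensor layout `(n_samples, n_lat, n_lon, n_channels)`, the function name `spatially_pooled_pca`, and the synthetic annual-cycle data are assumptions made for this example; the paper's actual pooling operation and number of retained components are not specified in the text above.

```python
import numpy as np

def spatially_pooled_pca(latents, n_components=3):
    """PCA on spatially averaged latents.

    latents: array of shape (n_samples, n_lat, n_lon, n_channels).
             This layout is an assumption for the sketch, not Aurora's API.
    Returns (scores, explained_variance_ratio) for the leading components.
    """
    # Spatial pooling: average over both spatial axes, one vector per sample.
    pooled = latents.mean(axis=(1, 2))
    # Centre each channel before the decomposition.
    centred = pooled - pooled.mean(axis=0, keepdims=True)
    # PCA via SVD: rows of vt are principal directions.
    u, s, vt = np.linalg.svd(centred, full_matrices=False)
    scores = u[:, :n_components] * s[:n_components]
    variance = s ** 2
    return scores, variance[:n_components] / variance.sum()

# Synthetic stand-in for one year of daily latents with an injected annual cycle.
rng = np.random.default_rng(0)
days = np.arange(365)
season = np.cos(2 * np.pi * days / 365.0)
direction = rng.normal(size=16)  # hypothetical channel direction carrying the cycle
latents = rng.normal(scale=0.1, size=(365, 4, 8, 16))
latents += 3.0 * season[:, None, None, None] * direction[None, None, None, :]

scores, ratio = spatially_pooled_pca(latents)
# If the leading component tracks the seasonal signal, this correlation is near 1.
corr = abs(np.corrcoef(scores[:, 0], season)[0, 1])
print(float(ratio[0]), float(corr))
```

On real latents the interesting question is whether PC1 still follows the calendar once the input fields themselves are controlled for, which is the referee's point further down.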

If this is right

  • The model captures seasonal atmospheric cycles as a primary organizing principle in its internal space.
  • Extreme weather events are not treated as a separate category in the learned representations.
  • Attention to vertical layering in the atmosphere contributes measurably to prediction performance.
  • Targeted masking based on relevance scores produces a much larger drop in skill than uniform random masking.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same analysis methods could be applied to other foundation models to check whether they also encode vertical atmospheric structure.
  • If the seasonal organization holds across different training datasets, it would suggest the model has extracted a general physical regularity rather than dataset-specific patterns.
  • One could test whether altering the vertical resolution of input data changes the relevance maps in a predictable way.

Load-bearing premise

The patterns identified by the analysis methods reflect the model's actual learned understanding of atmospheric physics instead of being shaped mainly by data preparation steps or the specific events chosen for study.

What would settle it

Finding that masking the regions flagged by layer-wise relevance propagation degrades forecast accuracy no more than masking random regions of equal size, or that the leading principal components of the latent space fail to align with seasonal variations across multiple years.
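
The first of these tests reduces to a single number. The sketch below shows one plausible way to compute it, assuming the ratio is defined as the error increase under relevance-guided masking divided by the mean error increase under random masks of equal size. That definition, the function name `degradation_ratio`, and the numbers are illustrative assumptions; the paper's exact metric is not given in the text above.

```python
from statistics import mean

def degradation_ratio(baseline_err, relevant_err, random_errs):
    """Error increase from masking relevant regions, relative to the mean
    error increase from random masks of the same size (assumed definition)."""
    relevant_increase = relevant_err - baseline_err
    random_increase = mean(e - baseline_err for e in random_errs)
    if random_increase <= 0:
        raise ValueError("random masking did not degrade the forecast; ratio undefined")
    return relevant_increase / random_increase

# Illustrative numbers only, not taken from the paper.
baseline = 1.00                 # forecast error with unmasked input
relevant = 1.60                 # error after masking LRP-flagged regions
randoms = [1.18, 1.22, 1.20]    # errors after equal-size random masks

ratio = degradation_ratio(baseline, relevant, randoms)
print(ratio)
```

A ratio close to 1 would be the outcome that settles the question against the attribution maps.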

Figures

Figures reproduced from arXiv: 2606.26361 by Ana Lucic, Emma Kasteleyn.

Figure 1
Figure 1: PCA Analysis. (a) Seasons show distinct clustering, unlike (b) storms.
3 LOCAL ATTRIBUTION AND VERTICAL STRUCTURE (Q2): To answer Q2, we apply LRP to the Great Storm of 1987, a classic extratropical cyclone. We address the specific architectural challenge of the Swin V2 backbone – shifted window self-attention – by wrapping the attention mechanism to preserve the computational graph during cyclic spatial sh…
Figure 2
Figure 2: Surface variable relevance. 1987 Great Storm: Model focuses on dynamic frontal features. Panels: Level 0 (Top), Level 2, Level 4, Level 6, Level 8, Level 10, Level 12 (Bottom), Surface (10u).
Figure 3
Figure 3: Atmospheric variable relevance. 1987 Great Storm: Model captures frontal boundaries on multiple levels.
4 CONCLUSION: This study investigated the internal representations of the Aurora foundation model through two complementary lenses. Regarding global organization (RQ1), results suggest that Aurora captures the annual seasonal cycle, while storms are not encoded in the first levels of PCA. LRP analysis sug…
Figure 4
Figure 4: Higher-Order Latent Components (PC2 vs. PC3). (a) Seasonal clusters become less distinct. (b) Storm/calm regimes show no separability.
C.3 CONTRASTIVE PROJECTION: (a) Fall vs. spring, (b) Storm vs. calm.
Figure 5
Figure 5: Contrastive projections. (a) Fall–spring projection showing shared support region. (b) Storm–calm projection showing partial separation.
Figure 6
Figure 6: LRP Relevance Maps. 2020 Baseline (1 Jan): Model focus shifts to static geography.
C.5 BASELINE LEVELS LRP. Panels: Level 0 (Top), Level 2, Level 4, Level 6, Level 8, Level 10, Level 12 (Bottom), Surface (10u).
Figure 7
Figure 7: Surface U-Wind Relevance. 2020 Baseline (1 Jan): Model mainly captures wind relevance on lower levels.

Original abstract

ML foundation models are able to emulate atmospheric dynamics accurately and efficiently but operate as opaque "black boxes". We investigate the internal representations of the Aurora model using spatially pooled PCA and layer-wise relevance propagation (LRP). We find evidence that Aurora's latent space is primarily organized by seasonal cycles, whereas extreme storm events do not form a linearly separable cluster. LRP indicates that the model attends to features consistent with the 3D vertical structure of the Great Storm of 1987. Perturbation tests show masking relevant regions degrades forecasts 3.31× more than random masking. These findings suggest that Aurora learns meteorological coherence and vertical structure without explicit instruction.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper examines the latent representations of the Aurora atmospheric foundation model using spatially pooled principal component analysis (PCA) and layer-wise relevance propagation (LRP). It reports that the latent space is primarily structured by seasonal cycles, that extreme storm events do not form linearly separable clusters, that LRP attributes relevance to 3D vertical structures in events like the Great Storm of 1987, and that masking regions identified as relevant by LRP degrades forecast performance by a factor of 3.31 compared to random masking.

Significance. If the empirical findings hold under rigorous controls, this work would contribute to interpretability of ML foundation models for weather by showing evidence of unsupervised encoding of seasonal cycles and vertical atmospheric coherence. The perturbation-based validation of LRP attributions provides a concrete test of the claims.

major comments (3)
  1. [Abstract] Abstract: The 3.31× forecast degradation claim from the masking experiment is presented without error bars, number of trials, mask-size controls, or statistical tests; this quantitative result is load-bearing for the attribution validation but cannot be assessed for robustness or artifact from event selection.
  2. [PCA analysis] PCA analysis section: The claim that spatially pooled PCA isolates model-encoded seasonal structure lacks a control comparison of PCA on raw input fields (or shuffled-season data) to rule out that the observed axes simply recover input distribution statistics rather than learned representations.
  3. [LRP attribution] LRP attribution section: Attribution to 3D vertical structure is reported without analysis of sensitivity to LRP rule choice, layer-specific stabilization parameters, or input scaling; given known instabilities in LRP, this is required to establish that the maps reflect Aurora's learned coherence rather than method artifacts.
minor comments (2)
  1. [Abstract] Abstract and methods: The spatial pooling operation, number of retained components, and exact dataset (reanalysis product, time range, variables) are not specified, hindering reproducibility.
  2. [References] The manuscript would benefit from explicit citation of standard LRP references (Bach et al. 2015) and prior interpretability studies on atmospheric ML models.
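
To make the first major comment concrete, the sketch below shows one way error bars could be attached to a masking ratio: a percentile bootstrap over per-trial error increases. The function name `bootstrap_ratio_ci`, the resampling scheme, and the trial values are assumptions for illustration. The paper reports no trial-level data, so nothing here reproduces its 3.31× figure.

```python
import random
from statistics import mean

def bootstrap_ratio_ci(relevant_increases, random_increases,
                       n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for mean(relevant) / mean(random).

    Both inputs are per-trial forecast-error increases; the two groups are
    resampled independently (an assumption about the experimental design).
    """
    rng = random.Random(seed)
    n_rel, n_rand = len(relevant_increases), len(random_increases)
    ratios = []
    for _ in range(n_boot):
        rel = [relevant_increases[rng.randrange(n_rel)] for _ in range(n_rel)]
        rnd = [random_increases[rng.randrange(n_rand)] for _ in range(n_rand)]
        denom = mean(rnd)
        if denom > 0:  # skip resamples where the ratio is undefined
            ratios.append(mean(rel) / denom)
    ratios.sort()
    lo = ratios[int((alpha / 2) * len(ratios))]
    hi = ratios[int((1 - alpha / 2) * len(ratios)) - 1]
    point = mean(relevant_increases) / mean(random_increases)
    return point, lo, hi

# Hypothetical per-trial error increases, for illustration only.
relevant_trials = [0.58, 0.66, 0.61, 0.70, 0.55, 0.63]
random_trials = [0.18, 0.22, 0.20, 0.25, 0.17, 0.21]

point, lo, hi = bootstrap_ratio_ci(relevant_trials, random_trials)
print(point, lo, hi)
```

An interval whose lower bound stays well above 1 would support the claim; one that includes 1 would not.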

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed and constructive report. We address each major comment below, indicating revisions where appropriate to improve robustness and clarity.

Point-by-point responses
  1. Referee: [Abstract] Abstract: The 3.31× forecast degradation claim from the masking experiment is presented without error bars, number of trials, mask-size controls, or statistical tests; this quantitative result is load-bearing for the attribution validation but cannot be assessed for robustness or artifact from event selection.

    Authors: We agree that the masking result requires statistical support. In revision we will report the number of trials, error bars across trials, explicit mask-size controls, and a statistical test comparing relevant vs. random masking. These details will be added to both the abstract and the results section. revision: yes

  2. Referee: [PCA analysis] PCA analysis section: The claim that spatially pooled PCA isolates model-encoded seasonal structure lacks a control comparison of PCA on raw input fields (or shuffled-season data) to rule out that the observed axes simply recover input distribution statistics rather than learned representations.

    Authors: The suggested control is appropriate. We will add PCA on raw input fields and on season-shuffled data to the revised PCA section, allowing direct comparison that isolates the contribution of the learned latent representations. revision: yes

  3. Referee: [LRP attribution] LRP attribution section: Attribution to 3D vertical structure is reported without analysis of sensitivity to LRP rule choice, layer-specific stabilization parameters, or input scaling; given known instabilities in LRP, this is required to establish that the maps reflect Aurora's learned coherence rather than method artifacts.

    Authors: We acknowledge known LRP instabilities. We will add a limited sensitivity check across the two primary LRP rules employed and document the stabilization parameters and input scaling used. The existing perturbation validation already provides an independent test of attribution quality; the added analysis will further address method dependence. revision: partial
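
The season-shuffled control promised in the second response could take several forms. The sketch below shows one simple variant, a label-permutation test: it measures how much of the variance in a leading component is explained by season labels, then compares that with the same statistic under shuffled labels. The function names, the four 90-day "seasons", and the synthetic scores are assumptions for this example, and the test is a stand-in rather than the authors' planned analysis.

```python
import numpy as np

def between_group_fraction(values, labels):
    """Fraction of variance in `values` explained by group means (eta squared)."""
    values = np.asarray(values, dtype=float)
    labels = np.asarray(labels)
    grand = values.mean()
    total = ((values - grand) ** 2).sum()
    between = sum(
        (labels == g).sum() * (values[labels == g].mean() - grand) ** 2
        for g in np.unique(labels)
    )
    return between / total

def shuffled_label_control(values, labels, n_perm=500, seed=0):
    """Compare the observed statistic with its distribution under shuffled labels."""
    rng = np.random.default_rng(seed)
    observed = between_group_fraction(values, labels)
    null = np.array([
        between_group_fraction(values, rng.permutation(labels))
        for _ in range(n_perm)
    ])
    p_value = (1 + (null >= observed).sum()) / (1 + n_perm)
    return observed, p_value

# Synthetic PC1 scores with an annual cycle and four 90-day "seasons".
rng = np.random.default_rng(1)
days = np.arange(360)
pc1 = np.cos(2 * np.pi * days / 360.0) + rng.normal(scale=0.1, size=360)
seasons = days // 90

observed, p_value = shuffled_label_control(pc1, seasons)
print(observed, p_value)
```

Running the same statistic on PCA scores of the raw input fields would give the other half of the comparison the referee asks for.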

Circularity Check

0 steps flagged

No circularity: empirical analysis with no derivations or self-referential fits

full rationale

The paper consists entirely of empirical investigation of an existing model's latent space via standard tools (spatially pooled PCA and LRP) plus a perturbation test. No equations, parameter fitting presented as prediction, uniqueness theorems, or ansatzes appear. Central claims rest on direct application of these methods to Aurora outputs and observable degradation ratios, remaining independent of any self-citation chain or definitional loop. This is the normal case of a self-contained empirical study.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No mathematical model, free parameters, axioms, or invented entities are introduced; the contribution is an empirical interpretability study whose claims rest on the validity of the chosen analysis techniques.


Reference graph

Works this paper leans on

29 extracted references · 5 canonical work pages

  1. [...] URL https://arxiv.org/abs/2106.13200.
  2. Sebastian Bach, Alexander Binder, Grégoire Montavon, Frederick Klauschen, Klaus-Robert Müller, and Wojciech Samek. On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation. PLOS ONE, 10(7):e0130140. doi: 10.1371/journal.pone.0130140. URL https://doi.org/10.1371/journal.pone.0130140.
  3. Kaifeng Bi, Lingxi Xie, Hengheng Zhang, Xin Chen, Xiaotao Gu, and Qi Tian. Pangu-weather: A 3d high-resolution model for fast and accurate global weather forecast. URL https://arxiv.org/abs/2211.02556.
  4. Alexander Binder, Grégoire Montavon, Sebastian Bach, Klaus-Robert Müller, and Wojciech Samek. Layer-wise relevance propagation for neural networks with local renormalization layers. URL https://arxiv.org/abs/1604.00825.
  5. Cristian Bodnar, Wessel P. Bruinsma, Ana Lucic, Megan Stanley, Anna Vaughan, Johannes Brandstetter, Patrick Garvan, Maik Riechert, Jonathan A. Weyn, Haiyu Dong, Jayesh K. Gupta, Kit Thambiratnam, Alexander T. Archibald, Chun-Chieh Wu, Elizabeth Heider, Max Welling, Richard E. Turner, and Paris Perdikaris. A foundatio... URL https://arxiv.org/abs/2405.13063.
  6. Walid Bousselham, Angie Boggust, Sofian Chaybouti, Hendrik Strobelt, and Hilde Kuehne. Legrad: An explainability method for vision transformers via feature formation sensitivity. URL https://arxiv.org/abs/2404.03214.
  7. Trenton Bricken, Adly Templeton, Joshua Batson, Brian Chen, Adam Jermyn, Tom Conerly, Nick Turner, Cem Anil, Carson Denison, Amanda Askell, Robert Lasenby, Yuhuai Wu, Shauna Kravec, Nicholas Schiefer, Tim Maxwell, Nicholas Joseph, Zac Hatfield-Dodds, Alex Tamkin, Karina Nguyen, Brayden McLean, Josiah E Burke, Tristan ... URL https://transformer-circuits.pub/2023/monosemantic-features/index.html.
  8. Tianqi Chen, Bing Xu, Chiyuan Zhang, and Carlos Guestrin. Training deep nets with sublinear memory cost. URL https://arxiv.org/abs/1604.06174.
  9. Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. URL https://arxiv.org/abs/2010.11929.
  10. Imme Ebert-Uphoff and Kyle A. Hilburn. Evaluation, tuning and interpretation of neural networks for meteorological applications. URL https://arxiv.org/abs/2005.03126.
  11. Carl Eckart and Gale Young. The approximation of one matrix by another of lower rank. Psychometrika, 1(3):211–218. doi: 10.1007/BF02288367.
  12. ECMWF. IFS Documentation CY49R1. Technical report, ECMWF. URL https://doi.org/10.21957/956d60ad81.
  13. ECMWF. Section 2.1.2.4 HRES - High Resolution Forecasts. ECMWF Forecast User Guide. URL https://confluence.ecmwf.int/display/FUG/Section+2.1.2.4+HRES+-+High+Resolution+Forecasts.
  14. H. Hersbach, B. Bell, P. Berrisford, G. Biavati, A. Horányi, J. Muñoz Sabater, J. Nicolas, C. Peubey, R. Radu, I. Rozum, D. Schepers, A. Simmons, C. Soci, D. Dee, and J.-N. Thépaut. ERA5 hourly data on s... Copernicus Climate Data Store, accessed 2026-02-15, doi:10.24381/cds.adbb2d47. URL https://doi.org/10.24381/cds.adbb2d47.
  15. Andrew Jaegle, Felix Gimeno, Andrew Brock, Andrew Zisserman, Oriol Vinyals, and Joao Carreira. Perceiver: General perception with iterative attention. URL https://arxiv.org/abs/2103.03206.
  16. Remi Lam, Alvaro Sanchez-Gonzalez, Matthew Willson, Peter Wirnsberger, Meire Fortunato, Ferran Alet, Suman Ravuri, Timo Ewalds, Zach Eaton-Rosen, Weihua Hu, Alexander Merose, Stephan Hoyer, George Holland, Oriol Vinyals, Jacklynn Stott, Alexander Pritzel, Shakir Mohamed, and Peter Battaglia. Graphcast: Learning sk... URL https://arxiv.org/abs/2212.12794.
  17. Ze Liu, Han Hu, Yutong Lin, Zhuliang Yao, Zhenda Xie, Yixuan Wei, Jia Ning, Yue Cao, Zheng Zhang, Li Dong, Furu Wei, and Baining Guo. Swin transformer v2: Scaling up capacity and resolution. URL https://arxiv.org/abs/2111.09883.
  18. Scott Lundberg and Su-In Lee. A unified approach to interpreting model predictions. URL https://arxiv.org/abs/1705.07874.
  19. Wenjie Luo, Yujia Li, Raquel Urtasun, and Richard Zemel. Understanding the effective receptive field in deep convolutional neural networks. In Advances in Neural Information Processing Systems.
  20. [...] URL https://arxiv.org/abs/2512.24440.
  21. Alexander Modell, Patrick Rubin-Delanchy, and Nick Whiteley. The origins of representation manifolds in large language models. URL https://arxiv.org/abs/2505.18235.
  22. Jaideep Pathak, Shashank Subramanian, Peter Harrington, Sanjeev Raja, Ashesh Chattopadhyay, Morteza Mardani, Thorsten Kurth, David Hall, Zongyi Li, Kamyar Azizzadenesheli, Pedram Hassanzadeh, Karthik Kashinath, and Animashree Anandkumar. Fourcastnet: A global data-driven high-resolution weather model using adaptive fo... URL https://arxiv.org/abs/2202.11214.
  23. Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, and Yuxiong He. Zero: Memory optimizations toward training trillion parameter models. URL https://arxiv.org/abs/1910.02054.
  24. Benjamin Richards and Pushpa Kumar Balan. Physical consistency of aurora’s encoder: A quantitative study. URL https://arxiv.org/abs/2511.07787.
  25. Royal Meteorological Society. The Beaufort Wind Scale, n.d. URL https://www.rmets.org/metmatters/beaufort-wind-scale.
  26. Wojciech Samek, Alexander Binder, Grégoire Montavon, Sebastian Bach, and Klaus-Robert Müller. Evaluating the visualization of what a deep neural network has learned. URL https://arxiv.org/abs/1509.06321.
  27. Enrico Scoccimarro, Alessio Bellucci, and Daniele Peano. CMCC CMCC-CM2-VHR4 model output prepared for CMIP6 HighResMIP hist-1950.
  28. [...] doi: 10.1029/2019MS002002.
  29. Ruyi Yang, Jingyu Hu, Zihao Li, Jianli Mu, Tingzhao Yu, Jiangjiang Xia, Xuhong Li, Aritra Dasgupta, and Haoyi Xiong. Interpretable machine learning for weather and climate prediction: A survey. URL https://arxiv.org/abs/2403.18864.

Published as a workshop paper at SciForDL 2nd edition

A THE AURORA FOUNDATION MODEL ARCHITECTURE

Aurora (Bodnar et al., ... and ZeRO-based optimizations (Rajbhandari et al., 2020).

A.3 PRE-TRAINING OBJECTIVE

The model was pre-trained on a massive corpus of heterogeneous data, including ERA5 reanalysis (Hersbach et al., ...) and HRES operational forecasts (ECMWF, 2024), for approximately 150k steps using a Mean Absolute Error (MAE) objective. This pre-training forces the model to learn a compressed, physically consistent representation of atmospheric dynamics, which we probe in this study via the fixed weights of the AuroraSmallPretrained checkpoint.

B EXTENDED METHODOLOGY AND IM...

... to define a composite rule set that handles the heterogeneous layers of the Swin Transformer V2 U-Net. The relevance propagation rules $R_j = \sum_k \frac{z_{jk}}{\sum_j z_{jk}} R_k$ are parametrized as follows:

  • Convolutional and Linear Layers: We apply the LRP-ϵ rule with a stabilizer term ϵ = 0.25 to dampen noise and prevent numerical instability when activations approach zero: $R_j$ ...
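
The paper's own LRP-ϵ formula is cut off in the extract above. The sketch below therefore uses the standard form of the rule from the LRP literature for a single linear layer, with the stabilizer added to the denominator with the sign of the pre-activation. The function name `lrp_epsilon_linear`, the array shapes, and the toy inputs are assumptions; only the value ϵ = 0.25 comes from the text.

```python
import numpy as np

def lrp_epsilon_linear(a, w, b, relevance_out, eps=0.25):
    """Propagate relevance through one linear layer with the LRP-epsilon rule.

    a: input activations, shape (n_in,)
    w: weights, shape (n_in, n_out)
    b: bias, shape (n_out,)
    relevance_out: relevance of each output unit, shape (n_out,)
    Standard textbook form; the paper's exact variant is truncated in the source.
    """
    z = a[:, None] * w                 # contributions z_jk = a_j * w_jk
    z_sum = z.sum(axis=0) + b          # pre-activation of each output unit k
    # The stabilizer pushes the denominator away from zero, keeping its sign.
    denom = z_sum + eps * np.where(z_sum >= 0, 1.0, -1.0)
    return (z / denom) @ relevance_out  # R_j = sum_k z_jk / denom_k * R_k

# Toy layer: 5 non-negative inputs, 3 outputs, no bias.
rng = np.random.default_rng(0)
a = np.abs(rng.normal(size=5))
w = rng.normal(size=(5, 3))
b = np.zeros(3)
r_out = np.array([1.0, 0.5, 2.0])

r_in = lrp_epsilon_linear(a, w, b, r_out)                 # eps = 0.25, as in the text
r_conserved = lrp_epsilon_linear(a, w, b, r_out, eps=0.0)  # unstabilized rule
print(r_in.sum(), r_conserved.sum(), r_out.sum())
```

With ϵ = 0 and no bias the total relevance is conserved across the layer. A positive ϵ absorbs part of it, and that absorption is why the referee asks for sensitivity to the stabilization parameters.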