arxiv: 2606.09941 · v1 · pith:OMKCR4CVnew · submitted 2026-06-08 · 📊 stat.AP · cs.LG · stat.OT

Stochastic weather generators for high-frequency wind vector time series

Pith reviewed 2026-06-27 14:58 UTC · model grok-4.3

classification 📊 stat.AP · cs.LG · stat.OT
keywords stochastic weather generator · wind vector time series · high-frequency data · diurnal patterns · VQ-VAE · extreme value distribution · minute-scale observations

The pith

Machine learning models using vector-quantized autoencoders generate minute-scale wind vector time series that capture diurnal volatility changes but fail to match extreme wind speed distributions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops stochastic generators for high-frequency surface wind vectors at one Oklahoma site during June using time vector-quantized variational autoencoders. These models produce daily sequences either unconditionally or conditioned on the prior day, with optional discrete weather state inputs, to replicate complex observed patterns in speed and direction that standard time series methods miss. The work shows that the best generators reproduce diurnal shifts in wind volatility while falling short on the tails of the wind speed distribution. Such generators could supply realistic inputs to models in wind energy, wildfire spread, and aviation. The evaluation combines formal metrics with visual checks across more than thirty years of minute-scale observations.
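The day-by-day conditioning described above can be illustrated with a minimal sketch. Everything here is an assumption for illustration: `sample_day` is a hypothetical stand-in (a Gaussian random walk), not the paper's trained generator, and the 60-minute context length is taken from the Figure 11 caption on this page.

```python
import numpy as np

MINUTES_PER_DAY = 1440   # one wind vector per minute
CONTEXT_MINUTES = 60     # context length, per the Figure 11 caption


def sample_day(context, rng):
    """Hypothetical stand-in for a trained generator.

    Returns a (1440, 2) array of wind vectors. A random walk is used only so
    the chaining logic below can run; it is not a wind model.
    """
    start = context[-1] if context is not None else np.zeros(2)
    steps = rng.normal(scale=0.1, size=(MINUTES_PER_DAY, 2))
    return start + np.cumsum(steps, axis=0)


def sample_consecutive_days(n_days, seed=0):
    """Chain days: each new day is conditioned on the tail of the previous one."""
    rng = np.random.default_rng(seed)
    days, context = [], None
    for _ in range(n_days):
        day = sample_day(context, rng)
        days.append(day)
        context = day[-CONTEXT_MINUTES:]  # last hour feeds the next draw
    return np.stack(days)


series = sample_consecutive_days(3)
print(series.shape)  # (3, 1440, 2)
```

With a real generator in place of `sample_day`, the same loop would yield multi-day sequences without a discontinuity at midnight.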

Core claim

This work develops a range of machine learning models for generating realistic time series of surface wind vectors at a site in Lamont, Oklahoma based on more than 30 years of high quality measurements at the minute time scale. The data show complex diurnal structures in both wind speed and direction that would be challenging to capture with standard time series models, so we consider a number of machine learning approaches to producing a stochastic wind generator based on time vector-quantized variational autoencoders. We consider generating a day's worth of data at a time and generating a day of wind vectors conditional on the previous day's winds. We also study methods for incorporating a

What carries the argument

Time vector-quantized variational autoencoders (VQ-VAE) that generate daily wind vector sequences, either unconditionally or conditioned on the previous day's winds and optional discrete weather states.
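The vector-quantization step at the heart of a VQ-VAE can be sketched in a few lines. This is a generic nearest-codebook lookup under stated assumptions, not the paper's implementation: the sizes (1440 time steps, 64 latent dimensions, 512 codes) come from the Figure 9 caption, while the random latents and codebook are placeholders for learned values.

```python
import numpy as np


def quantize(z, codebook):
    """Map each row of z (T, D) to its nearest codebook row (K, D).

    Returns the integer codes (T,) and the quantized latents (T, D).
    """
    # Squared Euclidean distances via ||z||^2 - 2 z.e + ||e||^2, shape (T, K).
    d2 = (z ** 2).sum(axis=1, keepdims=True) - 2.0 * z @ codebook.T + (codebook ** 2).sum(axis=1)
    codes = d2.argmin(axis=1)
    return codes, codebook[codes]


rng = np.random.default_rng(0)
T, D, K = 1440, 64, 512               # sizes from the Figure 9 caption
z = rng.normal(size=(T, D))           # placeholder for encoder output
codebook = rng.normal(size=(K, D))    # placeholder for learned codebook
codes, z_q = quantize(z, codebook)
print(codes.shape, z_q.shape)         # (1440,) (1440, 64)
```

The resulting sequence of 1440 integer codes is the discrete representation that a second-stage sequence model can then learn to generate.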

If this is right

  • The generators can supply minute-scale wind inputs to downstream models in wind energy, wildfire spread, and aviation.
  • Diurnal volatility patterns in wind speed and direction are reproduced accurately enough for many applications.
  • Extreme wind speed tails remain mismatched, limiting use in risk-sensitive settings.
  • Incorporating weather state variables improves some features but does not resolve the extreme-value shortfall.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same VQ-VAE conditioning approach could be tested on data from other months or sites to check whether the diurnal capture generalizes beyond the June restriction.
  • Better extreme-value modeling might require hybrid methods that combine the current generators with separate tail models.
  • If the diurnal volatility match holds, these generators could reduce reliance on parametric assumptions in high-frequency wind simulations for operational forecasting.

Load-bearing premise

That restricting analysis to a single site and the month of June, combined with the VQ-VAE architecture and chosen conditioning schemes, is sufficient to capture the full range of complex diurnal structures present in the minute-scale observations.

What would settle it

Compare the distribution of generated extreme wind speeds against held-out minute-scale observations from the same site in June. A clear mismatch in the upper tail would confirm the reported shortfall on extremes; close agreement in the tail would contradict it.
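An upper-tail comparison of this kind could be sketched as follows. The function name, the chosen probability levels, and the synthetic Weibull data are all assumptions for illustration; no numbers here come from the paper.

```python
import numpy as np


def tail_comparison(observed, generated, probs=(0.99, 0.999, 0.9999)):
    """Compare upper quantiles and exceedance rates of two wind-speed samples.

    For each level p, returns (p, q_obs, q_gen, exceed), where exceed is the
    share of generated values above the observed p-quantile. If the tails
    match, exceed should be close to 1 - p.
    """
    rows = []
    for p in probs:
        q_obs = np.quantile(observed, p)
        q_gen = np.quantile(generated, p)
        exceed = float(np.mean(generated > q_obs))
        rows.append((p, float(q_obs), float(q_gen), exceed))
    return rows


# Synthetic stand-ins: "generated" speeds are capped to mimic a too-thin tail.
rng = np.random.default_rng(0)
observed = 6.0 * rng.weibull(2.0, size=200_000)
generated = np.minimum(observed, np.quantile(observed, 0.99))

for p, q_obs, q_gen, exceed in tail_comparison(observed, generated):
    print(f"p={p}: observed {q_obs:.2f}, generated {q_gen:.2f}, exceedance {exceed:.5f}")
```

Reporting exceedance rates alongside quantiles helps because a generator can match a given quantile roughly while still producing far too few values beyond it.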

Figures

Figures reproduced from arXiv: 2606.09941 by Abolfazl Sodagartojgi, Gemma E. Moran, Justin T. Greene, Kevin Eng, Michael L. Stein, Mingshi Cui, Zern Ke, Zhiqiu Xia.

Figure 1: Diurnal cycle in wind speed (m s⁻¹) for four months of the year. The left plot shows ten-minute averages of minute-by-minute median wind speeds while the right plot shows the ten-minute averages of minute-by-minute interquartile ranges (IQR). The previous observations suggest the daily wind cycle changes fundamentally depending on the time of year. Because of this complexity, simpler modeling approaches, …
Figure 2: Time series of Easterly and Northerly components of wind vector for three years. Gray shaded areas are nighttimes.
Figure 3: Smooth scatter plot of wind vectors during the nighttime (2-12 UTC) and daytime excluding 2 hours after sunrise and 2 hours before sunset (14-24 UTC). Bandwidth parameter in R function smoothScatter set to 0.3 in both plots.
Figure 4: Hourly averages of minute-by-minute medians in component-wise median wind vector. Large black circle corresponds to result for 0:00–0:59 UTC. Results for other hours move in a largely counterclockwise direction. Two small black circles correspond to times of sunset and sunrise. Center of each cross gives the hourly average and each axis of the cross indicates +/− one standard error based on treating resul…
Figure 5: Diurnal cycle in wind speed. Left plot shows ten-minute averages of minute-by-minute medians of wind speeds (black circles) and magnitudes of 10-minute averages of component-wise median wind vectors (gray plus signs). Right plot shows 10-minute averages of minute-by-minute differences between 0.1, 0.25, 0.75 and 0.9 quantiles and medians of wind speed. Dashed vertical lines in these and subsequent plots in…
Figure 6: Wind speeds greater than the 0.9999 quantile of 18.8 m s⁻¹. Red points are from a 30-minute period in 2008, magenta points are 6 consecutive minutes about 3 hours after the times of the red points and blue points are 16 points from a 17-minute period in 2011.
Figure 7: Diurnal cycle in one-minute changes in wind speed. Five curves are for 10-minute averages of minute-by-minute 0.1, 0.25, 0.5, 0.75 and 0.9 quantiles.
Figure 8: Densities for changes in wind vector relative to current wind direction. Top row for current wind speed 1-5 m s⁻¹, middle row 5-10 m s⁻¹, bottom row greater than 10 m s⁻¹. Contour levels are for log10 of the densities at values from -1 (innermost contours) to -3.5 (outermost) with increments of 0.5. Changes in the wind vector outside the ranges shown in these plots occur in about 1 in 1000 minutes. + si…
Figure 9: Stage 1: Time VQ-VAE. Learn a compact, discrete representation of high-resolution wind speed vectors using VQ-VAE. The encoder maps $X^{(i)} \in \mathbb{R}^{1440 \times 2}$ to a continuous latent embedding $Z^{(i)} \in \mathbb{R}^{1440 \times 64}$, where $T = 1440$ represents the $60 \times 24 = 1440$ minutes of day $i$ and $D = 64$ matches the length of each codebook vector $e_k$, where $E = \{e_k\}_{k=1}^{512}$.
Figure 10: MaskGIT training. For day $i$, $s_{-1}$ denotes $s^{(i-1)}_{1:C}$, $s$ denotes $s^{(i)}_{1:T}$, and $s_M$ denotes $s^{(i)}_{m(1:T)}$. Denote the bidirectional transformer as $f_\theta : \{1,\dots,K\}^{C+T} \to \mathbb{R}^{T \times K}$. That is, $f_\theta$ takes in a $(C+T)$-dimensional sequence and outputs a vector of unnormalized probabilities over $K$ for each time point $t = 1,\dots,T$. Here, $C$ is the dimension of a conditioning vector; in the independent one-day generative mode…
Figure 11: Iterative Decoding to sample synthetic wind data. We can sample consecutive synthetic days by sampling a new day, conditioned on the last 60 minutes of the previously sampled day.
Figure 12: Differences between the day-level energy score performance for the Embedded generator using consecutive sampling and the energy score based on the training data; that is, $\mathrm{ES}(V_{\mathrm{synth}}, v^{(i)}_{\mathrm{test}}) - \mathrm{ES}(V_{\mathrm{train}}, v^{(i)}_{\mathrm{test}})$ versus $\mathrm{ES}(V_{\mathrm{train}}, v^{(i)}_{\mathrm{test}})$ for each test day $i = 1,\dots,185$. [Plot: x-axis "Energy Score with Training"; y-axis "Difference (Synthetic − Training)"; points colored by average wind speed.]
Figure 13: Hourly energy score differences for Embedded generator with consecutive sampling for every other hour of the day. Each subplot shows one hour of the day with 185 (one per test day) colored by average wind speed over that hour in the testing data. The x-axis shows energy score with training data as the forecast distribution, the y-axis shows the difference between synthetic and training energy scores, and …
Figure 14: Density plots of wind vector for daytime and nighttime. Top row is for training data as in …
Figure 15: Comparisons of observed to simulated diurnal cycles in median winds as in …
Figure 16: Ten-minute averages of minute-by-minute quantiles. Circles for observations, ×'s for simulated data with left column for Embedded generator and right column for Features generator. Dark blue and black for 0.5 quantiles and light blue and gray for 0.9 quantiles.
Figure 17: Wind speeds above 0.9999 quantile for a simulation from Embedded (top) and Features (bottom) generators. Compare to …
Figure 18: Same as …
Figure 19: Quantile regression validation of stochastic volatility dynamics at $\tau = 0.9$. (a) Relative reduction in the minimized criterion value $C_{0.9}$ as predictive complexity increases from scalar lag-1 to full vector history. (b) Baseline volatility under the null (intercept-only) model, quantified by $C_{0.9}$. Black markers denote observed data; red markers denote the Weather as Embeddings generator; blue markers denot…
Original abstract

Surface winds can vary substantially from one minute to the next, so there is scope for studying its variation on this fine time scale. Restricting to the month of June to minimize seasonality, this work develops a range of machine learning models for generating realistic time series of surface wind vectors at a site in Lamont, Oklahoma based on more than 30 years of high quality measurements at the minute time scale. Such a generator could be used as an input into models from a range of disciplines, notably for wind energy, but also wildfire spread and aviation, among others. The data show complex diurnal structures in both wind speed and direction that would be challenging to capture with standard time series models, so we consider a number of machine learning approaches to producing a stochastic wind generator based on time vector-quantized variational autoencoders. We consider generating a day's worth of data at a time and generating a day of wind vectors conditional on the previous day's winds. We also study methods for incorporating a discrete weather state variable in the generator. We evaluate the generators using a wide range of formal and informal methods. The best of these generators can capture many but not all of the complex features present in the observational data. In particular, the best of our approaches accurately mimic diurnal changes in wind volatility but struggle to match the observed distribution of extreme wind speeds.
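Among the formal evaluation methods, the figure captions on this page mention the energy score. A minimal sample-based version of that score is sketched below, using the standard ensemble estimator: the mean distance from ensemble members to the observation, minus half the mean pairwise distance between members. How the paper arranges each day or hour into a vector is not stated here, so the flat `(m, d)` layout and the toy data are assumptions.

```python
import numpy as np


def energy_score(ensemble, y):
    """Sample-based energy score; lower is better.

    ensemble: (m, d) array of m simulated vectors.
    y:        (d,) observed vector.
    """
    ensemble = np.asarray(ensemble, dtype=float)
    y = np.asarray(y, dtype=float)
    # Mean Euclidean distance from each member to the observation.
    term1 = np.linalg.norm(ensemble - y, axis=1).mean()
    # Mean Euclidean distance over all ordered pairs of members.
    diffs = ensemble[:, None, :] - ensemble[None, :, :]
    term2 = np.linalg.norm(diffs, axis=2).mean()
    return term1 - 0.5 * term2


rng = np.random.default_rng(0)
y = np.zeros(10)
centered = rng.normal(size=(200, 10))   # ensemble centered on the observation
shifted = centered + 3.0                # same spread, biased location
print(energy_score(centered, y) < energy_score(shifted, y))  # True
```

Differences such as the "synthetic minus training" quantity in the Figure 12 caption can then be formed by scoring two ensembles against the same test observation.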

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript develops several VQ-VAE-based stochastic generators for minute-scale wind vector time series, restricted to June observations at a single Oklahoma site. It examines unconditional day-long generation, generation conditional on the prior day, and variants that incorporate a discrete weather state variable. Using a mix of formal and informal diagnostics, the authors conclude that the strongest models reproduce observed diurnal volatility patterns but do not reproduce the distribution of extreme wind speeds.

Significance. If the empirical findings are substantiated, the work supplies a practical exploratory template for high-frequency wind simulation that can accommodate complex diurnal structure, relevant to wind-energy, wildfire, and aviation applications. The explicit qualification of partial success (diurnal features captured, extremes not) and the breadth of evaluation diagnostics are positive features. The narrow single-site/single-month scope and absence of quantitative performance metrics, however, constrain immediate broader utility.

major comments (3)
  1. [Data and Methods] Data and Methods section: no description is given of the training/validation/test split (or any cross-validation procedure), which is load-bearing for any claim that the generators generalize to held-out observational data.
  2. [Evaluation] Evaluation section: the central claim that the best models 'accurately mimic diurnal changes in wind volatility' but 'struggle to match the observed distribution of extreme wind speeds' is stated without accompanying quantitative metrics (e.g., specific distributional distances, quantile errors, or statistical tests with uncertainty), preventing assessment of effect size.
  3. [Results and Discussion] Results and Discussion: the restriction to a single site and the month of June is presented without quantitative sensitivity checks or discussion of how diurnal structure may vary across seasons or locations, which directly affects the scope of the reported success on diurnal features.
minor comments (2)
  1. [Abstract] Abstract: the data span is described only as 'more than 30 years'; supplying the exact number of years or total minute-level observations would improve precision.
  2. [Methods] Notation: the precise definition and embedding of the discrete weather state variable within the VQ-VAE conditioning should be stated explicitly (currently only alluded to).

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive and detailed report. We address each major comment below and commit to revisions that strengthen the manuscript without overstating its scope.

Point-by-point responses
  1. Referee: [Data and Methods] Data and Methods section: no description is given of the training/validation/test split (or any cross-validation procedure), which is load-bearing for any claim that the generators generalize to held-out observational data.

    Authors: We agree that an explicit description of the data partitioning procedure is necessary for reproducibility and to substantiate generalization claims. The original manuscript omitted these details. We will add a new subsection to Data and Methods specifying the chronological split used (first 25 years for training, subsequent 5 years for validation, final 5 years for testing) together with the rationale for a temporal rather than random partition in time-series settings. revision: yes

  2. Referee: [Evaluation] Evaluation section: the central claim that the best models 'accurately mimic diurnal changes in wind volatility' but 'struggle to match the observed distribution of extreme wind speeds' is stated without accompanying quantitative metrics (e.g., specific distributional distances, quantile errors, or statistical tests with uncertainty), preventing assessment of effect size.

    Authors: The Evaluation section currently relies on a suite of visual and informal diagnostics. To provide quantitative support for the stated effect sizes, we will insert explicit metrics: Earth Mover's distance between generated and observed wind-speed distributions, mean absolute deviation on hourly volatility statistics, and bootstrap confidence intervals on selected quantile errors. These additions will allow readers to gauge the magnitude of the diurnal capture versus extreme-value mismatch. revision: yes

  3. Referee: [Results and Discussion] Results and Discussion: the restriction to a single site and the month of June is presented without quantitative sensitivity checks or discussion of how diurnal structure may vary across seasons or locations, which directly affects the scope of the reported success on diurnal features.

    Authors: The June/single-site restriction was chosen deliberately to isolate diurnal structure by removing seasonal confounding, as stated in the abstract. We will expand the Discussion to include a qualitative review, supported by cited meteorological literature, of how diurnal wind patterns can differ by season and geographic setting. Quantitative sensitivity checks across additional sites and months are not feasible with the present dataset; we will therefore frame this explicitly as a scope limitation rather than performing new empirical checks. revision: partial
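Two of the metrics proposed in the responses above, Earth Mover's distance and bootstrap confidence intervals on quantile errors, can be sketched as follows. This is an illustration under assumptions, not the authors' procedure: the sorted-sample shortcut requires equal sample sizes, and resampling whole days (rather than individual minutes) is a choice made here to respect within-day dependence. The data are synthetic.

```python
import numpy as np


def wasserstein_1d(a, b):
    """Earth Mover's (Wasserstein-1) distance between equal-sized 1-D samples."""
    a, b = np.sort(np.ravel(a)), np.sort(np.ravel(b))
    if a.size != b.size:
        raise ValueError("this shortcut assumes equal sample sizes")
    return float(np.abs(a - b).mean())


def bootstrap_quantile_error(obs_days, gen_days, p=0.9, n_boot=500, seed=0):
    """Percentile 95% CI for q_p(generated) - q_p(observed).

    obs_days, gen_days: (n_days, n_minutes) arrays; whole days are resampled.
    """
    rng = np.random.default_rng(seed)
    errs = np.empty(n_boot)
    for b in range(n_boot):
        o = obs_days[rng.integers(0, len(obs_days), len(obs_days))]
        g = gen_days[rng.integers(0, len(gen_days), len(gen_days))]
        errs[b] = np.quantile(g, p) - np.quantile(o, p)
    return np.quantile(errs, [0.025, 0.975])


rng = np.random.default_rng(1)
obs_days = 6.0 * rng.weibull(2.0, size=(100, 144))   # synthetic "observed" speeds
gen_days = 0.8 * obs_days                            # synthetic "generated" speeds, too calm
print(wasserstein_1d(obs_days, gen_days))
print(bootstrap_quantile_error(obs_days, gen_days, p=0.9))
```

An interval that excludes zero would indicate a quantile error larger than resampling noise alone can explain.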

Circularity Check

0 steps flagged

No significant circularity identified

full rationale

The paper is an empirical ML study that trains VQ-VAE models on held-out observational wind data from one site and evaluates generated time series against formal and informal diagnostics on the same external dataset. No derivation chain, fitted parameter renamed as prediction, or self-citation load-bearing step exists; all claims reduce to standard training/evaluation against independent benchmarks rather than to the model's own inputs by construction.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 0 invented entities

The central modeling effort rests on standard VAE training assumptions plus the domain choice of June-only data from one site; no new entities are postulated.

free parameters (2)
  • VAE architecture hyperparameters and training schedule
    Standard in neural generative models; values chosen to fit the wind dataset.
  • Number of discrete codes in vector quantization
    Chosen during model design to balance reconstruction and generation quality.
axioms (1)
  • domain assumption: June data from the Lamont site sufficiently represent the target diurnal structures without seasonal confounding
    Explicitly stated as the reason for restricting to one month.

pith-pipeline@v0.9.1-grok · 5796 in / 1181 out tokens · 21532 ms · 2026-06-27T14:58:12.734459+00:00 · methodology

