
arxiv: 2605.20539 · v1 · pith:EBOUXKN5new · submitted 2026-05-19 · 💻 cs.LG

OpenSeisML: Open Large-Scale Real Seismic and well-log Dataset for Generative AI

Pith reviewed 2026-05-21 06:44 UTC · model grok-4.3

classification 💻 cs.LG
keywords seismic inversion · generative AI · open dataset · well logs · subsurface properties · uncertainty quantification · machine learning · geophysics

The pith

OpenSeisML supplies curated public seismic volumes and well logs to train generative models that produce multiple subsurface realizations for uncertainty quantification in inversion.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces OpenSeisML, a collection of real seismic and well-log data drawn from publicly available UK surveys. It describes an automated pipeline that converts time-domain seismic data to depth using interpolated checkshot information to build consistent velocity models. The central goal is to give researchers open material for training generative AI models that learn the statistical patterns of subsurface properties. With such models, multiple statistically consistent realizations can be synthesized to support uncertainty estimates that then serve as priors in seismic inversion. This approach directly tackles the shortage of realistic open data that has restricted machine-learning progress in geophysics.

Core claim

The authors present OpenSeisML as an open large-scale dataset of real seismic volumes and well logs, assembled through an automated curation pipeline that performs time-to-depth conversion via checkshot interpolation, specifically to enable generative models that capture the statistical distribution of subsurface properties and thereby generate multiple realizations for uncertainty quantification in seismic inversion.

What carries the argument

The OpenSeisML dataset together with its automated curation pipeline that converts time-domain seismic to depth using checkshot interpolation to produce reproducible velocity models suitable for generative modeling.

Load-bearing premise

The selected UK public survey data, once processed through the automated time-to-depth conversion, sufficiently represent the statistical distribution of subsurface properties so that generative models trained on them can produce useful realizations for other regions or surveys.

What would settle it

Train a generative model on OpenSeisML and test whether the resulting realizations, when used as priors, measurably improve inversion accuracy or uncertainty calibration on a held-out seismic survey from a different geological setting.
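
The calibration half of this test can be made concrete with an empirical coverage check: if an ensemble of realizations is well calibrated, its central 90% interval should contain the held-out value in roughly 90% of cells. The following is a minimal sketch under stated assumptions, not anything taken from the paper: it uses NumPy, synthetic Gaussian stand-ins for the held-out velocities and the generated realizations, and a hypothetical helper named `interval_coverage`.

```python
import numpy as np

def interval_coverage(samples, truth, level=0.9):
    """Fraction of cells whose true value lies inside the central `level` interval."""
    tail = (1.0 - level) / 2.0
    lower = np.quantile(samples, tail, axis=0)        # per-cell lower bound
    upper = np.quantile(samples, 1.0 - tail, axis=0)  # per-cell upper bound
    return float(np.mean((truth >= lower) & (truth <= upper)))

rng = np.random.default_rng(0)
mean_v, spread = 2500.0, 300.0  # hypothetical velocity statistics (m/s)
truth = mean_v + spread * rng.standard_normal((64, 64))

# Ensemble drawn from the same distribution as the truth: coverage near 0.9.
calibrated = mean_v + spread * rng.standard_normal((200, 64, 64))
# Ensemble that is too narrow: coverage well below the nominal 0.9.
overconfident = mean_v + 0.3 * spread * rng.standard_normal((200, 64, 64))

cov_calibrated = interval_coverage(calibrated, truth)
cov_overconfident = interval_coverage(overconfident, truth)
print(cov_calibrated, cov_overconfident)
```

In a real test, `truth` would be well-derived velocities from the held-out survey and `samples` would be realizations from the trained model; a coverage far from the nominal level in either direction would count against the calibration claim.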

Figures

Figures reproduced from arXiv: 2605.20539 by Charles Jones, Felix J. Herrmann, Huseyin Tuna Erdinc, Ipsita Bhar, Thales Souza.

Figure 1. The flow diagram for the data curation pipeline; the table shows the well logs, along with their units, present in the LAS files.
Adjacent text: … ability to generalize across diverse marine environments. We have automated our data curation pipeline to produce structured and consistent datasets that can be used directly for training without additional preprocessing. MACHINE LEARNING DATA CURATION PIPELINE. The data curation …
Figure 2. UKNDR GUI for seismic data filtering and downloading corresponding survey boundaries.
Adjacent text: To represent 3D seismic data on a regular grid, we first estimate survey boundaries using a concave hull to account for irregular acquisition geometries, and then extract the largest contiguous rectangular region with regular grids (GeeksforGeeks, 2025). Checkshot-Based Velocity Volume Construction for Time-Depth Seismi…
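
The Figure 2 excerpt mentions extracting the largest contiguous rectangular region and cites a GeeksforGeeks article on the stack-based largest-rectangle-in-a-histogram method. The paper's own implementation is not shown here, so the following is only a sketch of that general technique applied row by row to a 2D validity mask. The function name `largest_rectangle` and the toy mask are hypothetical; in practice the mask would mark inline/crossline cells that fall inside the concave hull.

```python
def largest_rectangle(mask):
    """Return (area, top, left, height, width) of the largest all-valid rectangle."""
    n_cols = len(mask[0]) if mask else 0
    heights = [0] * n_cols
    best = (0, 0, 0, 0, 0)
    for r, row in enumerate(mask):
        # Histogram of consecutive valid cells ending at row r.
        heights = [h + 1 if cell else 0 for h, cell in zip(heights, row)]
        stack = []  # column indices whose heights are strictly increasing
        for c in range(n_cols + 1):
            cur = heights[c] if c < n_cols else 0  # sentinel 0 flushes the stack
            while stack and heights[stack[-1]] >= cur:
                h = heights[stack.pop()]
                left = stack[-1] + 1 if stack else 0
                w = c - left
                if h * w > best[0]:
                    best = (h * w, r - h + 1, left, h, w)
            stack.append(c)
    return best

# Toy survey footprint: 1 = trace present, 0 = outside the acquisition area.
mask = [
    [0, 1, 1, 1, 0],
    [1, 1, 1, 1, 0],
    [1, 1, 1, 1, 1],
    [0, 1, 1, 0, 0],
]
result = largest_rectangle(mask)
print(result)  # (9, 0, 1, 3, 3): a 3x3 block starting at row 0, column 1
```

Each row costs time linear in the number of columns, so the whole mask is processed in time proportional to its size.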
Figure 3. 3-D visualization of smooth velocity field constructed using checkshots.
Adjacent text: The general RBF interpolant is defined as:

$$ f(x) = \sum_{i=1}^{N} \lambda_i \, \phi(\lVert x - x_i \rVert) \tag{1} $$

  • x represents a spatial location where velocity is estimated (e.g., a grid point in the seismic volume: (x, y) or (x, y, z))
  • x_i are the locations of known data points (checkshot positions)
  • f(x) represents the interpolated value, i.e., velocity at location x
  • λ_i we…
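
The RBF interpolant quoted in the Figure 3 entry can be illustrated with a small NumPy example. This is a sketch under assumptions, not the authors' code: the excerpt is truncated before it names a kernel, so a Gaussian kernel φ(r) = exp(−(εr)²) is assumed here, and the helper names `rbf_fit` and `rbf_eval`, the shape parameter `epsilon`, and the checkshot coordinates and velocities are all hypothetical.

```python
import numpy as np

def gaussian_kernel(r, epsilon):
    """Assumed kernel phi(r); the source does not specify which one is used."""
    return np.exp(-(epsilon * r) ** 2)

def rbf_fit(points, values, epsilon=1.0):
    """Solve Phi @ lam = values for the weights lam_i of the interpolant."""
    r = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    return np.linalg.solve(gaussian_kernel(r, epsilon), values)

def rbf_eval(x, points, lam, epsilon=1.0):
    """Evaluate f(x) = sum_i lam_i * phi(||x - x_i||) at each row of x."""
    r = np.linalg.norm(x[:, None, :] - points[None, :, :], axis=-1)
    return gaussian_kernel(r, epsilon) @ lam

# Hypothetical checkshot locations (km) and velocities (m/s).
points = np.array([[0.0, 0.0], [1.0, 0.2], [0.3, 1.1], [1.2, 1.0]])
values = np.array([1800.0, 2100.0, 1950.0, 2300.0])

lam = rbf_fit(points, values)
grid = np.array([[0.5, 0.5], [1.0, 1.0]])  # locations to interpolate onto
v_grid = rbf_eval(grid, points, lam)
print(v_grid)
```

By construction the interpolant reproduces the input velocities at the checkshot positions. For production use, SciPy's `scipy.interpolate.RBFInterpolator` offers several kernels and optional smoothing.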
Figure 5. Quasi-2D lines passing through different well locations.
Adjacent text: The extracted seismic sections corresponding to the quasi-2D lines were resampled to 256×512 using a 2D FFT, where a smooth low-pass filter with a cosine taper suppresses high-frequency components, preserving low frequencies while grad…
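
The FFT-based resampling described in the Figure 5 excerpt can be sketched as follows. This is one plausible reading rather than the authors' implementation: the centered spectrum is cropped (or zero-padded) to the target size and multiplied by a separable cosine taper before the inverse transform. The taper fraction `frac`, the function names, and the test image are assumptions.

```python
import numpy as np

def cosine_taper(n, frac=0.2):
    """Low-pass weights for an fftshift-ed axis: 1 in the passband, cosine roll-off to 0 at Nyquist."""
    f = np.abs(np.arange(n) - n // 2) / (n / 2)  # normalized |frequency| in [0, 1]
    start = 1.0 - frac
    w = np.ones(n)
    roll = f > start
    w[roll] = 0.5 * (1.0 + np.cos(np.pi * (f[roll] - start) / frac))
    return w

def fft_resample(img, out_shape, frac=0.2):
    """Resample a 2D array to out_shape by cropping or padding its centered spectrum."""
    n0, n1 = img.shape
    m0, m1 = out_shape
    spec = np.fft.fftshift(np.fft.fft2(img))
    k0, k1 = min(n0, m0), min(n1, m1)              # size of the retained band
    s0, s1 = n0 // 2 - k0 // 2, n1 // 2 - k1 // 2  # where the band starts in the input
    d0, d1 = m0 // 2 - k0 // 2, m1 // 2 - k1 // 2  # where it lands in the output
    band = spec[s0:s0 + k0, s1:s1 + k1]
    band = band * np.outer(cosine_taper(k0, frac), cosine_taper(k1, frac))
    out = np.zeros(out_shape, dtype=complex)
    out[d0:d0 + k0, d1:d1 + k1] = band
    res = np.fft.ifft2(np.fft.ifftshift(out))
    return res.real * (m0 * m1) / (n0 * n1)        # keep amplitudes unchanged

# Hypothetical 300x600 section with three vertical cycles.
img = np.cos(2 * np.pi * 3 * np.arange(300)[:, None] / 300) * np.ones((1, 600))
resampled = fft_resample(img, (256, 512))
print(resampled.shape)
```

Low-frequency content passes through unchanged, while components near the new Nyquist frequency are attenuated smoothly, which limits ringing compared with a hard spectral cut.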
Figure 8. (a) represents the ground truth velocity the network learns from the wells, (b) represents the corresponding seismic section, (c) is a velocity sample as reconstructed by the diffusion model, and (d) presents the overlaid velocity sample on the seismic image.
Adjacent text: RESULTS. A diffusion-based generative model (Erdinc et al., 2024) was initially trained on the Compass dataset having the same dimension and sampling int…
Figure 6. Seismic image after converting from the time to the depth domain, visualised in OpendTect.
Figure 7. Both figures show different sections of 2-D resampled seismic in depth tied to wells.
Original abstract

The advent of machine learning (ML) and computer vision has significantly accelerated seismic inversion workflows by reducing the computational cost of traditionally expensive iterative methods. However, the development and evaluation of ML methods remain limited by the scarcity of realistic velocity models, as most high-quality data are privately owned by oil and gas companies. To address this gap, we present OpenSeisML, a collection of real seismic datasets designed to support generative AI (Gen-AI) workflows for seismic inversion. The datasets are curated from publicly available surveys in the UK National Data Repository (NDR). When seismic volumes are in the time domain and wells are in depth, a time-to-depth conversion is required. We use checkshot data to establish the time-depth relationship and construct a velocity model through interpolation for accurate conversion of post-stack seismic data. Here, we present an automated data curation pipeline that enables seismic data preparation while ensuring reproducibility. The objective is to train a generative model that captures the statistical distribution of subsurface properties, enabling the synthesis of multiple statistically consistent realizations for uncertainty quantification which can act as a prior for seismic inversion.
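
The conversion step the abstract describes can be illustrated for a single trace. The following is a minimal sketch, not the authors' pipeline: it assumes NumPy, a checkshot table of depth and two-way-time pairs valid at the trace location, and piecewise-linear interpolation between checkshot levels. The function name `time_to_depth` and all numbers are hypothetical.

```python
import numpy as np

def time_to_depth(trace, dt, cs_twt, cs_depth, dz, z_max):
    """Resample a time-domain trace onto a regular depth axis using a checkshot curve."""
    t_axis = np.arange(len(trace)) * dt             # two-way time of each sample (s)
    z_axis = np.arange(0.0, z_max + dz / 2, dz)     # target depth axis (m)
    twt_at_z = np.interp(z_axis, cs_depth, cs_twt)  # checkshot lookup: depth -> two-way time
    return z_axis, np.interp(twt_at_z, t_axis, trace)

# Hypothetical checkshot for a constant 2000 m/s medium: twt = 2 * depth / v.
cs_depth = np.array([0.0, 1000.0, 2000.0])  # m
cs_twt = np.array([0.0, 1.0, 2.0])          # s

dt = 0.004                                  # 4 ms sampling
trace = np.arange(501) * dt                 # toy trace whose amplitude equals its own time
z_axis, depth_trace = time_to_depth(trace, dt, cs_twt, cs_depth, dz=10.0, z_max=2000.0)
print(depth_trace[50])                      # sample at 500 m depth
```

With this toy trace, each depth sample simply reports the two-way time it was pulled from, which makes the mapping easy to inspect. In a full volume the checkshot curve would vary laterally, for example through an interpolated velocity model.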

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper presents OpenSeisML, a collection of real seismic and well-log datasets curated from publicly available UK National Data Repository (NDR) surveys. It describes an automated data curation pipeline that performs time-to-depth conversion of post-stack seismic volumes by interpolating checkshot data to construct velocity models. The objective is to enable training of generative AI models that capture the statistical distribution of subsurface properties for synthesizing multiple realizations to support uncertainty quantification in seismic inversion.

Significance. If the curation and conversion steps are shown to preserve the relevant statistical properties of real subsurface geology, the open release of this large-scale dataset would meaningfully advance machine-learning applications in geophysics by removing a key barrier of data scarcity. The emphasis on an automated, reproducible pipeline is a concrete strength that supports open science and community reuse.

major comments (1)
  1. [Automated data curation pipeline] The description of time-to-depth conversion via checkshot interpolation supplies no quantitative validation (mis-tie analysis, comparison to sonic logs or depth-migrated volumes, or checks on preservation of autocorrelation lengths, impedance contrasts, or variograms). This directly affects whether the released volumes can serve as training data whose statistical distribution matches real subsurface properties, which is load-bearing for the central claim that the dataset supports useful generative models for uncertainty quantification.
minor comments (1)
  1. [Abstract] The abstract would benefit from explicit statements of dataset scale (number of surveys, total inline/crossline counts, or total volume in GB) to substantiate the 'large-scale' descriptor.
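
One of the statistical checks named in the major comment, the variogram, is straightforward to compute on a gridded section. The sketch below is illustrative only and is not drawn from the paper: it assumes NumPy, integer lags along one grid axis, and a hypothetical helper `empirical_variogram`.

```python
import numpy as np

def empirical_variogram(section, max_lag, axis=0):
    """gamma(h) = 0.5 * mean((z(x + h) - z(x))**2) for integer lags h = 1..max_lag."""
    sec = np.moveaxis(np.asarray(section, dtype=float), axis, 0)
    return np.array([
        0.5 * np.mean((sec[h:] - sec[:-h]) ** 2)  # all sample pairs separated by h cells
        for h in range(1, max_lag + 1)
    ])

# Toy section that increases by one unit per row: gamma(h) = 0.5 * h**2 along rows.
ramp = np.arange(50, dtype=float)[:, None] * np.ones((1, 20))
gamma_rows = empirical_variogram(ramp, max_lag=3, axis=0)
gamma_cols = empirical_variogram(ramp, max_lag=3, axis=1)
print(gamma_rows, gamma_cols)
```

Comparing such curves between a converted volume and well-log or depth-domain references would give one quantitative view of whether spatial statistics survive the conversion.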

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive feedback and for recognizing the potential of OpenSeisML to advance machine learning applications in geophysics. We provide a point-by-point response to the major comment below.

  1. Referee: Automated data curation pipeline: the description of time-to-depth conversion via checkshot interpolation supplies no quantitative validation (mis-tie analysis, comparison to sonic logs or depth-migrated volumes, or checks on preservation of autocorrelation lengths, impedance contrasts, or variograms). This directly affects whether the released volumes can serve as training data whose statistical distribution matches real subsurface properties, which is load-bearing for the central claim that the dataset supports useful generative models for uncertainty quantification.

    Authors: We acknowledge the validity of this observation. The manuscript as submitted emphasizes the design of the automated, reproducible curation pipeline but does not present quantitative validation results for the time-to-depth conversion step. To address this, we will revise the manuscript to include quantitative assessments. Specifically, we will perform and report mis-tie analyses at well locations using the interpolated velocity models, compare converted seismic data with any available depth-domain equivalents where they exist in the public domain, and evaluate preservation of statistical properties including autocorrelation lengths and variogram models on selected volumes. These additions will be supported by figures and tables in a new validation subsection. We believe this will confirm that the converted data retain the essential geological statistics needed for generative modeling. revision: yes
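
The mis-tie analysis promised in the response could be summarized along the following lines. This is a hedged sketch, not the authors' planned procedure: it assumes that for each well marker a depth predicted from the interpolated velocity model and a measured well depth are available, and the function name `mistie_stats` and the example numbers are hypothetical.

```python
import numpy as np

def mistie_stats(predicted_depth, well_depth):
    """Summarize depth mis-ties (predicted minus measured) in meters."""
    d = np.asarray(predicted_depth, dtype=float) - np.asarray(well_depth, dtype=float)
    return {
        "mean": float(d.mean()),                # systematic bias
        "rms": float(np.sqrt(np.mean(d ** 2))), # overall spread
        "max_abs": float(np.abs(d).max()),      # worst single tie
    }

# Hypothetical marker depths at three wells (m).
predicted = [1005.0, 1490.0, 2010.0]
measured = [1000.0, 1500.0, 2000.0]
stats = mistie_stats(predicted, measured)
print(stats)
```

A nonzero mean points to a systematic velocity bias, while a large RMS with near-zero mean suggests local interpolation error between checkshot locations.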

Circularity Check

0 steps flagged

Data-release paper with no derivations, predictions, or self-referential claims


The manuscript is a description of a curated public seismic dataset and an automated curation pipeline for time-to-depth conversion. No equations, fitted parameters, generative-model outputs, or statistical predictions are presented as results derived from first principles. The central contribution is the release of the data volumes themselves; the pipeline steps are procedural rather than deductive. No self-citations are invoked to justify uniqueness or to close a logical loop. The work is therefore self-contained against external benchmarks and contains no load-bearing step that reduces to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The work rests on standard geophysical data-preparation practices rather than new theoretical constructs. No free parameters or invented entities are introduced in the abstract; the main reliance is on domain conventions for time-depth conversion.

axioms (1)
  • Domain assumption: checkshot data from wells can be interpolated to produce a reliable time-depth relationship for converting post-stack seismic volumes.
    Invoked when describing the conversion step for time-domain seismic data.


Reference graph

Works this paper leans on

56 extracted references · 56 canonical work pages

  1. [1] Alaudah, Y., P. Michałowicz, M. Alfarraj, and G. AlRegib, 2019, A machine-learning benchmark for facies classification: Interpretation, 7, SE175–SE187.
  2. [2] Chen, H., J. Chen, M. D. Sacchi, J. Gao, and P. Yang, 2025, Unsupervised seismic acoustic impedance inversion based on generative diffusion model: Geophysics, 90, M109–M121.
  3. [3] Consolvo, B., 2023, Seismic data to subsurface models with OpenFWI: Training an AI model on the latest Intel Xeon CPU with PyTorch 2.0: https://medium.com/better-programming/seismic-data-to-subsurface-models-with-openfwi-bcca0218b4e8. (Medium article, accessed 2026)
  4. [4] Deng, C., S. Feng, H. Wang, X. Zhang, P. Jin, Y. Feng, Q. Zeng, Y. Chen, and Y. Lin, 2022, OpenFWI: Large-scale multi-structural benchmark datasets for full waveform inversion: Presented at the Advances in Neural Information Processing Systems (NeurIPS), Curran Associates, Inc.
  5. [5] dGB Earth Sciences, 2026, OpendTect Pro & dGB plugins documentation - 7.0: https://doc.opendtect.org/7.0.0/doc/dgb_userdoc/Default.htm. (Accessed: 2026-02-12)
  6. [6] Donoho, D., 2024, Data Science at the Singularity: Harvard Data Science Review, 6. (https://hdsr.mitpress.mit.edu/pub/g9mau4m0)
  7. [8] Fehler, M., and A. Cheng, 2007, SEAM: The SEG Advanced Modeling project, phase I: AGU Fall Meeting Abstracts.
  8. [10] GeeksforGeeks, 2025, Largest rectangular area in a histogram using stack: https://www.geeksforgeeks.org/dsa/largest-rectangular-area-in-a-histogram-using-stack/. (Accessed: 2026)
  9. [11] Herron, D. A., 2011, First steps in seismic interpretation: Society of Exploration Geophysicists, volume 16 of Society of Exploration Geophysicists Geophysical Monograph Series.
  10. [12] Janssen, V., 2009, Understanding coordinate reference systems, datums and transformations: International Journal of Geoinformatics, 5.
  11. [13] Jin, P., Y. Feng, S. Feng, H. Wang, Y. Chen, B. Consolvo, Z. Liu, and Y. Lin, 2024, An empirical study of large-scale data-driven full waveform inversion: Scientific Reports, 14, 20034.
  12. [14] Jones, C. E., J. A. Edgar, J. I. Selvage, and H. Crook, 2012, Building complex synthetic models to evaluate acquisition geometries and velocity inversion technologies: 74th EAGE Conference and Exhibition Incorporating EUROPEC 2012, European Association of Geoscientists & Engineers, cp-293-00580.
  13. [15] Kosloff, D. D., and Y. Sudman, 2002, Uncertainty in determining interval velocities from surface reflection seismic data: Geophysics, 67, 952–963.
  14. [16] Mekonnin, A., K. Wacławiak, M. Humayun, S. Zhang, and H. Ullah, 2025, Hydrogen storage technology, and its challenges: A review: Catalysts, 15, 260.
  15. [17] North Sea Transition Authority, 2026, UK National Data Repository: https://www.nstauthority.co.uk/data-and-insights/data/uk-national-data-repository/. (Contains information provided by the North Sea Transition Authority and/or other third parties)
  16. [18] Orozco, R., A. Siahkoohi, M. Louboutin, and F. J. Herrmann, 2025, ASPIRE: Iterative amortized posterior inference for Bayesian inverse problems: Inverse Problems, 41, 045001.
  17. [19] Pangman, P., and N. Blythe, 2014, SEAM update: SEAM participants share their views: The Leading Edge, 33, 234–236.
  18. [20] Skala, V., 2017, Radial basis function interpolation and applications: An incremental approach: Latest Trends on Applied Mathematics, Simulation, Modelling, 1–8.
  19. [21] Yin, Z., R. Orozco, and F. J. Herrmann, 2025, WISER: Multimodal variational inference for full-waveform inversion without dimensionality reduction: Geophysics, 90, A1–A7.
  20. [22] Yin, Z., R. Orozco, M. Louboutin, and F. J. Herrmann, 2024, WISE: Full-waveform variational inference via subsurface extensions: Geophysics, 89, A23–A28.

  21. [23] OpenFWI: Large-scale Multi-structural Benchmark Datasets for Full Waveform Inversion. Advances in Neural Information Processing Systems (NeurIPS).
  22. [24] Mora, Carmen B. 2002.
  23. [25] Jin, Peng and Feng, Yinan and Feng, Shihang and Wang, Hanchen and Chen, Yinpeng and Consolvo, Benjamin and Liu, Zicheng and Lin, Youzuo. Scientific Reports. doi:10.1038/s41598-024-20034-0
  24. [26] 2024.
  25. [27] Hallam, Antony. 2023.
  26. [28] Bartel, David C. and Busby, Mark and Nealon, Jeff and Zaske, Joerg. SEG Technical Program Expanded Abstracts. doi:10.1190/1.2369965
  27. [29] Yin, Ziyi and Orozco, Rafael and Louboutin, Mathias and Herrmann, Felix J. Geophysics. doi:10.1190/geo2023-0744.1
  28. [30] UK National Data Repository.
  29. [31] OpendTect Pro & dGB Plugins Documentation - 7.0.
  30. [32] A Machine-Learning Benchmark for Facies Classification. Interpretation. 2019.
  31. [33] Jones, C. E. and Edgar, J. A. and Selvage, J. I. and Crook, H. 74th EAGE Conference and Exhibition Incorporating EUROPEC 2012. doi:10.3997/2214-4609.20148575
  32. [34] Orozco, Rafael and Siahkoohi, Ali and Louboutin, Mathias and Herrmann, Felix J. Inverse Problems. 2025.
  33. [35] Unsupervised Learning of Full-Waveform Inversion: Connecting CNN and Partial Differential Equation in a Loop. International Conference on Learning Representations (ICLR).
  34. [36] The Marmousi experience: Velocity model determination on a synthetic complex data set. The Leading Edge. 1994.
  35. [37] WISER: Multimodal variational inference for full-waveform inversion without dimensionality reduction. Geophysics. 2025. doi:10.1190/geo2024-0483.1
  36. [38] Velocity Model Building from Seismic Images Using a Convolutional Neural Operator. arXiv preprint arXiv:2509.20238. doi:10.48550/arXiv.2509.20238
  37. [39] Radial Basis Function Interpolation and Applications: An Incremental Approach. Latest Trends on Applied Mathematics, Simulation, Modelling. 2017.
  38. [40] Al-Chalabi, M. Developments in Geophysical Exploration Methods—1. 1979.
  39. [41] Generative geostatistical modeling from incomplete well and imaged seismic observations with diffusion models. arXiv preprint arXiv:2406.05136.
  40. [42] Wu, Han and Lu, Shaoping and Dong, Xintong and Deng, Xiaofan. IEEE Transactions on Geoscience and Remote Sensing.
  41. [43] Advancing Geological Carbon Storage Monitoring with 3D Digital Shadow Technology. arXiv preprint arXiv:2502.07169.
  42. [44] Simulation-Based Inference: A Practical Guide. arXiv preprint arXiv:2508.12939. doi:10.48550/arXiv.2508.12939
  43. [45] Donoho, David. 2024.
  44. [46] Seismic Foundation Model (SFM): a new generation deep learning model in geophysics. arXiv preprint arXiv:2309.02791. doi:10.48550/arXiv.2309.02791
  45. [47] An overview of multimethod imaging approaches in environmental geophysics. Advances in Geophysics. 2021.
  46. [48] Building Complex Synthetic Models to Evaluate Acquisition Geometries and Velocity Inversion Technologies. 74th EAGE Conference and Exhibition incorporating EUROPEC 2012.
  47. [49] Mekonnin, Abdisa and Wacławiak, Krzysztof and Humayun, Muhammad and Zhang, Shaowei and Ullah, Habib. Hydrogen Storage Technology, and Its Challenges: A Review. Catalysts.
  48. [50] Unsupervised seismic acoustic impedance inversion based on generative diffusion model. Geophysics. 2025.
  49. [51] Fehler, M. and Cheng, A. SEAM: The SEG Advanced Modeling Project, Phase I.
  50. [52] Consolvo, Benjamin. 2023.
  51. [53] Largest Rectangular Area in a Histogram using Stack. 2025.
  52. [54] Janssen, Volker. Understanding coordinate reference systems, datums and transformations.
  53. [55] Herron, Donald A. 2011.
  54. [56] Kosloff, Dan D. and Sudman, Yonadav. Geophysics. 2002.
  55. [57] Pangman, Peter and Blythe, Natalie. The Leading Edge. 2014.
  56. [58] Seismic Dataset Curation from UK National Data Repository to Validate SAGE and WISE. 2025.