
arxiv: 2605.22837 · v1 · pith:5JXEIXAQnew · submitted 2026-05-10 · ⚛️ physics.geo-ph · cs.LG · eess.SP

Evaluating PhaseNet on Teleseismic Data with MsPASS

Pith reviewed 2026-05-25 00:55 UTC · model grok-4.3

classification ⚛️ physics.geo-ph · cs.LG · eess.SP
keywords PhaseNet · teleseismic P-wave picking · machine learning · seismic phase detection · USArray ANF · model retraining · performance evaluation · MsPASS

The pith

Retraining PhaseNet from scratch on 1.6 million teleseismic picks raises recall by 741.5 percent.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

PhaseNet produces accurate P and S picks on local earthquake signals, but its performance drops sharply on teleseismic signals from distant sources. The authors assembled a control dataset of 1.6 million waveforms linked to analyst P-wave picks from the USArray Array Network Facility and used its training split to retrain the model from scratch. On a held-out test split, the retrained model increased P-pick recall by 741.5 percent relative to the original model and produced 683.9 percent more picks inside a 0.1-second residual window. Tests of larger model variants showed modest further gains in precision and recall but large drops in inference speed, especially on CPUs. A sympathetic reader would care because reliable automated picking on teleseismic data could let networks process far more global earthquake recordings without proportional increases in analyst time.

Core claim

The authors assembled a control dataset of 1.6 million teleseismic waveforms linked to P-wave picks made by analysts at the USArray Array Network Facility. The original PhaseNet model, trained on regional signals, performs poorly on these data. Training PhaseNet from scratch on the training split of the ANF control dataset and evaluating it on a non-overlapping held-out test split increased P-pick recall by 741.5 percent and yielded 683.9 percent more picks within a 0.1 s residual window. Increasing model size by about 120 times improved precision and recall by 15.6 percent and 23.2 percent respectively, but reduced inference throughput by 87.2 percent on an NVIDIA A100 GPU and by 97.3 percent on a 128-core high-performance CPU node.
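The percentage gains are easier to interpret as multiplicative factors. The short Python sketch below does that conversion, assuming that "increased by X percent" means new = old × (1 + X/100); the summary gives no absolute baseline values, so the function and variable names are illustrative only. Under that reading, because recall cannot exceed 1, the baseline recall of the original model on this test split can be at most about 0.119.

```python
def increase_to_factor(percent_increase: float) -> float:
    """Convert a relative increase in percent to a multiplicative factor.

    Assumption: 'increased by X percent' means new = old * (1 + X / 100).
    """
    return 1.0 + percent_increase / 100.0


# Figures quoted in the summary above.
recall_factor = increase_to_factor(741.5)   # recall after / recall before
window_factor = increase_to_factor(683.9)   # picks within 0.1 s, after / before

# Recall is bounded by 1, so the baseline recall is bounded by 1 / factor.
max_baseline_recall = 1.0 / recall_factor

print(f"recall multiplied by {recall_factor:.3f}")
print(f"picks within 0.1 s multiplied by {window_factor:.3f}")
print(f"baseline recall can be at most {max_baseline_recall:.4f}")
```

This bound is a consequence of the arithmetic alone; the actual baseline recall would have to come from the paper's own tables or figures.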

What carries the argument

The ANF control dataset of 1.6 million analyst-labeled teleseismic waveforms, used for supervised retraining of PhaseNet and quantitative before-after evaluation on held-out data.

If this is right

  • Domain-specific retraining on teleseismic data is required for PhaseNet to achieve high recall on distant events.
  • Larger model sizes deliver only modest accuracy gains while sharply lowering throughput.
  • GPUs make scaled PhaseNet models far more practical than high-core-count CPU nodes.
  • Reproducible workflows enable systematic large-scale training and testing on archived seismic data.
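The throughput figures behind the second and third points can be restated as slowdown factors. The sketch below is plain arithmetic on the percentages quoted in the abstract (87.2 percent on an NVIDIA A100 GPU, 97.3 percent on a 128-core CPU node), assuming that a throughput reduction of p percent means the scaled model processes (1 − p/100) times as many waveforms per unit time; the helper name is hypothetical.

```python
def reduction_to_slowdown(percent_reduction: float) -> float:
    """Convert a throughput reduction in percent to a slowdown factor.

    Assumption: a reduction of p percent leaves (1 - p / 100) of the
    original throughput, so the same workload takes 1 / (1 - p / 100)
    times as long.
    """
    remaining = 1.0 - percent_reduction / 100.0
    if remaining <= 0.0:
        raise ValueError("reduction must be below 100 percent")
    return 1.0 / remaining


gpu_slowdown = reduction_to_slowdown(87.2)   # about 7.8 times slower
cpu_slowdown = reduction_to_slowdown(97.3)   # about 37 times slower

print(f"GPU: scaled model is about {gpu_slowdown:.1f}x slower")
print(f"CPU: scaled model is about {cpu_slowdown:.1f}x slower")
print(f"CPU penalty relative to GPU penalty: {cpu_slowdown / gpu_slowdown:.1f}x")
```

Read this way, the roughly 120-fold larger model costs under an order of magnitude in GPU throughput but well over an order of magnitude on the CPU node, which is the basis for the claim that scaling is more practical on GPUs.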

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same control-dataset approach could be applied to adapt other machine-learning phase pickers to teleseismic signals.
  • The ANF dataset could serve as a public benchmark for comparing future teleseismic picking algorithms.
  • Widespread use of retrained models might allow global networks to handle substantially larger data volumes with existing analyst resources.

Load-bearing premise

The 1.6 million teleseismic waveforms labeled by USArray ANF analysts provide sufficiently accurate and unbiased ground-truth P-wave picks for both supervised training and quantitative evaluation of model performance.

What would settle it

Re-evaluation of the retrained model on an independent collection of teleseismic P-picks made by a different analyst group that shows recall gains below 200 percent would falsify the reported improvement from domain-specific training.

Figures

Figures reproduced from arXiv: 2605.22837 by Chenbo Yin, Gary L. Pavlis, Jinxin Ma, Yinzhi Wang.

Figure 1: Schematic illustration of filter-factor scaling in a generic one-dimensional convolutional …
Figure 2: Precision, recall, and F1 score for PhaseNet-NCEDC, PhaseNet-USArray, and PhaseNet …
Figure 3: P-pick residual histograms for PhaseNet-NCEDC, PhaseNet-USArray, and PhaseNet …
Figure 4: Training loss curves for PhaseNet-USArray and PhaseNet-Scale across sequential year …
Original abstract

Numerous studies have shown that the machine-learning picker PhaseNet produces accurate P and S picks on local earthquake signals, but its performance can degrade sharply on teleseismic signals. To address this limitation, we present a reproducible MsPASS workflow that (i) enables scalable data preparation and management for large seismic archives and (ii) supports standardized PhaseNet training and inference. We assembled a control dataset of 1.6 million waveforms linked to teleseismic P-wave picks made by analysts at the USArray Array Network Facility (ANF). The control dataset confirms that the PhaseNet model trained on regional signals performs poorly on these data. We then trained PhaseNet from scratch on the training split of the ANF control dataset and evaluated it on a non-overlapping held-out test split, increasing P-pick recall by 741.5% and yielding 683.9% more picks within a 0.1s residual window. We also evaluated PhaseNet across different model sizes on both CPUs and GPUs. Increasing the model size by about 120 times improved precision and recall by 15.6% and 23.2%, respectively. However, the scaled model reduced inference throughput by 87.2% on an NVIDIA A100 GPU and by 97.3% on a 128-core high-performance CPU node. These results indicate that scaling PhaseNet is more practical on GPUs than on CPUs, and that simply enlarging the model is not an efficient way to achieve large accuracy gains.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript presents a reproducible MsPASS-based workflow for assembling and managing a control dataset of 1.6 million teleseismic waveforms with P-wave picks from USArray ANF analysts. It shows that the original PhaseNet model (trained on regional data) performs poorly on this teleseismic set, then retrains PhaseNet from scratch on a training split and evaluates on a non-overlapping held-out test split, reporting a 741.5% increase in P-pick recall and 683.9% more picks inside a 0.1 s residual window. Additional scaling experiments compare model sizes, showing modest accuracy gains at the cost of substantially reduced inference throughput on both CPU and GPU.

Significance. If the central empirical claims hold after addressing label quality, the work supplies a concrete, scalable example of domain adaptation for ML seismic pickers and demonstrates that retraining on teleseismic data yields far larger gains than simply enlarging the model. The use of a large held-out split and the open MsPASS workflow are strengths that support reproducibility and allow future comparisons.

major comments (2)
  1. [Abstract and Results] Abstract and Results section: The headline metrics (741.5% recall increase and 683.9% more picks within the 0.1 s residual window) are computed by treating ANF analyst picks as exact ground truth for both training and evaluation. The manuscript contains no quantification of timing uncertainty on these teleseismic picks, no cross-check against an independent catalog (e.g., ISC or NEIC), and no sensitivity test varying the residual tolerance. Because teleseismic P onsets are typically emergent and low-SNR, documented analyst uncertainties of 0.2–0.5 s would place a non-negligible fraction of labels outside the 0.1 s acceptance window, undermining the interpretation of relative improvement.
  2. [Methods] Methods (dataset construction): The paper states that the 1.6 M waveforms are linked to ANF analyst picks and that the test split is non-overlapping, but provides no details on how event overlap was prevented (e.g., by event ID, origin time window, or station clustering) or on the distribution of pick quality flags. This information is required to assess whether the reported gains could be inflated by label noise or split leakage.
minor comments (2)
  1. [Abstract] The abstract reports percentage improvements without accompanying absolute numbers (e.g., baseline recall, total picks) or confidence intervals; adding these would improve interpretability.
  2. [Figures and Results] Figure captions and text should explicitly state the exact definition of the 0.1 s residual window (one-sided or two-sided) and whether picks are evaluated only on events that have an ANF label.
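The concern in major comment 1 can be made concrete with a small calculation. The sketch below assumes, purely for illustration, that analyst timing errors are zero-mean Gaussian and that the quoted 0.2–0.5 s uncertainties are standard deviations; neither assumption comes from the paper. It then computes the fraction of labels that would fall within a two-sided 0.1 s window of the true onset.

```python
import math


def fraction_within(tolerance_s: float, sigma_s: float) -> float:
    """Probability that a zero-mean Gaussian error with standard deviation
    sigma_s has absolute value no larger than tolerance_s.

    Assumption: analyst pick errors are Gaussian; real pick errors may be
    skewed (late picks on emergent onsets), so this is only a rough guide.
    """
    return math.erf(tolerance_s / (sigma_s * math.sqrt(2.0)))


for sigma in (0.2, 0.5):
    share = fraction_within(0.1, sigma)
    print(f"sigma = {sigma:.1f} s: {share:.1%} of labels within 0.1 s of the onset")
```

Under these assumptions only about 38 percent (sigma = 0.2 s) or 16 percent (sigma = 0.5 s) of labels would sit within 0.1 s of the true arrival, which is why the referee asks for a tolerance sweep rather than a single threshold.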

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. The comments highlight important considerations regarding label quality and dataset construction for teleseismic picks. We respond to each major comment below, indicating planned revisions where appropriate.

Point-by-point responses
  1. Referee: [Abstract and Results] Abstract and Results section: The headline metrics (741.5% recall increase and 683.9% more picks within the 0.1 s residual window) are computed by treating ANF analyst picks as exact ground truth for both training and evaluation. The manuscript contains no quantification of timing uncertainty on these teleseismic picks, no cross-check against an independent catalog (e.g., ISC or NEIC), and no sensitivity test varying the residual tolerance. Because teleseismic P onsets are typically emergent and low-SNR, documented analyst uncertainties of 0.2–0.5 s would place a non-negligible fraction of labels outside the 0.1 s acceptance window, undermining the interpretation of relative improvement.

    Authors: We agree that ANF analyst picks for teleseismic events have inherent timing uncertainties larger than regional events due to emergent onsets and lower SNR. Since both the baseline and retrained models are evaluated on the identical label set, the relative improvements still demonstrate the value of domain adaptation. In the revised manuscript we will add an explicit discussion of expected teleseismic pick uncertainties (citing literature values of 0.2-0.5 s) and include a sensitivity analysis reporting recall/precision at residual tolerances of 0.1 s, 0.2 s, and 0.5 s. A direct cross-check against ISC/NEIC catalogs is outside the current scope because our MsPASS workflow does not include the required event matching metadata; we will note this as a limitation for future work. revision: partial
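A sensitivity analysis of the kind proposed here could look like the following minimal sketch. It is not the authors' evaluation code: it assumes one analyst label and at most one model pick per waveform, a two-sided residual window, and the convention that a pick outside the window counts as a false positive. The pick times are synthetic values invented for the example.

```python
from typing import Optional, Sequence, Tuple


def precision_recall(
    labels: Sequence[float],
    picks: Sequence[Optional[float]],
    tolerance_s: float,
) -> Tuple[float, float]:
    """Precision and recall for one label and at most one pick per waveform.

    A pick is a true positive when |pick - label| <= tolerance_s (two-sided
    window, an assumption); any other pick is a false positive. Waveforms
    with no pick, or with a pick outside the window, are missed labels.
    """
    true_pos = 0
    false_pos = 0
    for label, pick in zip(labels, picks):
        if pick is None:
            continue
        if abs(pick - label) <= tolerance_s:
            true_pos += 1
        else:
            false_pos += 1
    n_picks = true_pos + false_pos
    precision = true_pos / n_picks if n_picks else 0.0
    recall = true_pos / len(labels) if labels else 0.0
    return precision, recall


# Synthetic example: times in seconds from trace start, None = no pick made.
label_times = [10.0, 20.0, 30.0, 40.0]
model_picks = [10.05, 20.15, None, 40.4]

for tol in (0.1, 0.2, 0.5):
    p, r = precision_recall(label_times, model_picks, tol)
    print(f"tolerance {tol:.1f} s: precision {p:.2f}, recall {r:.2f}")
```

Reporting both models at each tolerance would show whether the relative improvement is stable or depends strongly on the 0.1 s threshold.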

  2. Referee: [Methods] Methods (dataset construction): The paper states that the 1.6 M waveforms are linked to ANF analyst picks and that the test split is non-overlapping, but provides no details on how event overlap was prevented (e.g., by event ID, origin time window, or station clustering) or on the distribution of pick quality flags. This information is required to assess whether the reported gains could be inflated by label noise or split leakage.

    Authors: The held-out test split was formed by partitioning on unique event IDs from the ANF catalog so that no event appears in both training and test sets; this eliminates event-level leakage. We will revise the Methods section to document this event-ID-based partitioning procedure and to report the distribution of ANF pick quality flags (e.g., fraction labeled high-quality). revision: yes
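One way to implement an event-ID-based partition is sketched below. The rebuttal says only that the split was made on unique event IDs; the hashing scheme, the 20 percent test fraction, and the record field names (`event_id`, `station`) are assumptions chosen for illustration, not details from the paper or from MsPASS.

```python
import hashlib


def assign_split(event_id: str, test_fraction: float = 0.2) -> str:
    """Deterministically map an event ID to 'train' or 'test'.

    Hashing the event ID (an illustrative choice) guarantees that every
    waveform from the same event lands on the same side of the split.
    """
    digest = hashlib.sha256(event_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return "test" if bucket < test_fraction else "train"


# Synthetic records: 500 waveforms drawn from 50 hypothetical events.
records = [{"event_id": f"evt{i % 50}", "station": f"STA{i}"} for i in range(500)]

splits = {"train": [], "test": []}
for rec in records:
    splits[assign_split(rec["event_id"])].append(rec)

train_events = {rec["event_id"] for rec in splits["train"]}
test_events = {rec["event_id"] for rec in splits["test"]}

# Leakage check: no event may contribute waveforms to both splits.
assert train_events.isdisjoint(test_events)
print(len(splits["train"]), "training waveforms,", len(splits["test"]), "test waveforms")
```

A split like this removes event-level leakage but not station-level overlap, so the same stations can still appear in both sets; whether that matters for the reported gains is a separate question from the one the referee raised.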

Circularity Check

0 steps flagged

No circularity: purely empirical train/test evaluation on held-out data

Full rationale

The paper reports an empirical ML experiment: assemble 1.6M teleseismic waveforms with ANF analyst labels, train PhaseNet from scratch on a training split, evaluate recall/precision on a non-overlapping test split, and compare model sizes on CPU/GPU. No derivations, no first-principles predictions, no fitted parameters renamed as independent results, and no self-citation chains are invoked to justify any claim. The performance numbers are direct measurements against the held-out labels; they do not reduce to the inputs by construction. This matches the default case of a self-contained empirical study with no load-bearing circular steps.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central performance claims rest on the quality of analyst-generated labels and standard supervised-learning assumptions about data splits; no new physical entities or ad-hoc constants are introduced.

axioms (1)
  • domain assumption Analyst picks at the USArray ANF constitute reliable ground-truth labels for teleseismic P waves
    These labels are used both to train the model and to compute all reported recall and residual metrics.

pith-pipeline@v0.9.0 · 5811 in / 1402 out tokens · 34067 ms · 2026-05-25T00:55:56.319863+00:00 · methodology

