arxiv: 2602.12292 · v1 · pith:2KNE55EPnew · submitted 2026-01-31 · 📡 eess.SP · cs.LG

A Gradient Boosted Mixed-Model Machine Learning Framework for Vessel Speed in the U.S. Arctic

Pith reviewed 2026-05-21 15:10 UTC · model grok-4.3

classification 📡 eess.SP cs.LG
keywords AIS data · vessel speed · Arctic · gradient boosting · machine learning · sea ice concentration · bathymetry · SHAP values

The pith

A two-stage machine learning model distinguishes stationary from moving vessels in the Arctic and shows that distance to coast and bathymetric depth drive both the chance and the speed of movement.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The authors aim to characterize how environmental and operational conditions shape vessel speeds in the U.S. Arctic using ten years of AIS data. Because over half the records show zero speed, they build a two-part model that first predicts whether speed is positive and then estimates the value when it is. Gradient boosted trees with random effects handle the nonlinear effects and repeated measures from the same vessels. The results indicate that proximity to coast and water depth are the primary factors, with the positive-speed classifier reaching an AUC of 0.85 and the conditional speed model explaining about 77 percent of out-of-fold variance. This matters for understanding safe navigation and planning in a region with increasing traffic and changing ice conditions.
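A minimal sketch of the two-part data handling described above, assuming each record is a dictionary with a hypothetical `sog` field. The field names, the helper names, and the final step that multiplies the two stage outputs into an unconditional expected speed are illustrative assumptions, not the authors' code.

```python
# Illustrative two-part (hurdle) data split; field and function names are hypothetical.

def split_two_stage(records, speed_key="sog"):
    """Build the training targets for the two stages.

    Stage 1 uses every record with a binary label (1 if speed > 0, else 0).
    Stage 2 uses only the records with positive speed.
    """
    stage1 = [(r, 1 if r[speed_key] > 0 else 0) for r in records]
    stage2 = [(r, r[speed_key]) for r in records if r[speed_key] > 0]
    return stage1, stage2


def expected_speed(p_positive, speed_if_positive):
    """Combine stage outputs: E[SOG] = P(SOG > 0) * E[SOG | SOG > 0]."""
    return p_positive * speed_if_positive


# Toy records; the covariate name and values are made up.
records = [
    {"sog": 0.0, "dist_coast_km": 0.5},
    {"sog": 0.0, "dist_coast_km": 1.2},
    {"sog": 8.4, "dist_coast_km": 40.0},
    {"sog": 11.1, "dist_coast_km": 75.0},
]
stage1, stage2 = split_two_stage(records)
labels = [label for _, label in stage1]
positive_speeds = [speed for _, speed in stage2]
```

Keeping the stage-two training set restricted to positive speeds is what prevents the mass of zeros from dominating the regression fit.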

Core claim

The paper claims that integrating AIS observations with sea ice, wind, bathymetry, and other covariates in a gradient boosted mixed model framework allows accurate separation of zero and positive speed over ground records, with an AUC of 0.85 for the positive SOG classifier and 77 percent explained variance in the conditional speed model, and that distance to coast and bathymetric depth dominate the predictions as quantified by SHAP values.
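Both headline metrics can be computed from out-of-fold predictions in a few lines. The sketch below assumes the usual definitions (rank-based AUC, and R² as one minus the ratio of residual to total sum of squares); it is not the authors' evaluation script, and the function names and toy data are hypothetical.

```python
# Rank-based AUC and R^2 from out-of-fold predictions (illustrative helpers).

def auc(labels, scores):
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def r_squared(observed, predicted):
    """Share of variance explained: 1 - SS_res / SS_tot."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


# Toy out-of-fold predictions, invented for illustration.
example_auc = auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
example_r2 = r_squared([2.0, 4.0, 6.0, 8.0], [3.0, 4.0, 6.0, 7.0])
```

On these toy inputs the AUC is 0.75 and R² is 0.9. The pairwise loop is quadratic in the number of records, so a sort-based implementation would be needed at AIS scale.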

What carries the argument

Two-stage gradient boosted decision trees with random effects for repeated observations, applied first to classify positive speed and second to regress its magnitude, with SHAP decomposition for variable importance.
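One way to write down the two stages, offered as a sketch rather than the paper's own notation. Here $F_1$ and $F_2$ stand for the boosted tree ensembles, $b_i$ and $c_i$ for random effects of vessel $i$, $g$ for whatever transformation of speed is used as the regression target, and $\phi_j$ for per-covariate SHAP contributions; all of these symbols are assumptions introduced for illustration. The probit link in the first line is suggested by the Figure 9 caption, which reports SHAP values on the probit scale.

```latex
\begin{align}
  \Pr(\mathrm{SOG}_{it} > 0 \mid x_{it}, b_i) &= \Phi\bigl(F_1(x_{it}) + b_i\bigr),
    & b_i &\sim \mathcal{N}(0, \sigma_b^2), \\
  g(\mathrm{SOG}_{it}) \mid \mathrm{SOG}_{it} > 0 &= F_2(x_{it}) + c_i + \varepsilon_{it},
    & c_i &\sim \mathcal{N}(0, \sigma_c^2),\ \varepsilon_{it} \sim \mathcal{N}(0, \sigma^2), \\
  F_k(x) &= \phi_0^{(k)} + \sum_{j=1}^{p} \phi_j^{(k)}(x), & k &\in \{1, 2\}.
\end{align}
```

The third line is the additivity property that SHAP guarantees: the fixed-effect prediction for one record equals a baseline plus one contribution per covariate.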

If this is right

  • Distance to coast and bathymetric depth determine both the likelihood of movement and the magnitude of positive speeds.
  • Changes in course, vessel group, and navigational status add secondary variation to speeds.
  • Wind and sea ice concentration have only modest effects on vessel speeds.
  • The framework supports empirical characterization of operating regimes for speed management and corridor assessment.
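To make the additive decomposition behind the SHAP-based rankings concrete, the sketch below computes exact Shapley values for a toy three-feature function by enumerating feature subsets, with absent features replaced by a baseline value. The toy speed function, the feature names, and the baseline are invented for illustration; a model of the paper's size would need a tree-specific SHAP algorithm rather than brute-force enumeration.

```python
from itertools import combinations
from math import factorial


def shapley_values(f, x, baseline):
    """Exact Shapley values of f at x; absent features take their baseline value."""
    n = len(x)

    def value(subset):
        # Evaluate f with features in `subset` taken from x, the rest from baseline.
        z = [x[j] if j in subset else baseline[j] for j in range(n)]
        return f(z)

    phi = [0.0] * n
    for j in range(n):
        others = [k for k in range(n) if k != j]
        for size in range(n):
            # Shapley weight for coalitions of this size that exclude feature j.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                phi[j] += weight * (value(s | {j}) - value(s))
    return phi


def toy_speed(z):
    """Hypothetical speed model; features are [dist_coast, depth, wind]."""
    dist_coast, depth, wind = z
    return 2.0 * dist_coast + 1.0 * depth + 0.5 * dist_coast * depth - 0.1 * wind


x = [4.0, 2.0, 10.0]
baseline = [0.0, 0.0, 0.0]
phi = shapley_values(toy_speed, x, baseline)
```

For this toy function the contributions are 10 for distance to coast, 4 for depth, and -1 for wind, which sum to the difference between the prediction at `x` and at the baseline (13). The interaction term is split evenly between the two features involved, which is why a ranking by mean absolute SHAP value can credit a covariate for effects it only produces jointly with another.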

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The model could be extended to forecast how vessel speeds might shift as Arctic sea ice declines over time.
  • Similar two-stage approaches may improve speed predictions in other busy shipping areas with AIS coverage.
  • Real-time environmental data feeds could enable dynamic updates to speed estimates for route optimization.

Load-bearing premise

Records showing zero speed over ground represent actual stationary vessels rather than data transmission problems or gaps, and the merged environmental data sets are at a fine enough scale to capture what really affects local speeds.

What would settle it

Independent tracking of a sample of vessels to check what fraction of zero SOG AIS records match actual stationary positions rather than missing data.
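Short of independent tracking, a cheaper robustness check is to treat a random share of the zero-SOG records as missing and re-fit both stages. The sketch below shows only the perturbation step; the 10% and 20% fractions are the ones proposed in the simulated rebuttal further down, while the function name, the use of `None` for missing values, and the toy speeds are assumptions.

```python
import random


def reclassify_zeros_as_missing(speeds, fraction, seed=0):
    """Return a copy of `speeds` with a random `fraction` of the zeros set to None."""
    rng = random.Random(seed)
    zero_idx = [i for i, s in enumerate(speeds) if s == 0]
    n_drop = round(fraction * len(zero_idx))
    dropped = set(rng.sample(zero_idx, n_drop))
    return [None if i in dropped else s for i, s in enumerate(speeds)]


# Toy speed column: ten zero-SOG records followed by five positive ones.
speeds = [0.0] * 10 + [5.0, 7.5, 9.0, 12.0, 3.2]
for fraction in (0.10, 0.20):
    perturbed = reclassify_zeros_as_missing(speeds, fraction)
    kept = [s for s in perturbed if s is not None]
    # Re-fit both stages on `kept` here and compare AUC, R^2, and SHAP rankings.
    print(fraction, len(speeds) - len(kept))
```

Uniform random removal is the simplest choice. If transmission gaps cluster in space or time, removal concentrated in poorly covered areas would be a harsher and arguably more realistic test.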

Figures

Figures reproduced from arXiv: 2602.12292 by Indranil Sahoo, Linda Fernandez, Mauli Pant.

Figure 1: Distribution of environmental and spatial covariates by zero or positive SOG.
Figure 2: Spatial distribution of binary ice-related speed risk across the U.S. Arctic.
Figure 3: ROC curve with AUC (out-of-fold predictions). The close overlap of curves across folds indicates that classifier performance is robust to spatial, temporal, and vessel-level heterogeneity and generalizes well beyond individual training sets. … exceeded the no-information rate (0.531). Sensitivity for detecting SOG > 0 was 0.703, while specificity for identifying SOG = 0 was 0.810, yielding a balanced accura…
Figure 4: Evaluation of the SOG zero or positive SOG classification model using out-of-fold …
Figure 5: Out-of-fold calibration plot for the GPBoost model. Mean observed and pre…
Figure 6: Global mean absolute SHAP values for the zero or positive SOG classification model. Larger values indicate greater average influence on the probability of positive SOG.
Figure 7: SHAP summary for the SOG > 0 classification model, showing the contribution …
Figure 8: SHAP summary for the SOG > 0 classification model, showing the contribution …
Figure 9: Marginal SHAP effects for key predictors in the SOG > 0 classification model. Panels show mean SHAP values on the probit scale across binned covariates for (a) distance to coast, (b) change in course over ground (∆COG), and (c) bathymetric depth. Positive SHAP values indicate an increased probability of positive vessel speed.
Figure 10: Global feature importance for the vessel speed regression model.
Figure 11: Marginal SHAP effects for key spatial and navigational predictors in the SOG > 0 regression model. Panels show mean SHAP values (on the √SOG scale) across binned covariates for (a) change in course over ground (∆COG), (b) bathymetric depth, (c) distance to coast, and (d) an alternative binning of distance to coast. Marker size reflects bin sample size. In contrast, Status 6 (aground), Status 7 (engaged i…
Figure 12: SHAP summary for the SOG > 0 regression model, showing the effect of navigation status on predicted vessel speed.
Figure 13: SHAP summary for the SOG > 0 regression model, showing the effect of vessel group on predicted vessel speed. … substantial increases in predicted SOG. Pilot vessels and tug-tow operations also showed strong positive effects, with mean SHAP values on the order of +0.10 to +0.15. Cargo, fishing, and other general purpose vessels displayed more moderate SOG increases, with mean SHAP values typically between +0…
Original abstract

Understanding how environmental and operational conditions influence vessel speed is crucial for characterizing navigational conditions in the Arctic. We analyzed Automatic Identification System (AIS) data from 2010-2019 to examine vessel speed over ground (SOG). Over half of the AIS records showed zero SOG, and treating zero and positive SOG as a single continuous process can obscure important patterns. We therefore applied a two-stage machine learning framework, first modeling the probability of SOG greater than zero and then modeling SOG conditional on being positive. AIS observations were integrated with sea ice concentration, course over ground, wind, bathymetric depth, distance to coast, vessel group, and navigational status. Gradient boosted decision trees with random effects captured nonlinear environmental responses while accounting for repeated observations. The positive SOG classifier achieved strong discrimination (AUC = 0.85), while the conditional speed model explained approximately 77 percent of out-of-fold variance. SHAP values quantified covariate effects by decomposing model predictions into additive contributions from individual variables. Distance to coast and bathymetric depth were dominant determinants of both the likelihood and magnitude of vessel speed, while changes in course, vessel group, and navigational status introduced secondary variation. Wind and sea ice effects were modest. Together, these results empirically characterize Arctic vessel operating regimes relevant to speed management and corridor-level assessment.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper presents a two-stage gradient boosted mixed-model machine learning framework applied to 2010-2019 AIS data for vessel speed over ground (SOG) in the U.S. Arctic. Over half the records are zero SOG; the authors therefore first model the probability of positive SOG and then regress SOG conditional on it being positive. Covariates include sea ice concentration, wind, bathymetric depth, distance to coast, vessel group, navigational status, and course over ground. Random effects are included to handle repeated observations. The positive-SOG classifier reports AUC = 0.85 and the conditional model explains ~77% of out-of-fold variance. SHAP decomposition identifies distance to coast and bathymetric depth as dominant drivers of both stages, with modest effects from wind and sea ice.

Significance. If the central data-handling assumptions are validated, the work supplies a reproducible, interpretable empirical characterization of Arctic vessel operating regimes that is directly relevant to speed management and corridor assessment. Credit is due for the explicit two-stage separation, out-of-fold validation, inclusion of random effects, and post-hoc SHAP attribution, all of which move beyond black-box prediction toward mechanistic insight.

major comments (2)
  1. [Abstract and §3] Abstract and §3 (Data and Methods): the claim that zero SOG records represent genuine stationary vessels is load-bearing for the two-stage separation. The manuscript states that >50% of records are zero but provides no validation against independent sources (e.g., satellite imagery, port logs, or transmission-gap statistics) nor any sensitivity analysis that reclassifies a plausible fraction of zeros as missing. If a non-negligible share of zeros are transmission artifacts—especially plausible given patchy Arctic satellite coverage—the classifier may be learning data completeness rather than navigational regime, rendering the reported AUC = 0.85 and the SHAP attributions for distance-to-coast, bathymetry, and sea ice confounded.
  2. [§4 and §5] §4 (Results) and §5 (Discussion): the 77% out-of-fold variance and AUC = 0.85 are presented without quantitative assessment of residual spatial or temporal autocorrelation beyond the random effects. AIS observations are clustered by vessel and by geographic cell; if the random-effect structure does not fully capture this dependence, the reported performance metrics and the dominance ordering of SHAP values may be optimistically biased.
minor comments (2)
  1. [§3] §3: specify the exact hyperparameter search procedure and the software implementation (e.g., XGBoost or LightGBM version) used for the gradient-boosted trees.
  2. [Figure 3] Figure 3 (SHAP summary plots): add axis scales and units for the environmental covariates so that the magnitude of effects can be directly compared across variables.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed review of our manuscript. The comments highlight important considerations regarding data interpretation and model validation that we will address in the revision. Below we respond point by point to the major comments.

  1. Referee: [Abstract and §3] Abstract and §3 (Data and Methods): the claim that zero SOG records represent genuine stationary vessels is load-bearing for the two-stage separation. The manuscript states that >50% of records are zero but provides no validation against independent sources (e.g., satellite imagery, port logs, or transmission-gap statistics) nor any sensitivity analysis that reclassifies a plausible fraction of zeros as missing. If a non-negligible share of zeros are transmission artifacts—especially plausible given patchy Arctic satellite coverage—the classifier may be learning data completeness rather than navigational regime, rendering the reported AUC = 0.85 and the SHAP attributions for distance-to-coast, bathymetry, and sea ice confounded.

    Authors: We agree that the interpretation of zero SOG records as stationary vessels is central to the two-stage approach and that the manuscript would benefit from additional robustness checks. While standard practice in AIS analysis treats zero SOG as indicating a stationary vessel (consistent with continued transmission while not under way), we acknowledge the referee's concern about possible transmission artifacts in regions with intermittent satellite coverage. To directly address this, we will add a sensitivity analysis in the revised §3: we will randomly reclassify 10% and 20% of zero-SOG records as missing, re-fit both stages of the model, and report changes in AUC, out-of-fold R², and SHAP rankings. We will also expand the discussion in §5 to note this limitation and the results of the sensitivity checks. These additions will clarify the robustness of the reported performance metrics and covariate attributions. revision: yes

  2. Referee: [§4 and §5] §4 (Results) and §5 (Discussion): the 77% out-of-fold variance and AUC = 0.85 are presented without quantitative assessment of residual spatial or temporal autocorrelation beyond the random effects. AIS observations are clustered by vessel and by geographic cell; if the random-effect structure does not fully capture this dependence, the reported performance metrics and the dominance ordering of SHAP values may be optimistically biased.

    Authors: We recognize that vessel-level random effects alone may not fully capture spatial clustering of observations within geographic cells. In the revised manuscript we will add quantitative residual diagnostics to §4, including (i) temporal autocorrelation functions of residuals stratified by vessel and (ii) spatial variograms or Moran's I statistics computed on residuals aggregated to 0.1° grid cells. These diagnostics will be reported alongside the existing performance metrics. If notable residual dependence remains, we will discuss its implications for the reported AUC and variance-explained values and for the stability of the SHAP ordering, and we will consider whether additional spatial random effects or block cross-validation would be feasible in future extensions. revision: yes
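The spatial diagnostic mentioned in the second response can be illustrated with a small Moran's I computation on cell-averaged residuals. This is a sketch under simplifying assumptions: cells lie along a one-dimensional transect with neighbor weights of 1, and the residual values are invented. A real analysis on 0.1° grid cells would need a two-dimensional neighbor structure and a permutation test for significance.

```python
def morans_i(values, weights):
    """Moran's I for `values` given a spatial weight matrix `weights`."""
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    w_sum = sum(sum(row) for row in weights)
    num = sum(weights[i][j] * dev[i] * dev[j] for i in range(n) for j in range(n))
    den = sum(d * d for d in dev)
    return (n / w_sum) * (num / den)


def line_neighbor_weights(n):
    """Binary weights for n cells on a line: adjacent cells are neighbors."""
    return [[1.0 if abs(i - j) == 1 else 0.0 for j in range(n)] for i in range(n)]


# Hypothetical mean residuals per grid cell along a transect.
clustered = [1.0, 1.0, 1.0, -1.0, -1.0, -1.0]
alternating = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0]
w = line_neighbor_weights(6)
i_clustered = morans_i(clustered, w)
i_alternating = morans_i(alternating, w)
```

The clustered residuals give a positive statistic (0.6) and the alternating ones a negative statistic (-1.0). Values well above zero on the model residuals would indicate spatial dependence that the random effects have not absorbed.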

Circularity Check

0 steps flagged

No significant circularity; standard two-stage ML with out-of-fold evaluation

The paper implements a conventional two-part gradient boosted model (classifier for P(SOG>0) followed by conditional regression) on AIS data merged with environmental covariates. Performance is reported via AUC on the classifier and out-of-fold R² on the speed model, with post-hoc SHAP attribution. No derivation step reduces by construction to a fitted parameter, self-citation chain, or renamed input; the central claims are empirical fits evaluated on held-out folds. The zero-SOG modeling choice is an explicit assumption whose validity is external to the reported metrics and does not create definitional circularity.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 0 invented entities

The central claim rests on standard observational data assumptions and ML modeling choices rather than new axioms or invented entities. No parameter-free derivations or external benchmarks are referenced.

free parameters (2)
  • Gradient boosting hyperparameters
    Tuned during training to achieve reported AUC and variance explained; specific values not stated in abstract.
  • Random effect variance components
    Estimated from repeated vessel observations to account for within-vessel correlation.
axioms (2)
  • domain assumption: AIS records accurately capture true speed over ground without systematic transmission gaps or errors
    Invoked when treating >50% zero SOG records as meaningful stationary events and merging with environmental layers.
  • domain assumption: Environmental covariates are measured at scales relevant to vessel speed decisions
    Required for distance to coast, bathymetry, sea ice, and wind to be treated as direct drivers.
