A Gradient Boosted Mixed-Model Machine Learning Framework for Vessel Speed in the U.S. Arctic
Pith reviewed 2026-05-21 15:10 UTC · model grok-4.3
The pith
A two-stage machine learning model distinguishes stationary from moving vessels in the Arctic and shows that distance to coast and bathymetric depth drive both the chance and the speed of movement.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper claims that integrating AIS observations with sea ice, wind, bathymetry, and other covariates in a gradient boosted mixed model framework allows accurate separation of zero and positive speed over ground records, with an AUC of 0.85 for the positive SOG classifier and 77 percent explained variance in the conditional speed model, and that distance to coast and bathymetric depth dominate the predictions as quantified by SHAP values.
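The two headline metrics can be made concrete with a small sketch. It assumes the 77 percent figure is the usual coefficient of determination (the rebuttal refers to it as out-of-fold R²) and that AUC is the standard rank-based statistic; the function names and toy numbers are illustrative and do not come from the paper.

```python
def auc(labels, scores):
    """Area under the ROC curve via pairwise ranking (Mann-Whitney form).

    labels: 1 for positive SOG, 0 for zero SOG.
    scores: predicted probability of positive SOG.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative record")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0      # positive ranked above negative
            elif p == n:
                wins += 0.5      # ties count as half
    return wins / (len(pos) * len(neg))


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    ss_tot = sum((y - mean) ** 2 for y in y_true)
    if ss_tot == 0:
        raise ValueError("y_true has zero variance")
    return 1.0 - ss_res / ss_tot


# Toy values only (not data from the paper).
toy_auc = auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
toy_r2 = r_squared([2.0, 4.0, 6.0, 8.0], [2.5, 3.5, 6.5, 7.5])
```

An AUC of 0.85 then reads as: a randomly chosen moving record receives a higher score than a randomly chosen stationary record about 85 percent of the time.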
What carries the argument
Two-stage gradient boosted decision trees with random effects for repeated observations, applied first to classify positive speed and second to regress its magnitude, with SHAP decomposition for variable importance.
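A minimal sketch of the two-stage structure, assuming each AIS record is a dictionary with a hypothetical `sog` field in knots. The split into a binary label and a positive-only subset follows the description above. The final function uses the standard hurdle-model identity E[SOG] = P(SOG > 0) × E[SOG | SOG > 0]; the paper is not stated to combine the stages this way, so treat that step as an illustration only.

```python
def split_for_two_stages(records):
    """Prepare targets for the two stages.

    Returns (moving_labels, positive_records):
      moving_labels    - 1 if SOG > 0 else 0, one per record (stage-one target)
      positive_records - only the records with SOG > 0 (stage-two training set)
    """
    moving_labels = [1 if r["sog"] > 0 else 0 for r in records]
    positive_records = [r for r in records if r["sog"] > 0]
    return moving_labels, positive_records


def expected_sog(p_moving, sog_given_moving):
    """Unconditional expected speed from the two stage outputs.

    p_moving          - stage-one probability that SOG > 0
    sog_given_moving  - stage-two prediction of SOG given SOG > 0
    """
    if not 0.0 <= p_moving <= 1.0:
        raise ValueError("p_moving must be a probability")
    if sog_given_moving < 0:
        raise ValueError("conditional speed must be non-negative")
    return p_moving * sog_given_moving


# Toy records (hypothetical values, not from the paper).
records = [{"sog": 0.0}, {"sog": 0.0}, {"sog": 5.0}, {"sog": 12.5}]
labels, moving = split_for_two_stages(records)
```

Keeping the stages separate means the regression stage never sees zeros, so its fit is not pulled toward zero by the stationary records.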
If this is right
- Distance to coast and bathymetric depth determine both the likelihood of movement and the magnitude of positive speeds.
- Changes in course, vessel group, and navigational status add secondary variation to speeds.
- Wind and sea ice concentration have only modest effects on vessel speeds.
- The framework supports empirical characterization of operating regimes for speed management and corridor assessment.
Where Pith is reading between the lines
- The model could be extended to forecast how vessel speeds might shift as Arctic sea ice declines over time.
- Similar two-stage approaches may improve speed predictions in other busy shipping areas with AIS coverage.
- Real-time environmental data feeds could enable dynamic updates to speed estimates for route optimization.
Load-bearing premise
Records showing zero speed over ground represent actual stationary vessels rather than data transmission problems or gaps, and the merged environmental data sets are at a fine enough scale to capture what really affects local speeds.
What would settle it
Independent tracking of a sample of vessels to check what fraction of zero SOG AIS records match actual stationary positions rather than missing data.
Original abstract
Understanding how environmental and operational conditions influence vessel speed is crucial for characterizing navigational conditions in the Arctic. We analyzed Automatic Identification System (AIS) data from 2010-2019 to examine vessel speed over ground (SOG). Over half of the AIS records showed zero SOG, and treating zero and positive SOG as a single continuous process can obscure important patterns. We therefore applied a two-stage machine learning framework, first modeling the probability of SOG greater than zero and then modeling SOG conditional on being positive. AIS observations were integrated with sea ice concentration, course over ground, wind, bathymetric depth, distance to coast, vessel group, and navigational status. Gradient boosted decision trees with random effects captured nonlinear environmental responses while accounting for repeated observations. The positive SOG classifier achieved strong discrimination (AUC = 0.85), while the conditional speed model explained approximately 77 percent of out-of-fold variance. SHAP values quantified covariate effects by decomposing model predictions into additive contributions from individual variables. Distance to coast and bathymetric depth were dominant determinants of both the likelihood and magnitude of vessel speed, while changes in course, vessel group, and navigational status introduced secondary variation. Wind and sea ice effects were modest. Together, these results empirically characterize Arctic vessel operating regimes relevant to speed management and corridor-level assessment.
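The additive decomposition behind SHAP values can be illustrated with exact Shapley values on a toy model. This sketch assumes a single fixed baseline record rather than the background-data expectation that SHAP implementations typically use, and the model, feature names, and coefficients are invented for illustration; none of them come from the paper.

```python
from itertools import permutations


def shapley_values(model, x, baseline):
    """Exact Shapley values of model(x) relative to model(baseline).

    Averages each feature's marginal contribution over every ordering in
    which features are switched from their baseline value to their value in x.
    Cost grows factorially, so this is only practical for a few features.
    """
    names = list(x)
    totals = {name: 0.0 for name in names}
    orderings = list(permutations(names))
    for order in orderings:
        current = dict(baseline)
        previous = model(current)
        for name in order:
            current[name] = x[name]
            value = model(current)
            totals[name] += value - previous
            previous = value
    return {name: total / len(orderings) for name, total in totals.items()}


def toy_speed(f):
    """Hypothetical speed model in knots, with one interaction term."""
    return (1.0
            + 0.1 * f["dist_coast_km"]
            + 0.02 * f["depth_m"]
            + 0.001 * f["dist_coast_km"] * f["depth_m"]
            - 2.0 * f["ice_conc"])


x = {"dist_coast_km": 20.0, "depth_m": 50.0, "ice_conc": 0.5}
baseline = {"dist_coast_km": 0.0, "depth_m": 0.0, "ice_conc": 0.0}
phi = shapley_values(toy_speed, x, baseline)
# Additivity: baseline prediction plus all contributions recovers the prediction.
reconstructed = toy_speed(baseline) + sum(phi.values())
```

In this toy case the interaction between distance to coast and depth is shared equally between the two features, which is how a Shapley-style attribution can rank both as dominant even when their effects are entangled.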
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents a two-stage gradient boosted mixed-model machine learning framework applied to 2010-2019 AIS data for vessel speed over ground (SOG) in the U.S. Arctic. Over half the records are zero SOG; the authors therefore first model the probability of positive SOG and then regress SOG conditional on it being positive. Covariates include sea ice concentration, wind, bathymetric depth, distance to coast, vessel group, navigational status, and course over ground. Random effects are included to handle repeated observations. The positive-SOG classifier reports AUC = 0.85 and the conditional model explains ~77% out-of-fold variance. SHAP decomposition identifies distance to coast and bathymetric depth as dominant drivers of both stages, with modest effects from wind and sea ice.
Significance. If the central data-handling assumptions are validated, the work supplies a reproducible, interpretable empirical characterization of Arctic vessel operating regimes that is directly relevant to speed management and corridor assessment. Credit is due for the explicit two-stage separation, out-of-fold validation, inclusion of random effects, and post-hoc SHAP attribution, all of which move beyond black-box prediction toward mechanistic insight.
major comments (2)
- [Abstract and §3] Abstract and §3 (Data and Methods): the claim that zero SOG records represent genuine stationary vessels is load-bearing for the two-stage separation. The manuscript states that >50% of records are zero but provides no validation against independent sources (e.g., satellite imagery, port logs, or transmission-gap statistics) nor any sensitivity analysis that reclassifies a plausible fraction of zeros as missing. If a non-negligible share of zeros are transmission artifacts—especially plausible given patchy Arctic satellite coverage—the classifier may be learning data completeness rather than navigational regime, rendering the reported AUC = 0.85 and the SHAP attributions for distance-to-coast, bathymetry, and sea ice confounded.
- [§4 and §5] §4 (Results) and §5 (Discussion): the 77% out-of-fold variance and AUC = 0.85 are presented without quantitative assessment of residual spatial or temporal autocorrelation beyond the random effects. AIS observations are clustered by vessel and by geographic cell; if the random-effect structure does not fully capture this dependence, the reported performance metrics and the dominance ordering of SHAP values may be optimistically biased.
minor comments (2)
- [§3] §3: specify the exact hyperparameter search procedure and the software implementation (e.g., XGBoost or LightGBM version) used for the gradient-boosted trees.
- [Figure 3] Figure 3 (SHAP summary plots): add axis scales and units for the environmental covariates so that the magnitude of effects can be directly compared across variables.
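The residual spatial autocorrelation raised in the second major comment could be checked with Moran's I. The sketch below assumes residuals have already been averaged onto a complete rectangular grid and uses rook (edge-sharing) neighbours with unit weights; both choices are assumptions for illustration, not details taken from the manuscript.

```python
def morans_i(grid):
    """Moran's I for a rectangular grid of residuals with rook adjacency.

    I = (N / W) * sum_ij w_ij (x_i - m)(x_j - m) / sum_i (x_i - m)^2
    where w_ij = 1 for cells sharing an edge and W is the sum of all weights.
    """
    rows, cols = len(grid), len(grid[0])
    values = [v for row in grid for v in row]
    n = len(values)
    mean = sum(values) / n
    denom = sum((v - mean) ** 2 for v in values)
    if denom == 0:
        raise ValueError("residuals have zero variance")
    num = 0.0
    weight_total = 0
    for r in range(rows):
        for c in range(cols):
            for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                rr, cc = r + dr, c + dc
                if 0 <= rr < rows and 0 <= cc < cols:
                    num += (grid[r][c] - mean) * (grid[rr][cc] - mean)
                    weight_total += 1
    if weight_total == 0:
        raise ValueError("grid has no neighbouring cells")
    return (n / weight_total) * (num / denom)


# Toy residual grids: one spatially clustered, one alternating.
clustered = [[1.0, 1.0, -1.0, -1.0] for _ in range(4)]
checkerboard = [[1.0 if (r + c) % 2 == 0 else -1.0 for c in range(4)]
                for r in range(4)]
```

Values near zero suggest little remaining spatial structure; clearly positive values, as in the clustered grid, would indicate that neighbouring cells share unexplained error.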
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed review of our manuscript. The comments highlight important considerations regarding data interpretation and model validation that we will address in the revision. Below we respond point by point to the major comments.
Point-by-point responses
-
Referee: [Abstract and §3] Abstract and §3 (Data and Methods): the claim that zero SOG records represent genuine stationary vessels is load-bearing for the two-stage separation. The manuscript states that >50% of records are zero but provides no validation against independent sources (e.g., satellite imagery, port logs, or transmission-gap statistics) nor any sensitivity analysis that reclassifies a plausible fraction of zeros as missing. If a non-negligible share of zeros are transmission artifacts—especially plausible given patchy Arctic satellite coverage—the classifier may be learning data completeness rather than navigational regime, rendering the reported AUC = 0.85 and the SHAP attributions for distance-to-coast, bathymetry, and sea ice confounded.
Authors: We agree that the interpretation of zero SOG records as stationary vessels is central to the two-stage approach and that the manuscript would benefit from additional robustness checks. While standard practice in AIS analysis treats zero SOG as indicating a stationary vessel (consistent with continued transmission while not under way), we acknowledge the referee's concern about possible transmission artifacts in regions with intermittent satellite coverage. To directly address this, we will add a sensitivity analysis in the revised §3: we will randomly reclassify 10% and 20% of zero-SOG records as missing, re-fit both stages of the model, and report changes in AUC, out-of-fold R², and SHAP rankings. We will also expand the discussion in §5 to note this limitation and the results of the sensitivity checks. These additions will clarify the robustness of the reported performance metrics and covariate attributions.
Revision: yes
-
Referee: [§4 and §5] §4 (Results) and §5 (Discussion): the 77% out-of-fold variance and AUC = 0.85 are presented without quantitative assessment of residual spatial or temporal autocorrelation beyond the random effects. AIS observations are clustered by vessel and by geographic cell; if the random-effect structure does not fully capture this dependence, the reported performance metrics and the dominance ordering of SHAP values may be optimistically biased.
Authors: We recognize that vessel-level random effects alone may not fully capture spatial clustering of observations within geographic cells. In the revised manuscript we will add quantitative residual diagnostics to §4, including (i) temporal autocorrelation functions of residuals stratified by vessel and (ii) spatial variograms or Moran's I statistics computed on residuals aggregated to 0.1° grid cells. These diagnostics will be reported alongside the existing performance metrics. If notable residual dependence remains, we will discuss its implications for the reported AUC and variance-explained values and for the stability of the SHAP ordering, and we will consider whether additional spatial random effects or block cross-validation would be feasible in future extensions.
Revision: yes
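The sensitivity check proposed in the first response can be sketched as follows. This assumes speeds are held in a plain list and that "missing" is represented by `None`; the function name and the toy data are hypothetical, and the model refit that would follow each reclassification is deliberately left out.

```python
import random


def reclassify_zeros_as_missing(sog_values, fraction, seed=0):
    """Return a copy of sog_values with a random share of zeros set to None.

    sog_values - speeds over ground; exact zeros are the candidates
    fraction   - share of zero-SOG records to mark as missing (0.0 to 1.0)
    seed       - fixed seed so the perturbation is reproducible
    """
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    rng = random.Random(seed)
    zero_idx = [i for i, v in enumerate(sog_values) if v == 0]
    n_drop = int(round(fraction * len(zero_idx)))
    dropped = set(rng.sample(zero_idx, n_drop))
    return [None if i in dropped else v for i, v in enumerate(sog_values)]


# Toy data: ten stationary and ten moving records (not from the paper).
sog = [0.0] * 10 + [5.0] * 10
scenarios = {}
for fraction in (0.10, 0.20):
    perturbed = reclassify_zeros_as_missing(sog, fraction, seed=1)
    kept = [v for v in perturbed if v is not None]
    # In a full analysis both model stages would be refit on `kept` here.
    scenarios[fraction] = kept
```

Uniform random reclassification is the simplest scenario. If transmission artifacts concentrate in particular regions, a spatially weighted draw would be a more demanding test, though the rebuttal does not commit to one.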
Circularity Check
No significant circularity; standard two-stage ML with out-of-fold evaluation
Full rationale
The paper implements a conventional two-part gradient boosted model (classifier for P(SOG>0) followed by conditional regression) on AIS data merged with environmental covariates. Performance is reported via AUC on the classifier and out-of-fold R² on the speed model, with post-hoc SHAP attribution. No derivation step reduces by construction to a fitted parameter, self-citation chain, or renamed input; the central claims are empirical fits evaluated on held-out folds. The zero-SOG modeling choice is an explicit assumption whose validity is external to the reported metrics and does not create definitional circularity.
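Whether held-out folds are truly independent depends on how records are assigned to them. The sketch below shows one way to keep every record from a vessel in the same fold, so that a vessel never appears in both training and evaluation data. The source does not say how the paper's folds were built, so this is an assumed design, and the greedy balancing rule is a simple illustrative choice.

```python
def assign_group_folds(vessel_ids, n_folds):
    """Assign each record a fold index such that each vessel stays in one fold.

    Vessels are placed largest first into whichever fold currently holds the
    fewest records, which keeps fold sizes roughly balanced.
    """
    if n_folds < 2:
        raise ValueError("need at least two folds")
    counts = {}
    for vessel in vessel_ids:
        counts[vessel] = counts.get(vessel, 0) + 1
    fold_sizes = [0] * n_folds
    fold_of = {}
    # Sort by descending record count, then by ID for a deterministic result.
    for vessel, count in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        target = fold_sizes.index(min(fold_sizes))
        fold_of[vessel] = target
        fold_sizes[target] += count
    return [fold_of[v] for v in vessel_ids]


# Hypothetical vessel identifiers, one per AIS record.
vessel_ids = ["A", "A", "A", "B", "B", "C", "C", "D"]
folds = assign_group_folds(vessel_ids, n_folds=2)
```

The same idea extends to spatial blocks by grouping on a grid-cell identifier instead of a vessel identifier.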
Axiom & Free-Parameter Ledger
free parameters (2)
- Gradient boosting hyperparameters
- Random effect variance components
axioms (2)
- domain assumption: AIS records accurately capture true speed over ground without systematic transmission gaps or errors
- domain assumption: Environmental covariates are measured at scales relevant to vessel speed decisions
Lean theorems connected to this paper
-
IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean reality_from_one_distinction (tag: unclear)
Relation between the paper passage and the cited Recognition theorem: unclear.
We therefore applied a two-stage machine learning framework, first modeling the probability of SOG greater than zero and then modeling SOG conditional on being positive... Gradient boosted decision trees with random effects... SHAP values quantified covariate effects
-
IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel (tag: unclear)
Relation between the paper passage and the cited Recognition theorem: unclear.
Distance to coast and bathymetric depth were dominant determinants of both the likelihood and magnitude of vessel speed
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.