arxiv: 2502.12817 · v3 · submitted 2025-02-18 · 📡 eess.SP · cs.SD

An Attention-Assisted Multi-Modal Data Fusion Model for Real-Time Estimation of Underwater Sound Velocity

Pith reviewed 2026-05-23 03:00 UTC · model grok-4.3

classification 📡 eess.SP cs.SD
keywords: underwater sound velocity, sound speed profile, multimodal fusion, self-attention, convolutional neural network, real-time estimation, remote sensing, sea surface temperature

The pith

A self-attention multimodal CNN estimates real-time underwater sound speed profiles from sea surface temperature and historical data without onsite measurements.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes a model to predict the full distribution of underwater sound velocity in a task area by learning relationships between remote sensing sea surface temperature, the main features of past sound speed profiles, and location coordinates. Conventional methods require direct underwater measurements or acoustic inversions that demand equipment deployment and prevent real-time use. The approach applies convolutional layers to capture local patterns in the input data and self-attention to link global relationships across the modalities. If effective, the model removes the need for on-site collection while delivering lower error than prior techniques.

Core claim

The SA-MDF-CNN fuses remote sensing SST data, principal components of historical SSPs, and spatial coordinates through CNNs for local feature extraction and self-attention for global correlation extraction, producing real-time SSP estimates that avoid any requirement for underwater onsite data collection.

What carries the argument

Self-attention embedded multimodal data fusion convolutional neural network (SA-MDF-CNN) that maps SST, historical SSP principal components, and coordinates to current sound speed profiles.
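
The division of labor described here, convolution for local patterns and self-attention for global links, can be illustrated with a small NumPy sketch. This is not the authors' architecture: the layer sizes, the smoothing kernel, the random projection matrices, and the choice of 8 tokens with 6 fused features each are all assumptions made for illustration.

```python
import numpy as np

def conv1d_valid(signal, kernel):
    # Cross-correlation without padding, as in a CNN layer (local patterns).
    return np.correlate(signal, kernel, mode="valid")

def softmax(x, axis=-1):
    shifted = x - x.max(axis=axis, keepdims=True)  # for numerical stability
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(tokens, w_q, w_k, w_v):
    # Scaled dot-product attention: every token attends to every other token.
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])
    weights = softmax(scores, axis=-1)
    return weights @ v, weights

rng = np.random.default_rng(0)
# Hypothetical input: 8 neighboring coordinates, 6 fused features each.
tokens_raw = rng.normal(size=(8, 6))
kernel = np.array([0.25, 0.5, 0.25])  # hypothetical fixed kernel
local = np.stack([conv1d_valid(t, kernel) for t in tokens_raw])  # shape (8, 4)

d = local.shape[1]
w_q, w_k, w_v = (rng.normal(size=(d, d)) for _ in range(3))
out, weights = self_attention(local, w_q, w_k, w_v)
```

Each row of `weights` sums to 1 and shows how strongly one coordinate's features draw on the others; in a trained model the kernel and projection matrices would be learned rather than fixed or random.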

If this is right

  • Underwater communication and positioning systems can operate with continuously updated velocity fields without deploying sensors at the site.
  • Estimation becomes feasible in regions where physical access for measurement is restricted or delayed.
  • The same input sources enable repeated updates as new SST observations arrive.
  • Performance gains in RMSE and robustness appear across the tested state-of-the-art comparisons.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same fusion structure could be retrained on other ocean variables such as salinity or density that also influence acoustic propagation.
  • Satellite SST streams could feed an operational pipeline that refreshes SSP estimates at the revisit rate of the sensor.
  • Generalization tests in additional ocean basins would reveal whether the learned mapping transfers beyond the original study region.

Load-bearing premise

The relationship between remote sensing SST data, historical SSP primary component characteristics, and spatial coordinates is sufficient for the model to accurately predict current SSP distributions in new task areas without direct measurements.
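
The "primary component characteristics" of historical SSPs are read here as principal component scores. The sketch below assumes plain PCA via a singular value decomposition; the page does not give the paper's exact decomposition, and the function names and array shapes are hypothetical.

```python
import numpy as np

def ssp_principal_components(ssps, n_components):
    """Return (mean_profile, components, scores) for historical SSPs.

    ssps: array of shape (n_profiles, n_depths), sound speed in m/s.
    """
    ssps = np.asarray(ssps, dtype=float)
    mean_profile = ssps.mean(axis=0)
    centered = ssps - mean_profile
    # Rows of vt are orthonormal depth modes, ordered by explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    components = vt[:n_components]
    scores = centered @ components.T
    return mean_profile, components, scores

def reconstruct(mean_profile, components, scores):
    # Map low-dimensional scores back to full-depth profiles.
    return mean_profile + scores @ components
```

A handful of scores per location is a far smaller regression target than a full-depth profile, which is one plausible reason to feed components rather than raw profiles into a fusion model.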

What would settle it

Direct comparison of the model's predicted SSP values against simultaneous in-situ CTD measurements collected in a previously unseen geographic area, checking whether the resulting RMSE exceeds that of the baseline methods tested in the paper.
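
The check described above reduces to an RMSE computation on a shared depth grid followed by a comparison against the baselines. A minimal sketch, assuming the predicted and CTD-measured profiles have already been interpolated to the same depths; the function names are hypothetical.

```python
import math

def rmse(predicted, measured):
    """Root mean square error between two profiles on the same depth grid."""
    if len(predicted) != len(measured):
        raise ValueError("profiles must share the same depth grid")
    if not predicted:
        raise ValueError("profiles must not be empty")
    squared = [(p - m) ** 2 for p, m in zip(predicted, measured)]
    return math.sqrt(sum(squared) / len(squared))

def model_beats_baselines(model_rmse, baseline_rmses):
    """True only if the model's RMSE is strictly below every baseline's."""
    return all(model_rmse < b for b in baseline_rmses.values())
```

A result of `False` on data from an unseen region would be the outcome that counts against the paper's claim.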

Figures

Figures reproduced from arXiv: 2502.12817 by Hao Zhang, Pengfei Wu, Wei Huang, Yujie Shi.

Figure 1: SSP Estimation Structure based on SA-MDF-CNN.
Figure 2: The data fusion structure for a single coordinate.
Figure 3: SA-MDF-CNN model.
    The fusion data formed by 8 surrounding coordinates will be the input data $X$, and the center SSP of each sub-grid $\psi$ will be taken as the output label $Y$. If the coordinate of the grid center is $L_Y = L_{n,m}$, then the coordinates of the input data are represented as:
    $$L_X = [L_{n-1,m-1},\ L_{n-1,m},\ L_{n-1,m+1},\ L_{n,m-1},\ L_{n,m+1},\ L_{n+1,m-1},\ L_{n+1,m},\ L_{n+1,m+1}], \quad n \in N,\ m \in M. \tag{1}$$
    For a specific single coordina…
Figure 4: The principle of the multi-head self-attention mech…
Figure 5: Remote sensing SST data.
Figure 6: Fusion data of remote sensing SST data, latitude and l…
Figure 7: Evaluation of real-time SSP estimation outcomes acr…
Figure 8: A comparison example of real-time estimated SSP resu…
Figure 9: Comparison of estimation results of different algor…
Figure 10: Comparison of error distributions of real-time est…
Figure 11: Interpretability analysis of the SA-MDF-CNN model…
Figure 12: Comparison of RMSE convergence of different algorithms. … of SA-MDF-CNN is smoother than that of CNN, which may …
Figure 13: Data sampling locations of ocean experiments.
Figure 14: Evaluation of real-time SSP estimation outcomes ut…
Figure 16: RMSE comparison of real-time estimated SSPs by diffe…
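
The neighborhood construction quoted with Figure 3 (Eq. 1) can be sketched in a few lines: each interior grid node becomes a label, and its 8 surrounding nodes supply the input. The function names are hypothetical, and skipping border nodes is an assumption, since the excerpt does not say how grid edges are handled.

```python
def neighbor_indices(n, m):
    """Return the 8 grid indices around (n, m), in the order of Eq. (1)."""
    return [(n + dn, m + dm)
            for dn in (-1, 0, 1)
            for dm in (-1, 0, 1)
            if (dn, dm) != (0, 0)]

def build_samples(fused, ssp):
    """Pair each interior node's SSP label with its 8 neighbors' fused data.

    fused[n][m]: fused feature vector at grid node (n, m).
    ssp[n][m]:   sound speed profile at grid node (n, m).
    """
    samples = []
    rows, cols = len(fused), len(fused[0])
    # Assumption: border nodes lack a full neighborhood and are skipped.
    for n in range(1, rows - 1):
        for m in range(1, cols - 1):
            x = [fused[i][j] for i, j in neighbor_indices(n, m)]
            samples.append((x, ssp[n][m]))
    return samples
```

For example, `neighbor_indices(1, 1)` yields the eight indices from `(0, 0)` to `(2, 2)` with the center `(1, 1)` left out.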
Original abstract

The estimation of underwater sound velocity distribution serves as a critical basis for facilitating effective underwater communication and precise positioning, given that variations in sound velocity influence the path of signal transmission. Conventional techniques for the direct measurement of sound velocity, as well as methods that involve the inversion of sound velocity utilizing acoustic field data, necessitate on--site data collection. This requirement not only places high demands on device deployment, but also presents challenges in achieving real-time estimation of sound velocity distribution. In order to construct a real-time sound velocity field and eliminate the need for underwater onsite data measurement operations, we propose a self-attention embedded multimodal data fusion convolutional neural network (SA-MDF-CNN) for real-time underwater sound speed profile (SSP) estimation. The proposed model seeks to elucidate the inherent relationship between remote sensing sea surface temperature (SST) data, the primary component characteristics of historical SSPs, and their spatial coordinates. This is achieved by employing CNNs and attention mechanisms to extract local and global correlations from the input data, respectively. The ultimate objective is to facilitate a rapid and precise estimation of sound velocity distribution within a specified task area. Experimental results show that the method proposed in this paper has lower root mean square error (RMSE) and stronger robustness than other state-of-the-art methods.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper proposes a self-attention embedded multimodal data fusion convolutional neural network (SA-MDF-CNN) for real-time underwater sound speed profile (SSP) estimation. It fuses remote-sensing sea surface temperature (SST), historical SSP principal components, and spatial coordinates via CNNs for local features and attention for global correlations, with the goal of eliminating on-site acoustic measurements. The central claim is that the model elucidates the inherent relationship among these inputs and delivers lower RMSE plus stronger robustness than state-of-the-art methods on unspecified test data.

Significance. If the performance claims are substantiated with reproducible experiments, the approach would address a practical bottleneck in underwater acoustics by enabling real-time SSP fields from readily available remote-sensing and historical data. This could benefit applications in communication, positioning, and sonar that currently require costly in-situ profiles. The multimodal attention design is a plausible way to capture both local and long-range dependencies, but its value depends on whether the learned mapping generalizes beyond the training distribution.

major comments (3)
  1. [Abstract / Experimental Results] Abstract and Experimental Results section: the headline claim of lower RMSE and stronger robustness than SOTA methods is presented without any description of the datasets (size, geographic coverage, temporal span), training/validation splits, baseline implementations, error bars, or statistical significance tests. This absence makes the central empirical assertion unverifiable and load-bearing for the paper's contribution.
  2. [Method / Experimental Results] Method and Experimental Results sections: the generalization claim—that the mapping from SST + historical SSP PCs + coordinates suffices for accurate SSP prediction in new task areas without local measurements—is not supported by any cross-basin, cross-season, or temporal hold-out experiments. If region-specific oceanographic factors are not captured by the inputs, the reported error reductions will not transfer, directly undermining the real-time estimation objective.
  3. [§3] §3 (model description): the paper states that the model 'elucidates the inherent relationship' between the three input modalities, yet provides no ablation studies isolating the contribution of each modality or of the self-attention module, leaving open whether the performance gain is due to the architecture or to dataset-specific correlations.
minor comments (2)
  1. [Abstract] Abstract: 'on--site' contains a typographical double dash; standardize to 'on-site'.
  2. [Introduction] Notation: the abbreviation 'SSP' is introduced, but 'sound velocity' and 'sound speed' are used interchangeably without any statement that they refer to the same quantity; adopt consistent terminology.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We appreciate the referee's detailed feedback, which highlights important areas for improving the clarity and rigor of our experimental validation. Below, we respond to each major comment and outline the revisions we will make to the manuscript.

Point-by-point responses
  1. Referee: [Abstract / Experimental Results] Abstract and Experimental Results section: the headline claim of lower RMSE and stronger robustness than SOTA methods is presented without any description of the datasets (size, geographic coverage, temporal span), training/validation splits, baseline implementations, error bars, or statistical significance tests. This absence makes the central empirical assertion unverifiable and load-bearing for the paper's contribution.

    Authors: We acknowledge that the current manuscript lacks sufficient detail on the experimental setup, which is necessary for reproducibility and verification of the claims. In the revised version, we will expand the Experimental Results section to include comprehensive descriptions of the datasets (including size, geographic coverage, and temporal span), the training/validation/test splits, details on how baselines were implemented, error bars (e.g., standard deviations across multiple runs), and results of statistical significance tests (such as paired t-tests or Wilcoxon tests) comparing our method to SOTA approaches. revision: yes

  2. Referee: [Method / Experimental Results] Method and Experimental Results sections: the generalization claim—that the mapping from SST + historical SSP PCs + coordinates suffices for accurate SSP prediction in new task areas without local measurements—is not supported by any cross-basin, cross-season, or temporal hold-out experiments. If region-specific oceanographic factors are not captured by the inputs, the reported error reductions will not transfer, directly undermining the real-time estimation objective.

    Authors: The paper's claim is primarily for estimation within a specified task area using available historical data for that region, rather than claiming universal generalization across all basins without any adaptation. However, to strengthen the evidence, we will include additional experiments using temporal hold-out sets and cross-validation across different seasons within the dataset. We note that the inclusion of historical SSP principal components is intended to capture region-specific characteristics, but we agree that explicit cross-basin tests would further support broader applicability and will discuss this limitation in the revised manuscript. revision: partial

  3. Referee: [§3] §3 (model description): the paper states that the model 'elucidates the inherent relationship' between the three input modalities, yet provides no ablation studies isolating the contribution of each modality or of the self-attention module, leaving open whether the performance gain is due to the architecture or to dataset-specific correlations.

    Authors: We agree that ablation studies are important to demonstrate the contribution of each component. In the revised manuscript, we will add ablation experiments that systematically remove or replace each input modality (SST, historical SSP PCs, coordinates) and the self-attention module, reporting the resulting RMSE changes to quantify their individual impacts. revision: yes
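
The error bars and paired significance test mentioned in the first response could be computed along these lines. This is a sketch using only the Python standard library; it assumes per-run (or per-profile) RMSE values are available as matched pairs for two methods, and the function names are hypothetical. It returns the t statistic and degrees of freedom only; turning them into a p-value would need a t-distribution table or a statistics package.

```python
import math
import statistics

def mean_and_std(values):
    """Mean and sample standard deviation, e.g. for 'RMSE = mean ± std'."""
    return statistics.mean(values), statistics.stdev(values)

def paired_t_statistic(errors_a, errors_b):
    """Paired t statistic and degrees of freedom for matched error samples."""
    if len(errors_a) != len(errors_b):
        raise ValueError("samples must be paired one-to-one")
    diffs = [a - b for a, b in zip(errors_a, errors_b)]
    n = len(diffs)
    if n < 2:
        raise ValueError("need at least two pairs")
    sd = statistics.stdev(diffs)
    if sd == 0:
        raise ValueError("all differences are identical; t is undefined")
    t = statistics.mean(diffs) / (sd / math.sqrt(n))
    return t, n - 1
```

The same pairing applies to the ablation runs promised in the third response: each reduced variant can be compared with the full model on identical test profiles.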

Circularity Check

0 steps flagged

No circularity: standard ML training/evaluation on held-out data with no self-referential reductions

full rationale

The paper describes a CNN+attention model (SA-MDF-CNN) trained to map remote-sensing SST, historical SSP principal components, and spatial coordinates to SSP estimates. The reported performance (lower RMSE vs SOTA) is obtained by standard supervised training and test-set evaluation; no equations, uniqueness theorems, or fitted parameters are redefined as independent predictions. No self-citations are used to justify core modeling choices, and the derivation chain consists entirely of empirical feature extraction and regression without any reduction of outputs to inputs by construction.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The model rests on the domain assumption that SST and historical SSP patterns suffice for prediction; network weights and hyperparameters are fitted parameters but not enumerated in the abstract.

free parameters (1)
  • network weights and hyperparameters
    Standard for any CNN; fitted during training on historical data.
axioms (1)
  • domain assumption: The inherent relationship between SST, historical SSP components, and coordinates can be extracted by CNNs and attention to enable accurate SSP prediction.
    Invoked in the model design and objective described in the abstract.

pith-pipeline@v0.9.0 · 5766 in / 1226 out tokens · 24822 ms · 2026-05-23T03:00:11.908636+00:00 · methodology

