An Attention-Assisted Multi-Modal Data Fusion Model for Real-Time Estimation of Underwater Sound Velocity
Pith reviewed 2026-05-23 03:00 UTC · model grok-4.3
The pith
A self-attention multimodal CNN estimates real-time underwater sound speed profiles from sea surface temperature and historical data without onsite measurements.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The SA-MDF-CNN fuses remote sensing SST data, principal components of historical SSPs, and spatial coordinates through CNNs for local feature extraction and self-attention for global correlation extraction, producing real-time SSP estimates that avoid any requirement for underwater onsite data collection.
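To make the "global correlation" half of that claim concrete, here is a minimal sketch of scaled dot-product self-attention over three modality tokens. It is not the paper's architecture: the token values, the 4-dimensional embedding size, and the use of identity query/key/value projections (no learned weights) are all assumptions made for illustration.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(tokens):
    """Scaled dot-product self-attention with identity Q/K/V projections.

    tokens: list of equal-length feature vectors, one per modality.
    Returns (fused_tokens, attention_weights).
    """
    dim = len(tokens[0])
    scale = math.sqrt(dim)
    fused, weights = [], []
    for query in tokens:
        # Similarity of this token to every token, itself included.
        scores = [sum(q * k for q, k in zip(query, key)) / scale for key in tokens]
        row = softmax(scores)
        weights.append(row)
        # Each output is a weighted average of all input tokens.
        fused.append([sum(w * value[j] for w, value in zip(row, tokens))
                      for j in range(dim)])
    return fused, weights

# Hypothetical embeddings, assumed to come from upstream CNN branches.
sst_token = [0.9, 0.1, 0.0, 0.2]    # sea surface temperature features
pc_token = [0.8, 0.2, 0.1, 0.1]     # historical SSP principal-component features
coord_token = [0.0, 0.5, 0.9, 0.3]  # spatial-coordinate features

fused, weights = self_attention([sst_token, pc_token, coord_token])
```

Each row of `weights` shows how strongly one modality attends to the others, which is the kind of cross-modal correlation the claim attributes to the attention module.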
What carries the argument
Self-attention embedded multimodal data fusion convolutional neural network (SA-MDF-CNN) that maps SST, historical SSP principal components, and coordinates to current sound speed profiles.
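The summary does not say how the historical SSP principal components are used downstream. One common convention, assumed here, is to express a profile as a mean profile plus a weighted sum of basis vectors, so the network only has to predict a few coefficients. The depths, basis vectors, and coefficients below are invented for illustration.

```python
def reconstruct_ssp(mean_profile, basis, coefficients):
    """Rebuild a profile as mean + sum_k coefficients[k] * basis[k]."""
    if len(basis) != len(coefficients):
        raise ValueError("need one coefficient per basis vector")
    profile = list(mean_profile)
    for coef, vector in zip(coefficients, basis):
        if len(vector) != len(profile):
            raise ValueError("basis vector length must match the profile")
        for i, component in enumerate(vector):
            profile[i] += coef * component
    return profile

# Hypothetical values: sound speed in m/s at four depth levels.
mean_profile = [1540.0, 1520.0, 1500.0, 1490.0]
basis = [
    [0.8, 0.5, 0.3, 0.1],   # first principal component (assumed)
    [0.1, -0.6, 0.7, 0.2],  # second principal component (assumed)
]
coefficients = [5.0, -2.0]  # stand-ins for what a model might predict

estimate = reconstruct_ssp(mean_profile, basis, coefficients)
```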
If this is right
- Underwater communication and positioning systems can operate with continuously updated velocity fields without deploying sensors at the site.
- Estimation becomes feasible in regions where physical access for measurement is restricted or delayed.
- The same input sources enable repeated updates as new SST observations arrive.
- Performance gains in RMSE and robustness appear across the tested state-of-the-art comparisons.
Where Pith is reading between the lines
- The same fusion structure could be retrained on other ocean variables such as salinity or density that also influence acoustic propagation.
- Satellite SST streams could feed an operational pipeline that refreshes SSP estimates at the revisit rate of the sensor.
- Generalization tests in additional ocean basins would reveal whether the learned mapping transfers beyond the original study region.
Load-bearing premise
The relationship between remote sensing SST data, historical SSP primary component characteristics, and spatial coordinates is sufficient for the model to accurately predict current SSP distributions in new task areas without direct measurements.
What would settle it
Direct comparison of the model's predicted SSP values against simultaneous in-situ CTD measurements collected in a previously unseen geographic area, checking whether the resulting RMSE exceeds that of the baseline methods tested in the paper.
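The check described above reduces to a per-profile RMSE comparison. The sketch below assumes the predicted, baseline, and CTD profiles are sampled at the same depth levels; the function names and all numbers are hypothetical.

```python
import math

def rmse(predicted, observed):
    """Root mean square error between two equally sampled profiles."""
    if len(predicted) != len(observed) or not predicted:
        raise ValueError("profiles must be non-empty and the same length")
    squared = sum((p - o) ** 2 for p, o in zip(predicted, observed))
    return math.sqrt(squared / len(predicted))

def model_beats_baselines(model_profile, baseline_profiles, ctd_profile):
    """Return (verdict, model_error, baseline_errors) against in-situ CTD data."""
    model_error = rmse(model_profile, ctd_profile)
    baseline_errors = {name: rmse(profile, ctd_profile)
                       for name, profile in baseline_profiles.items()}
    return model_error < min(baseline_errors.values()), model_error, baseline_errors

# Hypothetical sound speeds in m/s at four depth levels in an unseen area.
ctd = [1542.0, 1525.0, 1501.0, 1490.0]
model = [1543.0, 1524.0, 1501.0, 1491.0]
baselines = {"historical_mean": [1540.0, 1520.0, 1500.0, 1490.0]}

verdict, model_error, baseline_errors = model_beats_baselines(model, baselines, ctd)
```

A single profile settles little on its own; the same comparison would need to be repeated over many simultaneous CTD casts in the unseen area.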
Original abstract
The estimation of underwater sound velocity distribution serves as a critical basis for facilitating effective underwater communication and precise positioning, given that variations in sound velocity influence the path of signal transmission. Conventional techniques for the direct measurement of sound velocity, as well as methods that involve the inversion of sound velocity utilizing acoustic field data, necessitate on--site data collection. This requirement not only places high demands on device deployment, but also presents challenges in achieving real-time estimation of sound velocity distribution. In order to construct a real-time sound velocity field and eliminate the need for underwater onsite data measurement operations, we propose a self-attention embedded multimodal data fusion convolutional neural network (SA-MDF-CNN) for real-time underwater sound speed profile (SSP) estimation. The proposed model seeks to elucidate the inherent relationship between remote sensing sea surface temperature (SST) data, the primary component characteristics of historical SSPs, and their spatial coordinates. This is achieved by employing CNNs and attention mechanisms to extract local and global correlations from the input data, respectively. The ultimate objective is to facilitate a rapid and precise estimation of sound velocity distribution within a specified task area. Experimental results show that the method proposed in this paper has lower root mean square error (RMSE) and stronger robustness than other state-of-the-art methods.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes a self-attention embedded multimodal data fusion convolutional neural network (SA-MDF-CNN) for real-time underwater sound speed profile (SSP) estimation. It fuses remote-sensing sea surface temperature (SST), historical SSP principal components, and spatial coordinates via CNNs for local features and attention for global correlations, with the goal of eliminating on-site acoustic measurements. The central claim is that the model elucidates the inherent relationship among these inputs and delivers lower RMSE plus stronger robustness than state-of-the-art methods on unspecified test data.
Significance. If the performance claims are substantiated with reproducible experiments, the approach would address a practical bottleneck in underwater acoustics by enabling real-time SSP fields from readily available remote-sensing and historical data. This could benefit applications in communication, positioning, and sonar that currently require costly in-situ profiles. The multimodal attention design is a plausible way to capture both local and long-range dependencies, but its value depends on whether the learned mapping generalizes beyond the training distribution.
major comments (3)
- [Abstract / Experimental Results] Abstract and Experimental Results section: the headline claim of lower RMSE and stronger robustness than SOTA methods is presented without any description of the datasets (size, geographic coverage, temporal span), training/validation splits, baseline implementations, error bars, or statistical significance tests. This absence leaves the central empirical assertion, which is load-bearing for the paper's contribution, unverifiable.
- [Method / Experimental Results] Method and Experimental Results sections: the generalization claim (that the mapping from SST + historical SSP PCs + coordinates suffices for accurate SSP prediction in new task areas without local measurements) is not supported by any cross-basin, cross-season, or temporal hold-out experiments. If region-specific oceanographic factors are not captured by the inputs, the reported error reductions will not transfer, directly undermining the real-time estimation objective.
- [§3] §3 (model description): the paper states that the model 'elucidates the inherent relationship' between the three input modalities, yet provides no ablation studies isolating the contribution of each modality or of the self-attention module, leaving open whether the performance gain is due to the architecture or to dataset-specific correlations.
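The hold-out experiments requested in the second major comment depend on splitting samples so that the test set contains only unseen regions or periods. A minimal sketch, assuming each sample carries hypothetical `region` and `year` fields (the paper's actual data layout is not described here):

```python
def holdout_split(samples, test_regions=(), test_years=()):
    """Send every sample from a held-out region or year to the test set."""
    train, test = [], []
    for sample in samples:
        if sample["region"] in test_regions or sample["year"] in test_years:
            test.append(sample)
        else:
            train.append(sample)
    return train, test

# Hypothetical sample index; region names and years are placeholders.
samples = [
    {"region": "basin_A", "year": 2019},
    {"region": "basin_A", "year": 2021},
    {"region": "basin_B", "year": 2019},
    {"region": "basin_B", "year": 2020},
    {"region": "basin_C", "year": 2020},
]

train, test = holdout_split(samples, test_regions={"basin_C"}, test_years={2021})
```

Splitting on whole regions or years, rather than shuffling individual profiles, keeps spatially or temporally adjacent samples from leaking between the two sets.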
minor comments (2)
- [Abstract] Abstract: 'on--site' contains a typographical double dash; standardize to 'on-site'.
- [Introduction] Notation: the abbreviation 'SSP' is introduced, but 'sound velocity' and 'sound speed' are used interchangeably without explicit definition; adopt consistent terminology.
Simulated Author's Rebuttal
We appreciate the referee's detailed feedback, which highlights important areas for improving the clarity and rigor of our experimental validation. Below, we respond to each major comment and outline the revisions we will make to the manuscript.
- Referee: [Abstract / Experimental Results] Abstract and Experimental Results section: the headline claim of lower RMSE and stronger robustness than SOTA methods is presented without any description of the datasets (size, geographic coverage, temporal span), training/validation splits, baseline implementations, error bars, or statistical significance tests. This absence leaves the central empirical assertion, which is load-bearing for the paper's contribution, unverifiable.
  Authors: We acknowledge that the current manuscript lacks sufficient detail on the experimental setup, which is necessary for reproducibility and verification of the claims. In the revised version, we will expand the Experimental Results section to include comprehensive descriptions of the datasets (including size, geographic coverage, and temporal span), the training/validation/test splits, details on how baselines were implemented, error bars (e.g., standard deviations across multiple runs), and results of statistical significance tests (such as paired t-tests or Wilcoxon tests) comparing our method to SOTA approaches.
  revision: yes
- Referee: [Method / Experimental Results] Method and Experimental Results sections: the generalization claim (that the mapping from SST + historical SSP PCs + coordinates suffices for accurate SSP prediction in new task areas without local measurements) is not supported by any cross-basin, cross-season, or temporal hold-out experiments. If region-specific oceanographic factors are not captured by the inputs, the reported error reductions will not transfer, directly undermining the real-time estimation objective.
  Authors: The paper's claim is primarily for estimation within a specified task area using available historical data for that region, rather than claiming universal generalization across all basins without any adaptation. However, to strengthen the evidence, we will include additional experiments using temporal hold-out sets and cross-validation across different seasons within the dataset. We note that the inclusion of historical SSP principal components is intended to capture region-specific characteristics, but we agree that explicit cross-basin tests would further support broader applicability and will discuss this limitation in the revised manuscript.
  revision: partial
- Referee: [§3] §3 (model description): the paper states that the model 'elucidates the inherent relationship' between the three input modalities, yet provides no ablation studies isolating the contribution of each modality or of the self-attention module, leaving open whether the performance gain is due to the architecture or to dataset-specific correlations.
  Authors: We agree that ablation studies are important to demonstrate the contribution of each component. In the revised manuscript, we will add ablation experiments that systematically remove or replace each input modality (SST, historical SSP PCs, coordinates) and the self-attention module, reporting the resulting RMSE changes to quantify their individual impacts.
  revision: yes
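The rebuttal names paired t-tests or Wilcoxon tests for the promised significance analysis. The sketch below uses a different, related test: an exact paired sign-flip permutation test, chosen because it needs only the standard library. The per-run RMSE values are invented, and the function name is hypothetical.

```python
from itertools import product

def paired_sign_flip_pvalue(errors_a, errors_b):
    """Exact two-sided sign-flip permutation test on paired differences.

    Under the null hypothesis the sign of each paired difference is arbitrary,
    so every sign pattern is enumerated (feasible only for small sample sizes).
    """
    if len(errors_a) != len(errors_b) or not errors_a:
        raise ValueError("need two non-empty lists of equal length")
    diffs = [a - b for a, b in zip(errors_a, errors_b)]
    observed = abs(sum(diffs))
    extreme = 0
    total = 0
    for signs in product((1, -1), repeat=len(diffs)):
        total += 1
        flipped = abs(sum(s * d for s, d in zip(signs, diffs)))
        if flipped >= observed - 1e-12:  # small tolerance for float rounding
            extreme += 1
    return extreme / total

# Hypothetical RMSE (m/s) from six paired runs on identical test splits.
proposed = [0.82, 0.79, 0.85, 0.80, 0.83, 0.81]
baseline = [0.95, 0.90, 0.99, 0.93, 0.97, 0.91]

p_value = paired_sign_flip_pvalue(proposed, baseline)
```

With six pairs that all favor one method, the smallest attainable two-sided p-value is 2/64, so more runs would be needed to support a stronger statement.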
Circularity Check
No circularity: standard ML training/evaluation on held-out data with no self-referential reductions
The paper describes a CNN+attention model (SA-MDF-CNN) trained to map remote-sensing SST, historical SSP principal components, and spatial coordinates to SSP estimates. The reported performance (lower RMSE vs SOTA) is obtained by standard supervised training and test-set evaluation; no equations, uniqueness theorems, or fitted parameters are redefined as independent predictions. No self-citations are used to justify core modeling choices, and the derivation chain consists entirely of empirical feature extraction and regression without any reduction of outputs to inputs by construction.
Axiom & Free-Parameter Ledger
free parameters (1)
- network weights and hyperparameters
axioms (1)
- domain assumption: The inherent relationship between SST, historical SSP components, and coordinates can be extracted by CNNs and attention to enable accurate SSP prediction.