A Multimodal Data Fusion Attention-Empowered Generative Adversarial Network for Real Time 3D Underwater Sound Speed Field Construction

Fang Ji; Hao Zhang; Jixuan Zhou; Qian Sun; Tianhe Xu; Wei Huang; Yuqiang Huang

arXiv: 2507.11812 · v4 · submitted 2025-07-16 · cs.SD · eess.AS · eess.SP

Wei Huang, Yuqiang Huang, Jixuan Zhou, Hao Zhang, Tianhe Xu, Qian Sun, Fang Ji

Pith reviewed 2026-05-19 05:08 UTC · model grok-4.3

classification cs.SD · eess.AS · eess.SP

keywords sound speed profile · generative adversarial network · multimodal fusion · underwater acoustics · 3D reconstruction · attention mechanism · residual attention · sea surface temperature

The pith

A generative adversarial network fused with multimodal surface data reconstructs 3D underwater sound speed fields to within 0.3 m/s error without any underwater sonar readings.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a model called MDF-RAGAN that uses sea surface temperature and other surface observations to build accurate three-dimensional maps of sound speed underwater. It does this by combining generative adversarial networks with attention mechanisms to spot spatial patterns and residual blocks to detect small changes in sound speed caused by surface conditions. If successful, this approach would let researchers and engineers get high-quality sound speed data for acoustic systems without deploying expensive underwater equipment. On a public real-world dataset it outperforms a convolutional neural network and spatial interpolation, cutting RMSE by nearly half. This matters for applications like underwater communication and positioning that depend on knowing how sound travels through water.

Core claim

The MDF-RAGAN architecture integrates multimodal data fusion with residual attention blocks to capture global spatial correlations and extract subtle deep-ocean sound velocity perturbations from sea surface temperature variations, enabling accurate 3D sound speed field reconstruction solely from surface observations.

What carries the argument

Multimodal data-fusion generative adversarial network enhanced with residual attention blocks (MDF-RAGAN), which uses attention to capture spatial features and residuals to model perturbations from surface data.
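
To make the mechanism concrete, here is a minimal sketch of one residual self-attention step: Q, K, V projections, scaled dot-product attention, and a skip connection that adds the attended signal back onto the input. It is a generic textbook block in plain Python, not the authors' MDF-RAGAN layer. The function names, the toy feature rows, and the identity projection matrices are all assumptions chosen for illustration.

```python
import math


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    peak = max(row)
    exps = [math.exp(value - peak) for value in row]
    total = sum(exps)
    return [e / total for e in exps]


def residual_attention(x, w_q, w_k, w_v):
    """Scaled dot-product self-attention with a residual (skip) connection.

    x is a list of feature rows; w_v must map back to the width of x so the
    attended signal can be added onto the input.
    """
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    scale = math.sqrt(len(k[0]))
    scores = matmul(q, transpose(k))
    weights = [softmax([s / scale for s in row]) for row in scores]
    attended = matmul(weights, v)
    output = [[xi + ai for xi, ai in zip(x_row, a_row)]
              for x_row, a_row in zip(x, attended)]
    return output, weights


# Hypothetical toy input: three positions with two features each.
toy_x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
zeros = [[0.0, 0.0], [0.0, 0.0]]

toy_out, toy_weights = residual_attention(toy_x, identity, identity, identity)
passthrough_out, _ = residual_attention(toy_x, identity, identity, zeros)
```

With a zero value projection the block returns its input unchanged, which shows the point of the skip connection: the attention path only has to learn a correction on top of what is already there. That is the usual argument for pairing residuals with the modelling of small perturbations.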

If this is right

Sound speed profiles can be reconstructed in real time for underwater acoustic applications without on-site measurements.
The model achieves estimation errors below 0.3 m/s on a public real-world dataset.
It reduces RMSE by nearly half compared to CNN and spatial interpolation methods.
It provides a 65.8% RMSE reduction over the mean profile method.
Multi-source fusion and cross-modal attention improve accuracy and robustness of sound speed reconstruction.
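
The percentages above are relative RMSE reductions, so a 65.8% reduction means the model's RMSE is 0.342 times the baseline's. The short sketch below shows that arithmetic. The sound speed values are synthetic numbers invented for illustration, not data from the paper, and the function names are assumptions.

```python
import math


def rmse(predicted, observed):
    """Root mean square error between two equal-length sequences."""
    if len(predicted) != len(observed):
        raise ValueError("sequences must have the same length")
    squared = [(p - o) ** 2 for p, o in zip(predicted, observed)]
    return math.sqrt(sum(squared) / len(squared))


def relative_reduction(baseline_rmse, model_rmse):
    """Fraction by which the model lowers RMSE relative to a baseline."""
    return (baseline_rmse - model_rmse) / baseline_rmse


def implied_model_rmse(baseline_rmse, reduction):
    """Model RMSE implied by a stated relative reduction."""
    return baseline_rmse * (1.0 - reduction)


# Synthetic sound speeds in m/s (illustrative only).
observed = [1500.0, 1495.0, 1490.0, 1488.0]
baseline_pred = [1501.0, 1494.0, 1491.0, 1487.0]  # every error is 1.0 m/s
model_pred = [1500.5, 1494.5, 1490.5, 1487.5]     # every error is 0.5 m/s

baseline_rmse = rmse(baseline_pred, observed)
model_rmse = rmse(model_pred, observed)
halving = relative_reduction(baseline_rmse, model_rmse)
implied = implied_model_rmse(1.0, 0.658)
```

In this toy case the model halves the baseline error, which is the size of improvement reported against CNN and spatial interpolation. The last line shows that a baseline RMSE of 1.0 m/s combined with a 65.8% reduction would imply a model RMSE of about 0.342 m/s.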

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar surface-to-depth fusion techniques might extend to reconstructing other ocean properties like temperature or salinity profiles.
Integrating additional surface sensors such as salinity or wind data could further refine the velocity estimates in varying conditions.
Deployment on autonomous surface vehicles could enable continuous monitoring of sound speed fields over large areas.
The approach may reduce costs for marine acoustic surveys by minimizing reliance on submerged sensors.

Load-bearing premise

Sea surface temperature variations and other multimodal surface observations can capture the subtle changes in deep ocean sound velocity well enough for accurate reconstruction.
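
For background on why temperature is the natural surface proxy, the sketch below evaluates Medwin's simplified empirical sound speed formula, which is a standard oceanographic approximation and is not taken from the paper. It shows only the local dependence of sound speed on temperature at a given point. It says nothing about whether surface temperature predicts structure at depth, which is exactly the premise in question. The function name and the sample values are assumptions.

```python
def sound_speed_medwin(temp_c, salinity_psu, depth_m):
    """Medwin's simplified formula for sound speed in seawater (m/s).

    Intended for roughly 0-35 degrees C, 0-45 PSU salinity and depths
    down to about 1000 m; it is an approximation, not a precise standard.
    """
    t = temp_c
    return (1449.2
            + 4.6 * t
            - 0.055 * t ** 2
            + 0.00029 * t ** 3
            + (1.34 - 0.010 * t) * (salinity_psu - 35.0)
            + 0.016 * depth_m)


# Illustrative surface conditions: salinity 35 PSU, depth 0 m.
c_at_15 = sound_speed_medwin(15.0, 35.0, 0.0)
c_at_16 = sound_speed_medwin(16.0, 35.0, 0.0)
delta_per_degree = c_at_16 - c_at_15
```

Under this approximation a one-degree warming near 15 °C shifts the local sound speed by roughly 3 m/s, about ten times the 0.3 m/s error level the paper reports. Temperature is therefore a strong lever on sound speed, but that alone does not establish that surface readings carry enough information about deeper layers.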

What would settle it

Collecting direct underwater sonar measurements in the same locations and comparing them to the model's 3D field predictions; if the differences exceed 0.3 m/s on average, the claim would not hold.
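
The test described above reduces to a simple threshold check on the average absolute difference. A minimal sketch follows, using the 0.3 m/s figure from the paper's claim. The function names and the sample values are hypothetical, and a real comparison would also need matched locations, depths and times.

```python
def mean_absolute_error(predicted, measured):
    """Average absolute difference between predictions and measurements."""
    if len(predicted) != len(measured):
        raise ValueError("sequences must have the same length")
    return sum(abs(p - m) for p, m in zip(predicted, measured)) / len(predicted)


def claim_survives(predicted, measured, threshold=0.3):
    """True if the average error stays at or below the threshold (m/s)."""
    return mean_absolute_error(predicted, measured) <= threshold


# Hypothetical direct measurements and model output at four depths (m/s).
measured = [1520.0, 1505.0, 1492.0, 1485.0]
good_prediction = [1520.2, 1504.9, 1492.4, 1484.7]
biased_prediction = [value + 0.5 for value in measured]

good_result = claim_survives(good_prediction, measured)
biased_result = claim_survives(biased_prediction, measured)
```

The first prediction averages about 0.25 m/s of error and passes. The second carries a constant 0.5 m/s bias and fails, even though it tracks the shape of the profile perfectly.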

Figures

Figures reproduced from arXiv: 2507.11812 by Fang Ji, Hao Zhang, Jixuan Zhou, Qian Sun, Tianhe Xu, Wei Huang, Yuqiang Huang.

**Figure 2.** The proposed MDF-RAGAN model for SSP estimation, which consists of a ...

**Figure 3.** Visualization of intermediate features in the MDF-RAGAN model. (a)-(c) ...

**Figure 4.** Comparison of sound speed profile predictions at different locations and depths ...

**Figure 5.** Comparison of sound speed profiles at different locations and depths. Each ...

**Figure 6.** t-SNE visualization of intermediate features in MDF-RAGAN. (a)-(c) show the ...

**Figure 7.** ECDF comparison of absolute prediction errors for different methods at various ...

Original abstract

Sound speed profiles (SSPs) are crucial underwater parameters that determine the propagation patterns of acoustic signals, directly influencing the energy efficiency of underwater communication and the accuracy of positioning systems. Conventional techniques for obtaining SSPs, such as matched field processing (MFP), compressive sensing (CS), and deep learning (DL), typically depend on on-site sonar measurements, which impose stringent requirements on the deployment of underwater observation systems. To overcome this limitation and enable high-precision sound speed field reconstruction without the need for on-site underwater data collection, we propose a novel multimodal data-fusion generative adversarial network enhanced with residual attention blocks (MDF-RAGAN). This architecture integrates attention mechanisms to capture global spatial feature correlations effectively, while residual modules are employed to extract subtle perturbations in deep-ocean sound velocity distribution caused by sea surface temperature (SST) variations. Experimental results on a public real-world dataset demonstrate that the proposed model outperforms other state-of-the-art methods, achieving an estimation error of less than 0.3 m/s. Specifically, MDF-RAGAN reduces the root mean square error (RMSE) by nearly half compared to convolutional neural network (CNN) and spatial interpolation (SITP) methods, and attains a 65.8% RMSE reduction relative to the mean profile method. These results highlight the effectiveness of multi-source fusion and cross-modal attention in enhancing the accuracy and robustness of sound speed profile reconstruction.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

private letter to a colleague

The paper puts together a GAN with attention and residual blocks to reconstruct 3D sound speed fields from surface multimodal data, reporting solid RMSE gains on one public dataset, but the central premise that surface observations can stand in for direct deep measurements stays untested.

The main takeaway is that this work assembles existing deep-learning components into MDF-RAGAN for real-time 3D underwater sound speed field construction using only surface data like SST and other multimodal inputs, without on-site sonar. On a public dataset it claims error below 0.3 m/s, roughly halving RMSE versus CNN and spatial interpolation baselines and cutting it by 65.8 percent versus the mean profile. That is a concrete empirical result worth noting for anyone working on underwater acoustics applications where deploying sensors is costly.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes MDF-RAGAN, a multimodal data-fusion generative adversarial network incorporating residual attention blocks, to construct real-time 3D underwater sound speed fields from surface observations such as sea surface temperature without requiring on-site underwater sonar measurements. Attention mechanisms capture global spatial correlations while residual modules extract subtle deep-ocean velocity perturbations. On a public real-world dataset the model is reported to outperform CNN, SITP and mean-profile baselines, achieving estimation error below 0.3 m/s, nearly halving RMSE relative to CNN/SITP and a 65.8% RMSE reduction versus the mean profile.

Significance. If the empirical results prove robust and the surface-to-deep proxy relationship holds, the work could materially reduce dependence on expensive underwater observation infrastructure, benefiting acoustic communication efficiency and positioning accuracy. The technical combination of cross-modal attention and residual blocks for multimodal fusion is a plausible direction for ocean-acoustic reconstruction tasks.

major comments (2)

Abstract: the central performance claims (error <0.3 m/s, RMSE halved vs. CNN/SITP, 65.8% reduction vs. mean profile) are presented without any description of training procedures, validation splits, error bars, ablation studies or statistical testing. These details are load-bearing for the claim that the model outperforms state-of-the-art methods.
Abstract: the reconstruction claim rests on the untested premise that sea-surface temperature and other multimodal surface observations suffice to capture subtle deep-ocean sound-velocity perturbations. No physical derivation, sensitivity analysis, or comparison against independent deep measurements (e.g., CTD casts) is supplied to substantiate generalization beyond dataset-specific correlations.
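
One inexpensive way to supply the error bars requested in the first major comment is a percentile bootstrap over per-sample errors. The sketch below is a generic illustration in standard-library Python and not a procedure described in the paper. The function names, the resample count, the seed and the sample errors are all assumptions.

```python
import math
import random


def rmse_from_errors(errors):
    """RMSE computed directly from signed per-sample errors."""
    return math.sqrt(sum(e * e for e in errors) / len(errors))


def bootstrap_rmse_interval(errors, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for RMSE.

    Resamples the errors with replacement, recomputes RMSE each time and
    returns the alpha/2 and 1 - alpha/2 percentiles of those values.
    """
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    n = len(errors)
    stats = sorted(
        rmse_from_errors(rng.choices(errors, k=n)) for _ in range(n_resamples)
    )
    low_index = int((alpha / 2) * n_resamples)
    high_index = min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))
    return stats[low_index], stats[high_index]


# Hypothetical signed prediction errors in m/s.
demo_errors = [0.1, -0.2, 0.3, -0.1, 0.25, -0.15, 0.05, -0.3]
demo_low, demo_high = bootstrap_rmse_interval(demo_errors)

# Degenerate check: identical errors leave no spread to estimate.
flat_low, flat_high = bootstrap_rmse_interval([0.2] * 10)
```

Reporting an interval like `(demo_low, demo_high)` next to each headline RMSE would let a reader judge whether the gap between methods exceeds sampling noise. For spatially correlated ocean data a block bootstrap may be more appropriate than resampling individual points.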

minor comments (2)

The abstract would benefit from explicit listing of the additional multimodal surface data sources beyond SST that are fused by the model.
Notation for network components (MDF-RAGAN, residual attention blocks) is introduced clearly but should be expanded with a diagram or pseudocode in the methods section to aid reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed comments. We address each major comment point by point below, indicating where revisions will be made to strengthen the manuscript.

Referee: Abstract: the central performance claims (error <0.3 m/s, RMSE halved vs. CNN/SITP, 65.8% reduction vs. mean profile) are presented without any description of training procedures, validation splits, error bars, ablation studies or statistical testing. These details are load-bearing for the claim that the model outperforms state-of-the-art methods.

Authors: We agree that the abstract, constrained by length, does not detail the experimental protocol. The full manuscript covers training procedures (Section 3.2), dataset splits and cross-validation (Section 4.1), ablation studies (Section 4.3), and comparative results with error metrics. In the revision we will append a concise clause to the abstract referencing the validation framework and ensure error bars appear on all reported performance figures.

revision: yes

Referee: Abstract: the reconstruction claim rests on the untested premise that sea-surface temperature and other multimodal surface observations suffice to capture subtle deep-ocean sound-velocity perturbations. No physical derivation, sensitivity analysis, or comparison against independent deep measurements (e.g., CTD casts) is supplied to substantiate generalization beyond dataset-specific correlations.

Authors: The MDF-RAGAN model is trained end-to-end on a public dataset containing paired surface and in-situ underwater observations, allowing it to learn empirical correlations. While we do not derive a new first-principles physical model, the architecture is motivated by established oceanographic links between SST and sound-speed variability. We will add an explicit sensitivity analysis subsection and a limitations paragraph discussing dataset-specific generalization. Direct comparison against additional independent CTD casts lies outside the present data resources and will be flagged as future work.

revision: partial

Circularity Check

0 steps flagged

No circularity: empirical ML model evaluation on external dataset

The paper introduces MDF-RAGAN, a generative adversarial network architecture that fuses multimodal surface observations to reconstruct 3D underwater sound speed fields, and reports empirical RMSE reductions on a public real-world dataset (error <0.3 m/s, ~50% better than CNN/SITP, 65.8% better than mean profile). No mathematical derivation chain, first-principles equations, or predictions are presented that reduce by construction to fitted inputs or self-citations. The central claims rest on standard supervised training and held-out evaluation rather than any self-definitional mapping or load-bearing self-citation. This is the expected non-circular outcome for an applied neural-network paper whose results are falsifiable against the cited dataset.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 0 invented entities

The approach rests on the domain assumption that surface multimodal observations suffice as proxies for deep sound speed structure, plus standard neural-network training assumptions. Numerous free parameters exist inside the attention and residual blocks but are not enumerated in the abstract.

free parameters (2)

Attention and residual block weights
Learnable parameters inside the attention mechanisms and residual modules are optimized on the training data to achieve the reported RMSE reductions.
Multimodal fusion coefficients
Parameters that control the relative contribution of different surface data streams (e.g., SST) are fitted during training.

axioms (1)

domain assumption: Sea surface temperature variations and other multimodal surface data can capture subtle perturbations in deep-ocean sound velocity distribution.
This premise underpins the claim that reconstruction is possible without on-site underwater measurements.

pith-pipeline@v0.9.0 · 5814 in / 1566 out tokens · 54849 ms · 2026-05-19T05:08:47.745462+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Paper passage: "residual modules for deeply capturing small disturbances in the deep ocean sound velocity distribution caused by changes of SST"

IndisputableMonolith/Foundation/AlexanderDuality.lean · alexander_duality_circle_linking · unclear

Paper passage: "cross-modal perturbation attention block... Q, K, V projections and scaled dot-product attention"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
