
arxiv: 2605.16304 · v2 · pith:GKBFZ2PLnew · submitted 2026-04-24 · 📡 eess.SP · cs.SD

Modulation Feature Enhancement with a Multi-Stage Attention Network for Underwater Acoustic Target Recognition

Pith reviewed 2026-05-22 10:04 UTC · model grok-4.3

classification 📡 eess.SP cs.SD
keywords underwater acoustic target recognition · ship-radiated noise · variational mode decomposition · attention mechanism · class imbalance · modulation spectrum · deep learning · DEMON features

The pith

A CNN with multi-stage attention and adjusted focal loss improves recognition of ships from underwater noise.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tackles the classification of ships from their complex, noisy underwater radiated sound, where the data are often severely imbalanced across ship types. It first applies variational mode decomposition (VMD) together with the 3/2-D spectrum to build fused 2-D DEMON spectral features, which the authors say capture modulation envelope information with high fidelity. These features feed a one-dimensional convolutional network equipped with a multi-stage, multi-type attention mechanism that refines features at successive depths using two new attention blocks. An adjustable class-balanced focal loss then compensates for the imbalance during training. Experiments on real-world ship-noise recordings are reported to show that the full pipeline improves recognition performance.
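For orientation, the conventional DEMON baseline that the VMD-based extractor is meant to improve on can be sketched in a few lines. This is not the authors' VMD plus 3/2-D spectrum method; it is the textbook square-law variant, run on a synthetic amplitude-modulated noise signal. The sampling rate, modulation frequency, modulation depth, and the function name `demon_spectrum` are all illustration choices, not values from the paper.

```python
import math
import random


def demon_spectrum(x, max_bin):
    """Square-law DEMON sketch: envelope power, DC removal, DFT magnitudes.

    Returns a list where element k - 1 is the magnitude at DFT bin k,
    for k = 1 .. max_bin.
    """
    env = [v * v for v in x]                 # square-law envelope detector
    mean = sum(env) / len(env)
    env = [v - mean for v in env]            # remove DC so it cannot mask the lines
    n = len(env)
    mags = []
    for k in range(1, max_bin + 1):          # naive DFT, low-frequency bins only
        re = sum(e * math.cos(2 * math.pi * k * i / n) for i, e in enumerate(env))
        im = sum(e * math.sin(2 * math.pi * k * i / n) for i, e in enumerate(env))
        mags.append(math.hypot(re, im))
    return mags


# Synthetic stand-in for propeller-modulated broadband noise (assumed values).
fs = 1024.0      # sampling rate in Hz
n = 2048         # number of samples, so each DFT bin spans fs / n = 0.5 Hz
f_mod = 8.0      # modulation (shaft or blade) frequency in Hz
depth = 0.5      # modulation depth
rng = random.Random(0)
signal = [
    (1.0 + depth * math.cos(2 * math.pi * f_mod * i / fs)) * rng.gauss(0.0, 1.0)
    for i in range(n)
]

mags = demon_spectrum(signal, max_bin=60)
peak_bin = max(range(len(mags)), key=mags.__getitem__) + 1
peak_hz = peak_bin * fs / n
print(f"strongest DEMON line at {peak_hz} Hz")
```

The strongest line falls at the modulation frequency because squaring the signal moves the envelope to baseband. A controlled comparison of the kind the referee asks for would swap this front end for the VMD-based one while keeping the network and loss fixed.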

Core claim

The authors claim that three components together produce measurably better classification of real ship-radiated acoustic signals despite class imbalance and ocean noise: high-fidelity 2-D DEMON spectral features built from variational mode decomposition and the 3/2-D spectrum; a 1-D CNN augmented by the Multi-Stage Multi-Type Attention Mechanism, which includes Residual Channel-Independent Spectral Attention and Multi-Scale Separate-and-Fuse Spectral Attention; and an Adjustable Class-Balanced Focal Loss.

What carries the argument

The Multi-Stage Multi-Type Attention Mechanism (MMATT), which adaptively refines spectral features at multiple depths inside the 1-D CNN using residual channel-independent and multi-scale separate-and-fuse spectral attention blocks.
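The residual channel-independent block can be sketched from the three steps given in the text beside Figure 4: a depthwise convolution per channel, a sigmoid that maps the result to attention weights in [0, 1], and an element-wise product added back to the input. The version below is a minimal pure-Python sketch under those stated steps only; the kernel sizes, kernel values, zero padding, and function names are assumptions, and in a trained network the kernels would be learned.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def depthwise_conv1d(x, kernels):
    """Convolve each channel with its own odd-length kernel (zero padding).

    x is a list of channels, each a list of frequency bins. No information
    is mixed across channels, which is what "channel-independent" means here.
    """
    out = []
    for channel, kernel in zip(x, kernels):
        half = len(kernel) // 2
        row = []
        for i in range(len(channel)):
            acc = 0.0
            for j, w in enumerate(kernel):
                idx = i + j - half
                if 0 <= idx < len(channel):
                    acc += w * channel[idx]
            row.append(acc)
        out.append(row)
    return out


def residual_spectral_attention(x, kernels):
    """Return (output, weights) with output = x + x * sigmoid(depthwise_conv(x))."""
    scores = depthwise_conv1d(x, kernels)
    weights = [[sigmoid(s) for s in row] for row in scores]
    out = [
        [v + v * a for v, a in zip(channel, w_row)]
        for channel, w_row in zip(x, weights)
    ]
    return out, weights


# Two channels of five frequency bins; kernel values are hypothetical.
x = [[0.0, 1.0, 4.0, 1.0, 0.0],
     [2.0, 2.0, 2.0, 2.0, 2.0]]
kernels = [[0.0, 1.0, 0.0],   # identity kernel: score equals the bin value
           [0.0, 0.0, 0.0]]   # zero kernel: every weight becomes 0.5
out, weights = residual_spectral_attention(x, kernels)
```

Because of the residual path, a bin is scaled by a factor between 1 and 2 rather than being suppressed towards zero, so the block can emphasise spectral lines without discarding the input.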

If this is right

  • The extracted modulation features become more discriminative for ship classes under realistic noise.
  • The network can maintain high performance even when training data has large differences in the number of examples per class.
  • Feature refinement happens automatically at several stages rather than only at the input or output.
  • Overall recognition rates rise on maritime surveillance tasks that rely on passive acoustic listening.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same attention blocks could be dropped into other one-dimensional signal networks for tasks such as engine fault detection or speech enhancement.
  • If the VMD-derived spectra prove stable across different ocean conditions, similar decomposition steps may simplify preprocessing pipelines for other noisy time-series problems.
  • The adjustable loss parameter offers a practical knob that future work could tune automatically for new acoustic datasets.

Load-bearing premise

The VMD-based 3/2-D spectrum isolates modulation envelope information that reliably distinguishes ship classes while staying stable against real ocean noise variations.

What would settle it

A head-to-head test on the same real-world ship-radiated noise dataset in which the full proposed pipeline shows no statistically significant accuracy gain over a plain 1-D CNN trained with standard cross-entropy loss.
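One way to make "statistically significant" concrete, assuming both models are scored on the same test clips, is an exact McNemar test on the paired predictions. The paper does not name a test, so the choice of McNemar, the function name, and the toy labels below are assumptions for illustration only.

```python
from math import comb


def mcnemar_exact(y_true, pred_a, pred_b):
    """Two-sided exact McNemar test on paired predictions.

    b counts clips that only model A classifies correctly, c counts clips
    that only model B classifies correctly. Under the null hypothesis the
    discordant clips split 50/50, so the p-value is a binomial tail.
    """
    b = sum(1 for t, a, m in zip(y_true, pred_a, pred_b) if a == t and m != t)
    c = sum(1 for t, a, m in zip(y_true, pred_a, pred_b) if a != t and m == t)
    n = b + c
    if n == 0:
        return b, c, 1.0
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return b, c, min(1.0, 2.0 * tail)


# Toy labels, not results from the paper.
y_true = [1] * 10
baseline = [1, 1] + [0] * 8      # plain 1-D CNN stand-in: 2 of 10 correct
proposed = [1] * 10              # full pipeline stand-in: 10 of 10 correct
b, c, p_value = mcnemar_exact(y_true, baseline, proposed)
print(b, c, p_value)
```

A paired test is used because both models see identical clips; comparing two independent accuracy figures would ignore that pairing and lose power.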

Figures

Figures reproduced from arXiv: 2605.16304 by Chunjin Jiang, Jiaping Yu, Linlin Mao, Shefeng Yan, Zeping Sui.

Figure 1. 2-D DEMON spectral feature extraction and fusion process.
Figure 2. 2-D DEMON spectral feature.
Adjacent text (Section 2.2, Network architecture): In this study, the 1-D CNN is employed to construct the recognition model, in which three representative attention mechanisms are incorporated. The architecture of the proposed model is illustrated in …
Figure 3. The framework of the recognition model.
Adjacent text: … detailed in Section 2.3. All convolutional modules share the same fundamental architecture, including a 1-D convolutional layer, an activation layer (ReLU), a batch normalization layer, and a pooling layer (max-pooling). Batch normalization is applied to prevent vanishing and exploding gradients, thereby accelerating convergence and improving stability. In the Fully C…
Figure 4. Residual Channel-Independent Spectral Attention Mechanism.
Adjacent text: First, depthwise convolution is applied to independently capture the information of each channel. Next, a sigmoid function is used to generate frequency attention weights within the range [0, 1]. Finally, the element-wise product of the attention weights and the input data is computed and then added back to the input. In contrast to the traditional SA…
Figure 5. Multi-Scale Separate-and-Fuse Spectral Attention Mechanism. The multi-scale spectral attention weights are first computed per channel and then fused across channels.
Figure 6. Channel attention mechanism.
Adjacent text (Section 2.4, Adjustable Class-Balanced Focal Loss; Section 2.4.1, Adjustable focal loss): The focal loss function proposed in [26] can be given as

$$\mathrm{FL}(p) = -(1 - p)^{\gamma} \log(p) \tag{6}$$

where $p$ is the estimated probability of the true class and $\gamma \ge 0$ denotes the tunable focusing parameter. For hard, misclassified examples where $p$ is small, the focal loss approximates the cross-entropy loss. For easy well-classi…
Figure 7. The distribution of deep features of three models with different input features.
Adjacent text (Section 4.2, Recognition model evaluation): To evaluate model performance on long-tailed ship-radiated noise data, we conduct comparative experiments. The Trad-CNN model from Section 4.1 is used as the baseline. Our VMD-Fusion-CNN-MMATT-ImFL integrates three components: VMD-Fusion (VMD-based fused 2-D DEMON input), MMATT (Multi-Stage Mul…
Figure 8. Confusion matrices of three recognition models.
Figure 9. The results of comparative experiments.
Figure 10. Heat map of attention weights for R-CISAM.
Figure 11. Recognition results of the ablation experiments.
Adjacent text (Section 4.4, Analysis of Adjustable Class-Balanced Focal Loss): The tunability of ACBFL primarily stems from two adjustable parameters, β and q. We investigate their influence on model recognition performance. The parameters β and q range from 0 to 1 and are determined via 5-fold cross-validation on the training set, with optimal values varying across training sets. W…
Figure 12. The results of adjustable parameters and fixed parameters.
Original abstract

Underwater acoustic target recognition is critical for maritime applications, yet it faces challenges arising from the complex and diverse nature of ship-radiated noise. To address these issues, we propose a robust deep learning-based framework. First, we introduce a feature extraction and fusion method based on variational mode decomposition (VMD) and the 3/2-D spectrum to generate high-fidelity 2-D DEMON spectral features, which effectively capture modulation envelope information. To further enhance feature representation, we design a one-dimensional convolutional neural network (1-D CNN) integrated with a novel Multi-Stage Multi-Type Attention Mechanism (MMATT) that adaptively refines features at different network depths. Within this mechanism, we propose a Residual Channel-Independent Spectral Attention Mechanism (R-CISAM) and a Multi-Scale Separate-and-Fuse Spectral Attention Mechanism (MS-SFSAM). Moreover, to mitigate performance degradation caused by severe class imbalance inherent in real-world ship-radiated noise data, we devise an Adjustable Class-Balanced Focal Loss (ACBFL), which provides flexibility across tasks with varying degrees of imbalance. Experimental results on a real-world ship-radiated noise dataset demonstrate that the proposed solutions effectively enhance underwater acoustic target recognition performance.
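The focal loss that ACBFL builds on is quoted in the Figure 6 text as Eq. (6), FL(p) = -(1 - p)^γ log(p). The short sketch below evaluates that formula next to plain cross-entropy to show the down-weighting of easy examples. It covers only the standard focal loss from [26]; the class-balancing term and the adjustable parameters of ACBFL are not defined in the material available here, so they are deliberately left out. Function names are illustrative.

```python
import math


def cross_entropy(p):
    """Cross-entropy for the true class, with p its predicted probability."""
    return -math.log(p)


def focal_loss(p, gamma):
    """Focal loss FL(p) = -(1 - p)**gamma * log(p), with gamma >= 0."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p must lie in (0, 1]")
    if gamma < 0.0:
        raise ValueError("gamma must be non-negative")
    return -((1.0 - p) ** gamma) * math.log(p)


# Ratio of focal loss to cross-entropy is the modulating factor (1 - p)**gamma.
for p in (0.1, 0.5, 0.9):
    ratio = focal_loss(p, gamma=2.0) / cross_entropy(p)
    print(f"p={p}: focal / cross-entropy = {ratio:.2f}")
```

With γ = 0 the expression reduces to cross-entropy. With γ = 2 a hard example at p = 0.1 keeps 81% of its cross-entropy loss, while an easy example at p = 0.9 keeps only 1%, which is why the loss concentrates training on hard examples.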

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes a deep learning framework for underwater acoustic target recognition from ship-radiated noise. It first extracts 2-D DEMON spectral features via variational mode decomposition (VMD) combined with a 3/2-D spectrum to capture modulation envelope information. These features are then processed by a 1-D CNN augmented with a Multi-Stage Multi-Type Attention Mechanism (MMATT) that incorporates Residual Channel-Independent Spectral Attention (R-CISAM) and Multi-Scale Separate-and-Fuse Spectral Attention (MS-SFSAM). To address class imbalance, the authors introduce an Adjustable Class-Balanced Focal Loss (ACBFL). Experiments on a real-world ship-radiated noise dataset are reported to demonstrate performance improvements.

Significance. If the central performance claims hold after addressing the experimental gaps, the work could advance practical underwater acoustic recognition by providing a pipeline that jointly improves modulation feature quality and handles severe class imbalance common in maritime datasets. The explicit design of MMATT components and the adjustable loss offer concrete, implementable ideas that could be tested in other noisy signal domains.

major comments (2)
  1. [Experimental results] Experimental results section: the reported accuracy gains on the real-world dataset are presented without quantitative baselines (e.g., standard DEMON + 1-D CNN), error bars, or ablation studies that isolate the contribution of the VMD + 3/2-D spectrum feature extractor while holding the MMATT 1-D CNN and ACBFL fixed. This omission makes it impossible to attribute the headline improvement to the proposed feature extraction step rather than to the attention modules or loss alone.
  2. [Feature extraction] Feature extraction section (abstract and §3): the claim that the VMD-derived 3/2-D spectrum 'effectively capture[s] modulation envelope information' that is both class-discriminative and robust to ocean noise is asserted but not tested via a controlled swap against a conventional Hilbert-envelope DEMON spectrum or EMD baseline. Without this isolation, the weakest assumption identified in the reader's report remains unaddressed and load-bearing for the central claim.
minor comments (1)
  1. [§3] The notation for the 3/2-D spectrum and the precise definition of the adjustable balancing factor in ACBFL should be given explicitly with equations rather than descriptive text only.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. The comments highlight important aspects of experimental validation that will improve the clarity and rigor of our claims. We respond to each major comment below and indicate the revisions we will make.

Point-by-point responses
  1. Referee: [Experimental results] Experimental results section: the reported accuracy gains on the real-world dataset are presented without quantitative baselines (e.g., standard DEMON + 1-D CNN), error bars, or ablation studies that isolate the contribution of the VMD + 3/2-D spectrum feature extractor while holding the MMATT 1-D CNN and ACBFL fixed. This omission makes it impossible to attribute the headline improvement to the proposed feature extraction step rather than to the attention modules or loss alone.

    Authors: We agree that the current presentation of results does not sufficiently isolate the contribution of the proposed VMD + 3/2-D spectrum feature extractor. In the revised manuscript we will add quantitative comparisons against a standard DEMON + 1-D CNN baseline, report mean accuracy with standard deviations across multiple random seeds to provide error bars, and include ablation studies that hold the MMATT 1-D CNN and ACBFL fixed while varying only the feature extraction pipeline. These additions will enable clearer attribution of performance gains. revision: yes

  2. Referee: [Feature extraction] Feature extraction section (abstract and §3): the claim that the VMD-derived 3/2-D spectrum 'effectively capture[s] modulation envelope information' that is both class-discriminative and robust to ocean noise is asserted but not tested via a controlled swap against a conventional Hilbert-envelope DEMON spectrum or EMD baseline. Without this isolation, the weakest assumption identified in the reader's report remains unaddressed and load-bearing for the central claim.

    Authors: The design of the VMD + 3/2-D spectrum is motivated by VMD's superior handling of non-stationary signals compared with EMD. We acknowledge, however, that the manuscript does not contain a direct head-to-head comparison against a conventional Hilbert-envelope DEMON spectrum or an EMD-based baseline while keeping the downstream MMATT network and ACBFL fixed. We will perform and report these controlled experiments in the revision to empirically support the claim that the proposed feature extractor yields more class-discriminative and noise-robust modulation information. revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical ML proposal with independent components

full rationale

The paper introduces VMD-based 3/2-D spectrum feature extraction, MMATT attention blocks (R-CISAM and MS-SFSAM), and ACBFL loss as externally motivated engineering choices for underwater acoustic data. Performance claims rest on end-to-end experiments on a real-world ship-radiated noise dataset rather than any derivation that reduces accuracy or feature discriminability to a fitted parameter or self-citation by construction. No equations equate outputs to inputs, and cited techniques (VMD, focal loss variants) are standard and independent of the target result.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The framework rests on standard signal-processing assumptions about VMD mode separation and on the empirical effectiveness of attention for spectral features; no new physical entities are postulated.

free parameters (1)
  • Adjustable balancing factor in ACBFL
    The loss contains a tunable parameter that controls class balancing strength and is chosen to suit different imbalance degrees.
axioms (1)
  • domain assumption VMD decomposition plus 3/2-D spectrum produces high-fidelity modulation features
    Invoked in the first paragraph of the abstract as the basis for the 2-D DEMON spectral features.
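The text beside Figure 11 says that the ACBFL parameters β and q lie between 0 and 1 and are chosen by 5-fold cross-validation on the training set. A selection loop of that shape might look like the sketch below. The grid values, the function names, and the `score_fn` callback are all hypothetical; the toy scoring function is a placeholder that ignores the data and only exists so the example runs.

```python
import random
from itertools import product


def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_indices, val_indices) for k shuffled, non-stratified folds."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        val = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, val


def select_parameters(n_samples, beta_grid, q_grid, score_fn, k=5):
    """Return (mean_score, beta, q) with the highest mean validation score."""
    best = None
    for beta, q in product(beta_grid, q_grid):
        scores = [
            score_fn(beta, q, train, val)
            for train, val in k_fold_indices(n_samples, k)
        ]
        mean_score = sum(scores) / len(scores)
        if best is None or mean_score > best[0]:
            best = (mean_score, beta, q)
    return best


def toy_score(beta, q, train_idx, val_idx):
    """Placeholder: a real score_fn would train on train_idx and score val_idx."""
    return -((beta - 0.9) ** 2 + (q - 0.5) ** 2)


_, best_beta, best_q = select_parameters(
    n_samples=23,
    beta_grid=[0.5, 0.9, 0.99],     # hypothetical grid
    q_grid=[0.25, 0.5, 0.75],       # hypothetical grid
    score_fn=toy_score,
)
print(best_beta, best_q)
```

For long-tailed data, stratified folds and a metric that reflects minority-class performance would likely be preferable to the plain shuffled split shown here; the source does not say which variant the authors used.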



Reference graph

Works this paper leans on

32 extracted references · 32 canonical work pages

  1. W. Wang, S. Yan, L. Mao, Z. Sui, J. Yang, Robust direct position determination for chirp signal-based underwater acoustic sensor networks, Signal Processing 230 (2025) 109841. doi:10.1016/j.sigpro.2024.109841

  2. J. Yang, S. Yan, L. Mao, Z. Sui, W. Wang, D. Zeng, Underwater acoustic signal denoising based on sparse TQWT and wavelet thresholding, Digital Signal Processing 153 (2024) 104601. doi:10.1016/j.dsp.2024.104601

  3. Z. Sui, S. Yan, Frequency Channel Equalization Based on Variable Step-Size LMS Algorithm for OFDM Underwater Communications, in: 2019 IEEE International Conference on Signal Processing, Communications and Computing (ICSPCC), 2019, pp. 1–5. doi:10.1109/ICSPCC46631.2019.8960813

  4. Y. Lecun, L. Bottou, Y. Bengio, P. Haffner, Gradient-based learning applied to document recognition, Proceedings of the IEEE 86 (11) (1998) 2278–2324. doi:10.1109/5.726791

  5. P. Zhu, Y. Zhang, Y. Huang, C. Zhao, K. Zhao, F. Zhou, Underwater acoustic target recognition based on spectrum component analysis of ship radiated noise, Applied Acoustics 211 (2023) 109552. doi:10.1016/j.apacoust.2023.109552

  6. F. Liu, T. Shen, Z. Luo, D. Zhao, S. Guo, Underwater target recognition using convolutional recurrent neural networks with 3-D Mel-spectrogram and data augmentation, Applied Acoustics 178 (2021) 107989. doi:10.1016/j.apacoust.2021.107989

  7. A. Pollara, A. Sutin, H. Salloum, Improvement of the Detection of Envelope Modulation on Noise (DEMON) and its application to small boats, in: OCEANS 2016 MTS/IEEE Monterey, 2016, pp. 1–10. doi:10.1109/OCEANS.2016.7761197

  8. P. Clark, I. Kirsteins, L. Atlas, Multiband analysis for colored amplitude-modulated ship noise, in: 2010 IEEE International Conference on Acoustics, Speech and Signal Processing, 2010, pp. 3970–3973. doi:10.1109/ICASSP.2010.5495776

  9. Z. Chen, Q. Liu, Y. Wang, SNR-based weighted fusion algorithm of multiple sub-band DEMON spectrum, in: 2017 IEEE 2nd Advanced Information Technology, Electronic and Automation Control Conference (IAEAC), 2017, pp. 2305–2308. doi:10.1109/IAEAC.2017.8054432

  10. K. Dragomiretskiy, D. Zosso, Variational Mode Decomposition, IEEE Transactions on Signal Processing 62 (3) (2014) 531–544. doi:10.1109/TSP.2013.2288675

  11. C. Zhu, T. Cao, L. Chen, X. Dai, Q. Ge, X. Zhao, High-Order Domain Feature Extraction Technology for Ocean Acoustic Observation Signals: A Review, IEEE Access 11 (2023) 17665–17683. doi:10.1109/ACCESS.2023.3244782

  12. Q. Zhang, L. Da, Y. Zhang, Y. Hu, Integrated neural networks based on feature fusion for underwater target recognition, Applied Acoustics 182 (2021) 108261. doi:10.1016/j.apacoust.2021.108261

  13. L. Sichun, Y. Desen, DEMON Feature Extraction of Acoustic Vector Signal based on 3/2-D spectrum, in: 2007 2nd IEEE Conference on Industrial Electronics and Applications, 2007, pp. 2239–2243. doi:10.1109/ICIEA.2007.4318809

  14. D. Xu, H. Zheng, Q. Hu, A Novel Feature Extraction Method for Underwater Acoustic Target Based on Parameter Optimized VMD and 1(1/2)-D Spectrum, in: Proceedings of the 2020 4th International Conference on Digital Signal Processing, ICDSP ’20, Association for Computing Machinery, New York, NY, USA, 2020, pp. 86–91. doi:10.1145/3408127.3408149

  15. Z. Niu, G. Zhong, H. Yu, A review on the attention mechanism of deep learning, Neurocomputing 452 (2021) 48–62. doi:10.1016/j.neucom.2021.03.091

  16. L. Zhao, Y. Song, J. Xiong, J. Xu, D. Li, F. Liu, T. Shen, A time-delay neural network for ship-radiated noise recognition based on residual block and attention mechanism, Digital Signal Processing 149 (2024) 104504. doi:10.1016/j.dsp.2024.104504

  17. B. Wang, W. Zhang, Y. Zhu, C. Wu, S. Zhang, An Underwater Acoustic Target Recognition Method Based on AMNet, IEEE Geoscience and Remote Sensing Letters 20 (2023) 1–5. doi:10.1109/LGRS.2023.3235659

  18. S. Yang, A. Jin, X. Zeng, H. Wang, X. Hong, M. Lei, Underwater acoustic target recognition based on sub-band concatenated Mel spectrogram and multidomain attention mechanism, Engineering Applications of Artificial Intelligence 133 (2024) 107983. doi:10.1016/j.engappai.2024.107983

  19. S. Woo, J. Park, J.-Y. Lee, I. S. Kweon, CBAM: Convolutional Block Attention Module, in: Computer Vision – ECCV 2018, Springer-Verlag, Berlin, Heidelberg, 2018, pp. 3–19. doi:10.1007/978-3-030-01234-2_1

  20. B. Kang, S. Xie, M. Rohrbach, Z. Yan, A. Gordo, J. Feng, Y. Kalantidis, Decoupling representation and classifier for long-tailed recognition, in: Eighth International Conference on Learning Representations (ICLR), 2020

  21. J. Wang, W. Zhang, Y. Zang, Y. Cao, J. Pang, T. Gong, K. Chen, Z. Liu, C. C. Loy, D. Lin, Seesaw Loss for Long-Tailed Instance Segmentation, in: 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021, pp. 9690–9699. doi:10.1109/CVPR46437.2021.00957

  22. Z. Wang, G. Cao, X. Xi, J. Wang, OpenNet: Incremental Learning for Autonomous Driving Object Detection with Balanced Loss, in: 2023 IEEE International Conference on Systems, Man, and Cybernetics (SMC), 2023, pp. 2675–2682. doi:10.1109/SMC53992.2023.10394429

  23. C. Huang, Y. Li, C. C. Loy, X. Tang, Learning Deep Representation for Imbalanced Classification, in: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 5375–5384. doi:10.1109/CVPR.2016.580

  24. T. Mikolov, I. Sutskever, K. Chen, G. Corrado, J. Dean, Distributed representations of words and phrases and their compositionality, in: Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 2, NIPS’13, Curran Associates Inc., Red Hook, NY, USA, 2013, pp. 3111–3119

  25. D. Mahajan, R. Girshick, V. Ramanathan, K. He, M. Paluri, Y. Li, A. Bharambe, L. van der Maaten, Exploring the Limits of Weakly Supervised Pretraining, in: V. Ferrari, M. Hebert, C. Sminchisescu, Y. Weiss (Eds.), Computer Vision – ECCV 2018, Springer International Publishing, Cham, 2018, pp. 185–201.

  26. T.-Y. Lin, P. Goyal, R. Girshick, K. He, P. Dollár, Focal Loss for Dense Object Detection, in: 2017 IEEE International Conference on Computer Vision (ICCV), 2017, pp. 2999–3007. doi:10.1109/ICCV.2017.324

  27. Y. Dong, X. Shen, Z. Jiang, H. Wang, Recognition of imbalanced underwater acoustic datasets with exponentially weighted cross-entropy loss, Applied Acoustics 174 (2021) 107740. doi:10.1016/j.apacoust.2020.107740

  28. Y. Ma, M. Liu, Y. Zhang, B. Zhang, K. Xu, B. Zou, Z. Huang, Imbalanced Underwater Acoustic Target Recognition with Trigonometric Loss and Attention Mechanism Convolutional Network, Remote Sensing 14 (16) (2022). doi:10.3390/rs14164103

  29. D. Santos-Domínguez, S. Torres-Guijarro, A. Cardenal-López, A. Pena-Gimenez, ShipsEar: An underwater vessel noise database, Applied Acoustics 113 (2016) 64–69. doi:10.1016/j.apacoust.2016.06.008

  30. H. I. Hummel, R. van der Mei, S. Bhulai, A survey on machine learning in ship radiated noise, Ocean Engineering 298 (2024) 117252. doi:10.1016/j.oceaneng.2024.117252

  31. J. Hu, L. Shen, G. Sun, Squeeze-and-Excitation Networks, in: 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2018, pp. 7132–7141. doi:10.1109/CVPR.2018.00745

  32. L. van der Maaten, G. Hinton, Visualizing Data using t-SNE, Journal of Machine Learning Research 9 (86) (2008) 2579–2605.