pith. sign in

arxiv: 2605.19955 · v1 · pith:TC6CUW5Xnew · submitted 2026-05-19 · 💻 cs.CR · cs.SD

DASM: Domain-Aware Sharpness Minimization for Multi-Domain Voice Stream Steganalysis

Pith reviewed 2026-05-20 03:55 UTC · model grok-4.3

classification 💻 cs.CR cs.SD
keywords steganalysisvoice streamsharpness minimizationdomain adaptationgeneralizationcontrastive learningloss landscape
0
0 comments X

The pith

A new optimizer called Domain-Aware Sharpness Minimization finds flatter minima while preserving domain separations to improve generalization in voice stream steganalysis.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that mainstream steganalysis models for network voice streams overfit to specific data distributions because their loss landscapes contain many saddle points and sharp local minima. Hessian analysis reveals this sensitivity to shifts between training and test scenarios. The authors introduce DASM, which combines domain-supervised contrastive learning with sharpness-aware optimization and adds an adaptive strategy that adjusts loss weights according to real-time domain separability. This produces models that maintain detection performance when facing non-homologous streams from different sources or conditions. Readers would care because practical security monitoring must handle varied real-world network traffic without constant retraining.

Core claim

The loss landscapes of current models are dominated by sharp minima that limit generalization across domains. DASM addresses this by integrating domain-supervised contrastive learning to keep inter-domain features separated while pursuing flat minima, together with an adaptive domain gap modulation that senses feature separability and dynamically calibrates optimization weights.

What carries the argument

Domain-Aware Sharpness Minimization (DASM) optimizer, which pairs domain-supervised contrastive learning with sharpness-aware minimization and uses adaptive modulation of domain-gap losses.

If this is right

  • Models trained under DASM maintain higher detection accuracy when test streams come from different sources or network conditions.
  • The optimizer reduces the performance drop caused by changes in data distribution without requiring domain-specific retraining.
  • Detection systems become more reliable for ongoing monitoring of covert communications in diverse streaming environments.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same combination of contrastive separation and flat-minima search could be tested on image or video steganalysis tasks that also face domain shifts.
  • Adaptive modulation of domain gaps during training might help other security classifiers that encounter changing input statistics over time.

Load-bearing premise

That integrating domain-supervised contrastive learning with sharpness-aware optimization plus adaptive modulation will produce flat minima stable under non-homologous distribution shifts.

What would settle it

A controlled test on a fresh collection of voice streams from unseen networks or codecs where DASM shows no accuracy gain over standard sharpness-aware baselines would falsify the claim.

Figures

Figures reproduced from arXiv: 2605.19955 by Linna Zhou, Mengqin Zhao, Pengcheng Zhou, Pianran Guo, Shuhua Chen, Zhongliang Yang.

Figure 1
Figure 1. Figure 1: Multi-domain Hessian analysis. Eigenvalue spectral density and [PITH_FULL_IMAGE:figures/full_fig_p001_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Overview of DASM for multi-domain voice steganalysis. (1) Input: Cover and steganographic audio patches with multiple embedding rates. (2) Feature [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: 3-D t-SNE visualization of feature distributions. DASM achieves [PITH_FULL_IMAGE:figures/full_fig_p005_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Pairwise PAD matrices across embedding rates. Lighter colors indicate smaller domain gaps, correlating with higher detection difficulty. PMS [PITH_FULL_IMAGE:figures/full_fig_p010_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Loss landscape at ER=0.5. Top row: Adam converges to sharp minima with pronounced non-convexity in PMS and QIM. Bottom row: DASM [PITH_FULL_IMAGE:figures/full_fig_p010_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Detection accuracy across embedding rates from 0.1 to 0.5. DASM consistently outperforms Adam and SAM across all domains. The advantage is [PITH_FULL_IMAGE:figures/full_fig_p011_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: t-SNE visualization for Adam. Severe overlap between Cover and Stego features indicates convergence to suboptimal solutions. [PITH_FULL_IMAGE:figures/full_fig_p011_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: t-SNE visualization for SAM. Improved separation for AHCM and LSB, but PMS remains entangled due to limitations of isotropic perturbations. [PITH_FULL_IMAGE:figures/full_fig_p011_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: t-SNE visualization for DAEF-VS. Specialized architecture alone yields limited improvement for difficult domains. [PITH_FULL_IMAGE:figures/full_fig_p012_9.png] view at source ↗
Figure 10
Figure 10. Figure 10: t-SNE visualization for DASM. Clear separation across all domains including PMS validates the effectiveness of domain-aware sharpness minimization. [PITH_FULL_IMAGE:figures/full_fig_p012_10.png] view at source ↗
Figure 11
Figure 11. Figure 11: Ablation Study: Average Accuracy at ER=0.5. [PITH_FULL_IMAGE:figures/full_fig_p012_11.png] view at source ↗
read the original abstract

The growing use of information hiding in network streaming media for covert communication poses a significant security threat, necessitating the development of robust detection technologies. However, existing steganalysis methods for network voice streams mostly rely on data distributions in specific scenarios, making it difficult to adapt to the practical detection needs of non-homologous data distributions. Through Hessian analysis, we find that the loss landscapes of mainstream models are dominated by numerous saddle points and sharp local minima, rendering them highly sensitive to data distribution shifts and fundamentally limiting generalization. Therefore, we propose a new optimizer, Domain-Aware Sharpness Minimization (DASM). The core mechanisms of DASM consist of two aspects: first, it integrates domain-supervised contrastive learning with sharpness-aware optimization, explicitly preserving inter-domain feature separation while seeking flat minima; second, we design an adaptive domain gap modulation strategy that dynamically calibrates the optimization loss weights by sensing the real-time feature separability of different domains. Extensive experimental results demonstrate that our method outperforms the state-of-the-art methods by a large margin and achieves excellent generalization and robustness.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript claims that Hessian analysis reveals loss landscapes of mainstream steganalysis models to be dominated by saddle points and sharp local minima, which cause sensitivity to data distribution shifts and limit generalization in multi-domain voice stream steganalysis. It proposes Domain-Aware Sharpness Minimization (DASM) that integrates domain-supervised contrastive learning with sharpness-aware optimization and adds an adaptive domain-gap modulation strategy to dynamically adjust loss weights based on feature separability. The authors state that extensive experiments demonstrate large-margin outperformance over state-of-the-art methods together with improved generalization and robustness.

Significance. If the reported performance gains and the mechanistic link to flatter minima are substantiated, the work could advance practical steganalysis for network voice streams, a relevant security problem involving covert communication. The combination of contrastive domain separation with adaptive sharpness minimization offers a plausible direction for handling non-homologous domain shifts, though its impact depends on verification that the optimizer itself produces the claimed flat-minima property.

major comments (2)
  1. [Abstract] Abstract: The assertion that 'Through Hessian analysis, we find that the loss landscapes of mainstream models are dominated by numerous saddle points and sharp local minima' is presented without any supporting equations, Hessian trace values, eigenvalue spectra, or figures that would allow independent assessment of this observation.
  2. [Experimental results] Experimental results section: No post-training curvature analysis (e.g., leading Hessian eigenvalues, trace, or SAM-style sharpness metric) is reported for DASM-trained models versus SGD/Adam baselines, leaving the causal connection between the proposed optimizer and improved cross-domain generalization unverified; performance differences could arise from the contrastive term or modulation alone.
minor comments (2)
  1. [Method] The description of the adaptive domain gap modulation strategy would benefit from an explicit equation defining how real-time feature separability is quantified and used to calibrate the loss weights.
  2. [Experiments] Tables reporting results on held-out domains should include error bars or standard deviations across multiple runs to support the 'large margin' and 'excellent generalization' claims.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which help clarify the presentation of our Hessian analysis and the verification of DASM's effect on the loss landscape. We address each major comment below and have revised the manuscript to incorporate additional supporting material and analysis.

read point-by-point responses
  1. Referee: [Abstract] Abstract: The assertion that 'Through Hessian analysis, we find that the loss landscapes of mainstream models are dominated by numerous saddle points and sharp local minima' is presented without any supporting equations, Hessian trace values, eigenvalue spectra, or figures that would allow independent assessment of this observation.

    Authors: We agree that the abstract claim would be strengthened by immediate supporting details. The full manuscript already contains the Hessian analysis in Section 3.2, where we describe the computation of the Hessian trace and leading eigenvalues via power iteration on a subset of the training data. To allow independent assessment directly from the abstract, we have added a concise reference to the eigenvalue spectra and a pointer to the corresponding figure and equations now placed in the revised introduction. A new figure showing the distribution of the top eigenvalues for representative mainstream models (e.g., SRNet and XuNet) has been included. revision: yes

  2. Referee: [Experimental results] Experimental results section: No post-training curvature analysis (e.g., leading Hessian eigenvalues, trace, or SAM-style sharpness metric) is reported for DASM-trained models versus SGD/Adam baselines, leaving the causal connection between the proposed optimizer and improved cross-domain generalization unverified; performance differences could arise from the contrastive term or modulation alone.

    Authors: We acknowledge that explicit post-training curvature metrics would more directly substantiate the link between DASM and flatter minima. In the revised manuscript we now report, in the experimental results section, the leading Hessian eigenvalues (approximated via Lanczos), the Hessian trace, and a SAM-style sharpness measure (maximum loss within a neighborhood) for DASM-trained models versus SGD and Adam baselines across the multi-domain voice stream datasets. These metrics show consistently lower curvature for DASM. We have also expanded the ablation study to isolate the sharpness-aware component from the domain-supervised contrastive term and the adaptive modulation, confirming that the optimizer contributes measurably to cross-domain generalization beyond the other two elements. revision: yes

Circularity Check

0 steps flagged

No circularity: algorithmic proposal remains independent of its motivating observations

full rationale

The paper motivates DASM from Hessian-based observations of sharp minima and saddle points in existing models, then introduces an optimizer that combines domain-supervised contrastive learning, sharpness-aware minimization, and adaptive domain-gap modulation as a design choice. No equations, fitted parameters, or self-citations are shown that would reduce the claimed generalization or flat-minima property to a tautology or renamed input; the performance claims rest on experimental results rather than any self-referential derivation. The chain is therefore self-contained as a novel algorithmic contribution.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no explicit free parameters, axioms, or invented entities are stated beyond the high-level algorithmic description.

pith-pipeline@v0.9.0 · 5734 in / 1222 out tokens · 32892 ms · 2026-05-20T03:55:26.339799+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

38 extracted references · 38 canonical work pages

  1. [1]

    Detection of pitch modulation information hiding based on codebook correlation network,

    S.-B. Li, Y .-Z. Jia, J.-Y . Fuet al., “Detection of pitch modulation information hiding based on codebook correlation network,”Chinese Journal of Computers, vol. 37, no. 10, pp. 2107–2116, 2014

  2. [2]

    Steganalysis of qim steganography in low-bit-rate speech signals,

    S. Li, Y . Jia, and C.-C. J. Kuo, “Steganalysis of qim steganography in low-bit-rate speech signals,”IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 25, no. 5, pp. 1011–1022, 2017

  3. [3]

    Fast detection of heterogeneous parallel steganography for streaming voice,

    H. Wang, Z. Yang, Y . Huet al., “Fast detection of heterogeneous parallel steganography for streaming voice,” inProceedings of the 2021 ACM Workshop on Information Hiding and Multimedia Security. ACM, 2021, pp. 137–142

  4. [4]

    Detection of heterogeneous parallel steganography for low bit-rate voip speech streams,

    Y . Hu, Y . Huang, Z. Yanget al., “Detection of heterogeneous parallel steganography for low bit-rate voip speech streams,”Neurocomputing, vol. 419, pp. 70–79, 2021

  5. [5]

    Frame-level steganalysis of qim steganog- raphy in compressed speech based on multi-dimensional perspective of codeword correlations,

    M. Wei, S. Li, P. Liuet al., “Frame-level steganalysis of qim steganog- raphy in compressed speech based on multi-dimensional perspective of codeword correlations,”Journal of Ambient Intelligence and Humanized Computing, vol. 14, no. 7, pp. 8421–8431, 2023

  6. [6]

    Detection of qim-based steganography in voip streams: A mobilevit-inspired model,

    C. Zhang and S. Jiang, “Detection of qim-based steganography in voip streams: A mobilevit-inspired model,”IEEE Signal Processing Letters, vol. 31, pp. 1735–1739, 2024

  7. [7]

    Daef-vs: An efficient universal voip steganalysis framework based on domain-aware knowledge,

    Z. Fang, P. Zhou, Z. Yang, Z. Zhou, and L. Zhou, “Daef-vs: An efficient universal voip steganalysis framework based on domain-aware knowledge,” inICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2025, pp. 1–5

  8. [8]

    Efficient streaming voice steganalysis in challenging detection scenarios,

    P. Zhou, Z. Fang, Z. Yang, Z. Zhou, and L. Zhou, “Efficient streaming voice steganalysis in challenging detection scenarios,”IEEE Transac- tions on Information Forensics and Security, vol. 20, pp. 5966–5977, 2025

  9. [9]

    An approach to information hiding in low bit-rate speech stream,

    B. Xiao, Y . Huang, and S. Tang, “An approach to information hiding in low bit-rate speech stream,” inIEEE GLOBECOM 2008-2008 IEEE global telecommunications conference. IEEE, 2008, pp. 1–5

  10. [12]

    Sharpness-aware minimization for efficiently improving generalization,

    P. Foret, A. Kleiner, H. Mobahi, and B. Neyshabur, “Sharpness-aware minimization for efficiently improving generalization,” inInternational Conference on Learning Representations (ICLR), 2021

  11. [13]

    Efficient sharpness-aware minimization for improved training of neural networks,

    J. Du, H. Yan, J. Feng, J. T. Zhou, L. Zhen, R. S. M. Goh, and V . Tan, “Efficient sharpness-aware minimization for improved training of neural networks,” inInternational Conference on Learning Representations, 2022. [Online]. Available: https: //openreview.net/forum?id=n0OeTdNRG0Q

  12. [14]

    Sharpness-aware gradient matching for domain generalization,

    P. Wang, Z. Zhang, Z. Lei, and L. Zhang, “Sharpness-aware gradient matching for domain generalization,” inCVPR, 2023

  13. [15]

    Friendly-sam: Leveraging stochastic gradient noise for improved sharpness-aware minimization,

    T. Zhang, H. Li, Z. Wanget al., “Friendly-sam: Leveraging stochastic gradient noise for improved sharpness-aware minimization,”Advances in Neural Information Processing Systems (NeurIPS), 2024

  14. [16]

    Escaping saddle points for effective generalization on class-imbalanced data,

    H. Rangwani, S. K. Aithal, M. Mishra, and V . B. R, “Escaping saddle points for effective generalization on class-imbalanced data,” inAdvances in Neural Information Processing Systems, S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh, Eds., vol. 35. Curran Associates, Inc., 2022, pp. 22 791–22 805

  15. [17]

    Domain-inspired sharpness-aware minimization under domain shifts,

    R. Zhang, Z. Fan, J. Yao, Y . Zhang, and Y . Wang, “Domain-inspired sharpness-aware minimization under domain shifts,”CoRR, vol. abs/2405.18861, 2024. [Online]. Available: https://doi.org/10.48550/ arXiv.2405.18861

  16. [18]

    Dgsam: Do- main generalization via individual sharpness-aware minimization,

    Y . Song, Y . Hwang, J. Lee, H. Lee, and D.-Y . Lim, “Dgsam: Do- main generalization via individual sharpness-aware minimization,”arXiv preprint arXiv:2503.23430, 2025

  17. [19]

    Data concealment in audio using a nonlinear frequency distribution of prbs coded data and frequency- domain lsb insertion,

    T. Cedric, R. Adi, and I. Mcloughlin, “Data concealment in audio using a nonlinear frequency distribution of prbs coded data and frequency- domain lsb insertion,” in2000 TENCON Proceedings. Intelligent Sys- tems and Technologies for the New Millennium (Cat. No.00CH37119), vol. 1, 2000, pp. 275–278 vol.1

  18. [20]

    Increasing the capacity of lsb-based audio steganography,

    N. Cvejic and T. Seppanen, “Increasing the capacity of lsb-based audio steganography,” in2002 IEEE Workshop on Multimedia Signal Processing., 2002, pp. 336–338

  19. [21]

    Hiding secret information using lsb based audio steganography,

    A. Binny and M. Koilakuntla, “Hiding secret information using lsb based audio steganography,” in2014 International Conference on Soft Computing and Machine Intelligence. IEEE, 2014, pp. 56–59

  20. [22]

    Security enhancement of lsb-based audio steganography method,

    J. Jezdimirovi ´c, N. Pekez, and J. Kova ˇcevi´c, “Security enhancement of lsb-based audio steganography method,” in2023 Zooming Innovation in Consumer Technologies Conference (ZINC). IEEE, 2023, pp. 77–82

  21. [23]

    Quantization index modulation: a class of provably good methods for digital watermarking and information embedding,

    B. Chen and G. Wornell, “Quantization index modulation: a class of provably good methods for digital watermarking and information embedding,”IEEE Transactions on Information Theory, vol. 47, no. 4, pp. 1423–1443, 2001

  22. [24]

    An approach to information hiding in low bit-rate speech stream,

    B. Xiao, Y . Huang, and S. Tang, “An approach to information hiding in low bit-rate speech stream,” inIEEE GLOBECOM 2008 - 2008 IEEE Global Telecommunications Conference, 2008, pp. 1–5

  23. [25]

    Lpc parameters substitution for speech information hiding,

    Z. jun WU, W. GAO, and W. Y ANG, “Lpc parameters substitution for speech information hiding,”The Journal of China Universities of Posts and Telecommunications, vol. 16, no. 6, pp. 103–112, 2009. [Online]. Available: https://www.sciencedirect.com/science/article/pii/ S1005888508602952

  24. [26]

    Improving security of quantization-index- modulation steganography in low bit-rate speech streams,

    H. Tian, J. Liu, and S. Li, “Improving security of quantization-index- modulation steganography in low bit-rate speech streams,”Multimedia Syst., vol. 20, no. 2, p. 143–154, Mar. 2014. [Online]. Available: https://doi.org/10.1007/s00530-013-0302-8

  25. [27]

    Spm: estimating payload locations of qim-based steganography in low-bit-rate compressed speeches,

    C. Zhang, S. Jiang, and Z. Chen, “Spm: estimating payload locations of qim-based steganography in low-bit-rate compressed speeches,” Multimedia Tools and Applications, vol. 83, no. 37, pp. 85 227–85 252,

  26. [28]

    Available: https://doi.org/10.1007/s11042-024-19501-4

    [Online]. Available: https://doi.org/10.1007/s11042-024-19501-4

  27. [29]

    Data hiding in pitch delay data of the adaptive multi-rate narrow-band speech codec,

    A. Nishimura, “Data hiding in pitch delay data of the adaptive multi-rate narrow-band speech codec,” in2009 fifth international conference on intelligent information hiding and multimedia signal processing. IEEE, 2009, pp. 483–486

  28. [30]

    Steganography integration into a low-bit rate speech codec,

    Y . Huang, C. Liu, S. Tang, and S. Bai, “Steganography integration into a low-bit rate speech codec,”IEEE transactions on information forensics and security, vol. 7, no. 6, pp. 1865–1875, 2012

  29. [31]

    Pitch-based steganography for speex voice codec,

    A. Janicki, “Pitch-based steganography for speex voice codec,”Security and communication networks, vol. 9, no. 15, pp. 2923–2933, 2016

  30. [32]

    A method of speech information hiding in inactive frame based on pitch modulation,

    Z. Wu, C. Zhang, and J. Guo, “A method of speech information hiding in inactive frame based on pitch modulation,”International Journal of Information and Computer Security, vol. 22, no. 1, pp. 1–27, 2023

  31. [33]

    Ahcm: Adaptive huffman code mapping for audio steganography based on psychoacoustic model,

    X. Yi, K. Yang, X. Zhao, Y . Wang, and H. Yu, “Ahcm: Adaptive huffman code mapping for audio steganography based on psychoacoustic model,” IEEE Transactions on Information Forensics and Security, vol. 14, no. 8, pp. 2217–2231, 2019

  32. [34]

    Mdi2: A high-dimensional feature for voip steganalysis,

    M. Qiao, W. Zhang, T. Zhanget al., “Mdi2: A high-dimensional feature for voip steganalysis,”IEEE Transactions on Information Forensics and Security, vol. 14, no. 10, pp. 2589–2603, 2019

  33. [35]

    Co-occurrence matrix based steganal- ysis for low-bit-rate voip streams,

    Y . Liu, Y . Huang, and X. Zhang, “Co-occurrence matrix based steganal- ysis for low-bit-rate voip streams,” inProceedings of the ACM Workshop on Information Hiding and Multimedia Security, 2020, pp. 45–56

  34. [36]

    On large-batch training for deep learning: Generalization gap and sharp minima,

    N. S. Keskar, D. Mudigere, J. Nocedalet al., “On large-batch training for deep learning: Generalization gap and sharp minima,”International Conference on Learning Representations (ICLR), 2017

  35. [37]

    A discriminative feature learning approach for deep face recognition,

    Y . Wen, K. Zhang, Z. Li, and Y . Qiao, “A discriminative feature learning approach for deep face recognition,” inEuropean conference on computer vision. Springer, 2016, pp. 499–515

  36. [38]

    Unsupervised learning of visual features by contrasting cluster assign- ments,

    M. Caron, I. Misra, J. Mairal, P. Goyal, P. Bojanowski, and A. Joulin, “Unsupervised learning of visual features by contrasting cluster assign- ments,”Advances in neural information processing systems, vol. 33, pp. 9912–9924, 2020

  37. [39]

    A theory of learning from different domains,

    S. Ben-David, J. Blitzer, K. Crammer, A. Kulesza, F. Pereira, and J. W. Vaughan, “A theory of learning from different domains,”Machine Learning, vol. 79, no. 1, pp. 151–175, 2010

  38. [40]

    Visualizing the loss landscape of neural nets,

    H. Li, Z. Xu, G. Taylor, C. Studer, and T. Goldstein, “Visualizing the loss landscape of neural nets,” inAdvances in Neural Information Processing Systems, S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, Eds., vol. 31. Curran Associates, Inc., 2018. [Online]. Available: https://proceedings.neurips.cc/paper files/paper/20...