pith. sign in

arxiv: 2607.01295 · v1 · pith:525APJNXnew · submitted 2026-07-01 · 📡 eess.AS · cs.LG· cs.SD· eess.SP

CNN Models for Microphone Array Covariance Matrix Upsampling and Acoustic Imaging

Pith reviewed 2026-07-03 18:52 UTC · model grok-4.3

classification 📡 eess.AS cs.LGcs.SDeess.SP
keywords covariance matrix upsamplingmicrophone arrayacoustic imagingconvolutional neural networksbeamformingdelay-and-sumSTARSS23
0
0 comments X

The pith

Neural networks can upsample 4-microphone covariance matrices to produce acoustic images matching a 32-microphone array.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests five convolutional neural network architectures that take the covariance matrix from a tetrahedral 4-microphone array and estimate the covariance matrix of a spherical 32-microphone array. The goal is to improve spatial resolution in acoustic imaging while keeping the physical hardware small. The models use 2D convolutions to capture spatial-spectral patterns in the matrices and frequency dynamic convolution to handle frequency dependence. On the STARSS23 dataset the networks beat a random baseline in root-mean-square error, and delay-and-sum beamforming on the estimated matrices yields sound maps visually close to those from the full 32-channel array.

Core claim

Convolutional models can learn a mapping from the 4-channel covariance representation to the full 32-channel covariance matrix such that delay-and-sum beamforming on the estimated matrix produces sound source maps that closely resemble the maps obtained from the actual 32-channel array.

What carries the argument

2D convolutional layers with frequency dynamic convolution that extract spatial-spectral structure and frequency-dependent properties from covariance matrices.

If this is right

  • Covariance upsampling raises the effective spatial resolution of a 4-channel array toward that of a 32-channel array.
  • Beamforming images computed from the estimated 32-channel matrices visually match those from the real 32-channel array.
  • All five tested architectures reduce RMSE relative to the random-guess baseline of 0.548.
  • The best architecture reaches an RMSE of 0.432 on the test data.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The approach could enable higher-resolution acoustic imaging on devices that can only carry a small number of microphones.
  • If the same network generalizes across array shapes, it would reduce the need to redesign hardware for each new application.
  • Real-time versions of these models could support live sound-source tracking on resource-limited platforms.

Load-bearing premise

The mapping learned from STARSS23 training recordings will produce covariance estimates that remain accurate for beamforming on acoustic scenes and array placements never seen during training.

What would settle it

Apply the best trained model to a held-out recording set from a different acoustic environment or with a different 4-microphone geometry and measure whether the RMSE stays below 0.45 and whether the resulting beamforming heatmaps still match the 32-channel ground truth.

Figures

Figures reproduced from arXiv: 2607.01295 by Archontis Politis, David Diaz-Guerra, Jan Lundgren, Marianthi Adamopoulou, Meng Jiang, Parthasaarathy Sudarsanam, Seyed Jalaleddin Mousavirad, Tuomas Virtanen.

Figure 1
Figure 1. Figure 1: model architectures. (a) Base CNN, (b) Expanded CNN, (c) Hybrid FDC-CNN Base, (d) Hybrid FDC-CNN Expanded, [PITH_FULL_IMAGE:figures/full_fig_p003_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Frequency average of the SCMs for a single time frame. Left is the 4-channel input SCM, in the middle is the ground [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: RMSE loss per frequency (log. scale) for each proposed [PITH_FULL_IMAGE:figures/full_fig_p005_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Beamformed sound heatmaps for (a) 4-channel, (b) [PITH_FULL_IMAGE:figures/full_fig_p005_4.png] view at source ↗
read the original abstract

Acoustic imaging visualization is a core methodology in acoustics, enabling spatial analysis of sound sources and acoustic scenes. However, limited sensor availability in practical systems motivate approaches that enhance spatial resolution without increasing the hardware complexity. In this paper, we focus on upsampling virtually a tetrahedral 4-microphone array to a spherical 32-microphone array by estimating the covariance matrices of the channels employing deep learning techniques. Five neural network architectures are investigated for covariance upsampling for acoustic imaging using the real-world STARSS23 dataset. These models are developed to estimate a 32-microphone, time-frequency covariance matrix from a 4-microphone input covariance representation. The proposed architectures are based on 2D convolutional layers to capture the underlying spatial-spectral structure of covariance matrices, and are further enhanced with frequency dynamic convolution to model their frequency-dependent properties. The proposed architectures are evaluated in terms of root mean square error (RMSE) and using delay-and-sum beamforming acoustic imaging. Quantitative results show that all models outperform a random-guess baseline, which yields an RMSE of 0.548, with the best-performing architecture achieving an RMSE of 0.432. We analyze qualitatively the performance of the proposed models through beamforming heatmap visualizations derived from the 4-channel input covariance, the 32-channel ground truth, and the predicted 32-channel covariance matrices. These results demonstrate that covariance upsampling significantly enhances the effective performance of the 4-channel microphone array, producing sound maps that closely resemble those obtained with the 32-channel array.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper investigates five CNN architectures (2D convolutional layers augmented with frequency dynamic convolution) to upsample time-frequency covariance matrices from a 4-channel tetrahedral microphone array to a 32-channel spherical array. Using the STARSS23 real-world dataset, the models are trained to predict the larger covariance from the smaller input; evaluation uses RMSE against ground-truth 32-channel covariances and qualitative comparison of delay-and-sum beamforming acoustic images. The abstract reports that all models beat a random baseline (RMSE 0.548), with the best model reaching 0.432, and that the resulting beamforming maps closely resemble those from the full 32-channel array.

Significance. If the reported RMSE reduction and beamforming agreement prove robust, the work would demonstrate a practical route to increasing effective spatial resolution in acoustic imaging without additional hardware. The use of a public real-world dataset and direct beamforming evaluation are positive aspects. However, the absence of cross-dataset or cross-geometry validation means the significance is currently limited to the specific STARSS23 tetrahedral-to-spherical setting.

major comments (2)
  1. [Experiments / Results] Experiments / Results sections: the central quantitative claim (RMSE 0.432 vs. 0.548 random baseline) is presented without any description of the train/validation/test split ratios, number of scenes or time-frequency bins, regularization strategy, or statistical significance testing across runs. This information is load-bearing for interpreting whether the improvement reflects genuine upsampling capability or dataset-specific fitting.
  2. [Evaluation] Evaluation section: all reported results are confined to held-out portions of STARSS23 with fixed tetrahedral 4-mic to spherical 32-mic geometry. No experiments on other datasets, different room acoustics, or altered array placements are described, leaving the generalization assumption stated in the abstract untested and therefore weakening the claim that covariance upsampling “significantly enhances the effective performance of the 4-channel microphone array” in general.
minor comments (2)
  1. [Abstract] Abstract: the construction of the “random-guess baseline” (RMSE 0.548) is not defined; a brief statement of how the random covariance matrices are generated would improve clarity.
  2. [Figures] Figure captions for the beamforming heatmaps should explicitly label each panel as 4-channel input, 32-channel ground truth, or model prediction to facilitate direct visual comparison.

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for the constructive comments. We address each major point below and will revise the manuscript to improve clarity and scope where possible.

read point-by-point responses
  1. Referee: [Experiments / Results] Experiments / Results sections: the central quantitative claim (RMSE 0.432 vs. 0.548 random baseline) is presented without any description of the train/validation/test split ratios, number of scenes or time-frequency bins, regularization strategy, or statistical significance testing across runs. This information is load-bearing for interpreting whether the improvement reflects genuine upsampling capability or dataset-specific fitting.

    Authors: We agree that these details are essential and were omitted. In the revised manuscript we will expand the Experiments section to report the train/validation/test split (70/15/15 on STARSS23 scenes), the number of scenes and TF bins processed, the regularization approach (L2 weight decay of 1e-5 together with dropout of 0.3), and results averaged over five random seeds with standard deviations and paired t-test p-values against the baseline. revision: yes

  2. Referee: [Evaluation] Evaluation section: all reported results are confined to held-out portions of STARSS23 with fixed tetrahedral 4-mic to spherical 32-mic geometry. No experiments on other datasets, different room acoustics, or altered array placements are described, leaving the generalization assumption stated in the abstract untested and therefore weakening the claim that covariance upsampling “significantly enhances the effective performance of the 4-channel microphone array” in general.

    Authors: We acknowledge the limitation. We will revise the abstract to qualify the performance claim as applying to the STARSS23 tetrahedral-to-spherical setting and will add a dedicated paragraph in the Discussion section that explicitly states the absence of cross-dataset or cross-geometry validation. Because new experiments on additional datasets lie outside the scope of the present study, we cannot supply such results in the revision. revision: partial

standing simulated objections not resolved
  • Cross-dataset or cross-geometry experiments, which would require new data acquisition and training runs not feasible within the revision period.

Circularity Check

0 steps flagged

No circularity; standard empirical ML evaluation on held-out test data from public dataset

full rationale

The paper trains CNN architectures on the STARSS23 training split to map 4-channel covariance matrices to 32-channel estimates, then reports RMSE and beamforming results on the held-out test split. No equations, derivations, or self-citations are presented that reduce the reported test RMSE (0.432 vs. random baseline 0.548) to a fitted parameter or input by construction. Evaluation uses independent test data and external metrics (RMSE, delay-and-sum imaging), satisfying the criteria for non-circular empirical work.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Only the abstract is available; no explicit free parameters, axioms, or invented entities are stated beyond standard supervised learning assumptions.

axioms (1)
  • domain assumption The covariance matrices of the 4-mic and 32-mic arrays share a learnable spatial-spectral structure that can be recovered by 2D CNNs.
    Implicit in the choice of 2D convolutional architecture for the upsampling task.

pith-pipeline@v0.9.1-grok · 5850 in / 1317 out tokens · 27936 ms · 2026-07-03T18:52:59.256424+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

26 extracted references · 26 canonical work pages

  1. [1]

    High-speed acoustic imaging for the localization of impulse-like sound emissions from automotive com- ponents,

    T. Rittenschober and R. Karrer, “High-speed acoustic imaging for the localization of impulse-like sound emissions from automotive com- ponents,” in13th International Styrian Noise, Vibration & Harshness Congress: The European Automotive Noise Conference, 2024

  2. [2]

    A review of acoustic imaging methods using phased microphone arrays: Part of the “aircraft noise generation and assessment

    R. Merino-Mart ´ınez, P. Sijtsma, M. Snellen, T. Ahlefeldt, J. Antoni, C. J. Bahr, D. Blacodon, D. Ernst, A. Finez, S. Funkeet al., “A review of acoustic imaging methods using phased microphone arrays: Part of the “aircraft noise generation and assessment” special issue,”CEAS Aeronautical Journal, vol. 10, no. 1, pp. 197–230, 2019

  3. [3]

    Deciphering complex coral reef soundscapes with spatial audio and 360° video,

    M. S. Dantzker, M. T. Duggan, E. Berlik, S. Delikaris-Manias, V . Boun- tourakis, V . Pulkki, and A. N. Rice, “Deciphering complex coral reef soundscapes with spatial audio and 360° video,”Methods in Ecology and Evolution, vol. 16, no. 11, pp. 2622–2637, 2025

  4. [4]

    Acoustic-signal-based damage detec- tion of wind turbine blades—a review,

    S. Ding, C. Yang, and S. Zhang, “Acoustic-signal-based damage detec- tion of wind turbine blades—a review,”Sensors, vol. 23, no. 11, p. 4987, 2023

  5. [5]

    Partial discharge detection using acoustic camera,

    J. Pihera, J. Hornak, P. Trnka, O. Turecek, L. Zuzjak, K. Saksela, J. Nyberg, and R. Albrecht, “Partial discharge detection using acoustic camera,” in2020 IEEE 3rd International Conference on Dielectrics (ICD). IEEE, 2020, pp. 830–833

  6. [6]

    Usage of acoustic camera for condition monitoring of electric motors,

    M. Orman and C. T. Pinto, “Usage of acoustic camera for condition monitoring of electric motors,” in2013 IEEE International Conference of IEEE Region 10 (TENCON 2013). IEEE, 2013, pp. 1–4

  7. [7]

    Investigation of noise characteristics of electric vibrator utilizing acoustic camera and transfer path analysis,

    Z. Xu and Z. Chen, “Investigation of noise characteristics of electric vibrator utilizing acoustic camera and transfer path analysis,”Measure- ment Science and Technology, vol. 35, no. 11, p. 116012, 2024

  8. [8]

    Use of acoustic camera for noise sources localization and noise reduction in the industrial plant,

    W. Fiebig and D. Dabrowski, “Use of acoustic camera for noise sources localization and noise reduction in the industrial plant,”Archives of Acoustics, vol. 45, no. 1, pp. 111–117, 2020

  9. [10]

    A physics-informed neural network-based approach for the spatial upsampling of spherical microphone arrays,

    F. Miotello, F. Terminiello, M. Pezzoli, A. Bernardini, F. Antonacci, and A. Sarti, “A physics-informed neural network-based approach for the spatial upsampling of spherical microphone arrays,” in2024 18th International Workshop on Acoustic Signal Enhancement (IWAENC), 2024, pp. 215–219

  10. [11]

    A deconvolution approach for the mapping of acoustic sources (damas) determined from phased microphone arrays,

    T. F. Brooks and W. M. Humphreys, “A deconvolution approach for the mapping of acoustic sources (damas) determined from phased microphone arrays,”Journal of sound and vibration, vol. 294, no. 4- 5, pp. 856–879, 2006

  11. [12]

    Multiple emitter location and signal parameter estimation,

    R. Schmidt, “Multiple emitter location and signal parameter estimation,” IEEE transactions on antennas and propagation, vol. 34, no. 3, pp. 276– 280, 1986

  12. [13]

    Functional beamforming,

    R. P. Dougherty, “Functional beamforming,” in5th Berlin beamforming conference. GFaI, eV Berlin, 2014, pp. 19–20

  13. [14]

    Sharpening of angular spectra based on a directional re-assignment approach for ambisonic sound-field visualisation,

    L. McCormack, A. Politis, and V . Pulkki, “Sharpening of angular spectra based on a directional re-assignment approach for ambisonic sound-field visualisation,” inICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2019, pp. 576–580

  14. [15]

    The application of compressive sampling to the analysis and synthesis of spatial sound fields,

    N. Epain, C. Jin, and A. Van Schaik, “The application of compressive sampling to the analysis and synthesis of spatial sound fields,” inAudio Engineering Society Convention 127. Audio Engineering Society, 2009

  15. [16]

    Enhancing bin- aural reconstruction from rigid circular microphone array recordings by using virtual microphones,

    C. D. Salvador, S. Sakamoto, J. Trevi ˜no, and Y . Suzuki, “Enhancing bin- aural reconstruction from rigid circular microphone array recordings by using virtual microphones,” inAudio Engineering Society Conference: 2018 AES International Conference on Audio for Virtual and Augmented Reality. Audio Engineering Society, 2018

  16. [17]

    Spatial upsampling of sparse spherical microphone array signals,

    T. L ¨ubeck, J. M. Arend, and C. P ¨orschmann, “Spatial upsampling of sparse spherical microphone array signals,”IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 31, pp. 1163–1174, 2023

  17. [18]

    Upscaling ambisonic sound scenes using compressed sensing techniques,

    A. Wabnitz, N. Epain, A. McEwan, and C. Jin, “Upscaling ambisonic sound scenes using compressed sensing techniques,” in2011 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA). IEEE, 2011

  18. [19]

    Deep-sound field analysis for upscaling ambisonic signals,

    G. Routray, S. Basu, P. Baldev, and R. M. Hegde, “Deep-sound field analysis for upscaling ambisonic signals,” inEAA Spatial Audio Signal Processing Symposium, 2019

  19. [20]

    Higher-order ambisonics upscaling using gated recurrent units,

    E. Chatzimoustafa and P. Jax, “Higher-order ambisonics upscaling using gated recurrent units,” in2025 33st European Signal Processing Conference (EUSIPCO), 2025

  20. [21]

    Robust doa estimation from deep acoustic imaging,

    A. S. Roman, I. R. Roman, and J. P. Bello, “Robust doa estimation from deep acoustic imaging,” inICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2024, pp. 1321–1325

  21. [22]

    Deep back-projection networks for super-resolution,

    M. Haris, G. Shakhnarovich, and N. Ukita, “Deep back-projection networks for super-resolution,” inIEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018

  22. [23]

    Deepwave: a recurrent neural-network for real-time acoustic imaging,

    M. Simeoni, S. Kashani, P. Hurley, and M. Vetterli, “Deepwave: a recurrent neural-network for real-time acoustic imaging,”Advances In Neural Information Processing Systems, vol. 32, 2019

  23. [24]

    Latent acoustic mapping for direction of arrival estimation: A self-supervised approach,

    A. S. Roman, I. R. Roman, and J. P. Bello, “Latent acoustic mapping for direction of arrival estimation: A self-supervised approach,” in2025 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), 2025

  24. [25]

    Neural ambisonics encoding for compact irregular microphone arrays,

    M. Heikkinen, A. Politis, and T. Virtanen, “Neural ambisonics encoding for compact irregular microphone arrays,” inICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2024, pp. 701–705

  25. [26]

    STARSS23: An audio-visual dataset of spatial recordings of real scenes with spatiotemporal annotations of sound events,

    K. Shimada, A. Politis, P. Sudarsanam, D. A. Krause, K. Uchida, S. Adavanne, A. Hakala, Y . Koyama, N. Takahashi, S. Takahashiet al., “STARSS23: An audio-visual dataset of spatial recordings of real scenes with spatiotemporal annotations of sound events,”Advances in Neural Information Processing Systems, vol. 36, pp. 72 931–72 957, 2023

  26. [27]

    Baseline models and evaluation of sound event localization and detection with distance estimation in DCASE2024 challenge,

    Diaz-Guerra, A. Politis, K. S. Parthasaarathy Sudarsanam, D. A. Krause, K. Uchida, N. T. Yuichiro Koyama, S. Takahashi, T. Shibuya, Y . Mit- sufuji, and T. Virtanen, “Baseline models and evaluation of sound event localization and detection with distance estimation in DCASE2024 challenge,” inWorkshop on Detection and Classification of Acoustic Scenes and E...