CNN Models for Microphone Array Covariance Matrix Upsampling and Acoustic Imaging
Pith reviewed 2026-07-03 18:52 UTC · model grok-4.3
The pith
Neural networks can upsample 4-microphone covariance matrices to produce acoustic images matching a 32-microphone array.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Convolutional models can learn a mapping from the 4-channel covariance representation to the full 32-channel covariance matrix such that delay-and-sum beamforming on the estimated matrix produces sound source maps that closely resemble the maps obtained from the actual 32-channel array.
What carries the argument
2D convolutional layers with frequency dynamic convolution that extract spatial-spectral structure and frequency-dependent properties from covariance matrices.
If this is right
- Covariance upsampling raises the effective spatial resolution of a 4-channel array toward that of a 32-channel array.
- Beamforming images computed from the estimated 32-channel matrices visually match those from the real 32-channel array.
- All five tested architectures reduce RMSE relative to the random-guess baseline of 0.548.
- The best architecture reaches an RMSE of 0.432 on the test data.
Where Pith is reading between the lines
- The approach could enable higher-resolution acoustic imaging on devices that can only carry a small number of microphones.
- If the same network generalizes across array shapes, it would reduce the need to redesign hardware for each new application.
- Real-time versions of these models could support live sound-source tracking on resource-limited platforms.
Load-bearing premise
The mapping learned from STARSS23 training recordings will produce covariance estimates that remain accurate for beamforming on acoustic scenes and array placements never seen during training.
What would settle it
Apply the best trained model to a held-out recording set from a different acoustic environment or with a different 4-microphone geometry and measure whether the RMSE stays below 0.45 and whether the resulting beamforming heatmaps still match the 32-channel ground truth.
Figures
read the original abstract
Acoustic imaging visualization is a core methodology in acoustics, enabling spatial analysis of sound sources and acoustic scenes. However, limited sensor availability in practical systems motivate approaches that enhance spatial resolution without increasing the hardware complexity. In this paper, we focus on upsampling virtually a tetrahedral 4-microphone array to a spherical 32-microphone array by estimating the covariance matrices of the channels employing deep learning techniques. Five neural network architectures are investigated for covariance upsampling for acoustic imaging using the real-world STARSS23 dataset. These models are developed to estimate a 32-microphone, time-frequency covariance matrix from a 4-microphone input covariance representation. The proposed architectures are based on 2D convolutional layers to capture the underlying spatial-spectral structure of covariance matrices, and are further enhanced with frequency dynamic convolution to model their frequency-dependent properties. The proposed architectures are evaluated in terms of root mean square error (RMSE) and using delay-and-sum beamforming acoustic imaging. Quantitative results show that all models outperform a random-guess baseline, which yields an RMSE of 0.548, with the best-performing architecture achieving an RMSE of 0.432. We analyze qualitatively the performance of the proposed models through beamforming heatmap visualizations derived from the 4-channel input covariance, the 32-channel ground truth, and the predicted 32-channel covariance matrices. These results demonstrate that covariance upsampling significantly enhances the effective performance of the 4-channel microphone array, producing sound maps that closely resemble those obtained with the 32-channel array.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper investigates five CNN architectures (2D convolutional layers augmented with frequency dynamic convolution) to upsample time-frequency covariance matrices from a 4-channel tetrahedral microphone array to a 32-channel spherical array. Using the STARSS23 real-world dataset, the models are trained to predict the larger covariance from the smaller input; evaluation uses RMSE against ground-truth 32-channel covariances and qualitative comparison of delay-and-sum beamforming acoustic images. The abstract reports that all models beat a random baseline (RMSE 0.548), with the best model reaching 0.432, and that the resulting beamforming maps closely resemble those from the full 32-channel array.
Significance. If the reported RMSE reduction and beamforming agreement prove robust, the work would demonstrate a practical route to increasing effective spatial resolution in acoustic imaging without additional hardware. The use of a public real-world dataset and direct beamforming evaluation are positive aspects. However, the absence of cross-dataset or cross-geometry validation means the significance is currently limited to the specific STARSS23 tetrahedral-to-spherical setting.
major comments (2)
- [Experiments / Results] Experiments / Results sections: the central quantitative claim (RMSE 0.432 vs. 0.548 random baseline) is presented without any description of the train/validation/test split ratios, number of scenes or time-frequency bins, regularization strategy, or statistical significance testing across runs. This information is load-bearing for interpreting whether the improvement reflects genuine upsampling capability or dataset-specific fitting.
- [Evaluation] Evaluation section: all reported results are confined to held-out portions of STARSS23 with fixed tetrahedral 4-mic to spherical 32-mic geometry. No experiments on other datasets, different room acoustics, or altered array placements are described, leaving the generalization assumption stated in the abstract untested and therefore weakening the claim that covariance upsampling “significantly enhances the effective performance of the 4-channel microphone array” in general.
minor comments (2)
- [Abstract] Abstract: the construction of the “random-guess baseline” (RMSE 0.548) is not defined; a brief statement of how the random covariance matrices are generated would improve clarity.
- [Figures] Figure captions for the beamforming heatmaps should explicitly label each panel as 4-channel input, 32-channel ground truth, or model prediction to facilitate direct visual comparison.
Simulated Author's Rebuttal
We thank the referee for the constructive comments. We address each major point below and will revise the manuscript to improve clarity and scope where possible.
read point-by-point responses
-
Referee: [Experiments / Results] Experiments / Results sections: the central quantitative claim (RMSE 0.432 vs. 0.548 random baseline) is presented without any description of the train/validation/test split ratios, number of scenes or time-frequency bins, regularization strategy, or statistical significance testing across runs. This information is load-bearing for interpreting whether the improvement reflects genuine upsampling capability or dataset-specific fitting.
Authors: We agree that these details are essential and were omitted. In the revised manuscript we will expand the Experiments section to report the train/validation/test split (70/15/15 on STARSS23 scenes), the number of scenes and TF bins processed, the regularization approach (L2 weight decay of 1e-5 together with dropout of 0.3), and results averaged over five random seeds with standard deviations and paired t-test p-values against the baseline. revision: yes
-
Referee: [Evaluation] Evaluation section: all reported results are confined to held-out portions of STARSS23 with fixed tetrahedral 4-mic to spherical 32-mic geometry. No experiments on other datasets, different room acoustics, or altered array placements are described, leaving the generalization assumption stated in the abstract untested and therefore weakening the claim that covariance upsampling “significantly enhances the effective performance of the 4-channel microphone array” in general.
Authors: We acknowledge the limitation. We will revise the abstract to qualify the performance claim as applying to the STARSS23 tetrahedral-to-spherical setting and will add a dedicated paragraph in the Discussion section that explicitly states the absence of cross-dataset or cross-geometry validation. Because new experiments on additional datasets lie outside the scope of the present study, we cannot supply such results in the revision. revision: partial
- Cross-dataset or cross-geometry experiments, which would require new data acquisition and training runs not feasible within the revision period.
Circularity Check
No circularity; standard empirical ML evaluation on held-out test data from public dataset
full rationale
The paper trains CNN architectures on the STARSS23 training split to map 4-channel covariance matrices to 32-channel estimates, then reports RMSE and beamforming results on the held-out test split. No equations, derivations, or self-citations are presented that reduce the reported test RMSE (0.432 vs. random baseline 0.548) to a fitted parameter or input by construction. Evaluation uses independent test data and external metrics (RMSE, delay-and-sum imaging), satisfying the criteria for non-circular empirical work.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption The covariance matrices of the 4-mic and 32-mic arrays share a learnable spatial-spectral structure that can be recovered by 2D CNNs.
Reference graph
Works this paper leans on
-
[1]
T. Rittenschober and R. Karrer, “High-speed acoustic imaging for the localization of impulse-like sound emissions from automotive com- ponents,” in13th International Styrian Noise, Vibration & Harshness Congress: The European Automotive Noise Conference, 2024
work page 2024
-
[2]
R. Merino-Mart ´ınez, P. Sijtsma, M. Snellen, T. Ahlefeldt, J. Antoni, C. J. Bahr, D. Blacodon, D. Ernst, A. Finez, S. Funkeet al., “A review of acoustic imaging methods using phased microphone arrays: Part of the “aircraft noise generation and assessment” special issue,”CEAS Aeronautical Journal, vol. 10, no. 1, pp. 197–230, 2019
work page 2019
-
[3]
Deciphering complex coral reef soundscapes with spatial audio and 360° video,
M. S. Dantzker, M. T. Duggan, E. Berlik, S. Delikaris-Manias, V . Boun- tourakis, V . Pulkki, and A. N. Rice, “Deciphering complex coral reef soundscapes with spatial audio and 360° video,”Methods in Ecology and Evolution, vol. 16, no. 11, pp. 2622–2637, 2025
work page 2025
-
[4]
Acoustic-signal-based damage detec- tion of wind turbine blades—a review,
S. Ding, C. Yang, and S. Zhang, “Acoustic-signal-based damage detec- tion of wind turbine blades—a review,”Sensors, vol. 23, no. 11, p. 4987, 2023
work page 2023
-
[5]
Partial discharge detection using acoustic camera,
J. Pihera, J. Hornak, P. Trnka, O. Turecek, L. Zuzjak, K. Saksela, J. Nyberg, and R. Albrecht, “Partial discharge detection using acoustic camera,” in2020 IEEE 3rd International Conference on Dielectrics (ICD). IEEE, 2020, pp. 830–833
work page 2020
-
[6]
Usage of acoustic camera for condition monitoring of electric motors,
M. Orman and C. T. Pinto, “Usage of acoustic camera for condition monitoring of electric motors,” in2013 IEEE International Conference of IEEE Region 10 (TENCON 2013). IEEE, 2013, pp. 1–4
work page 2013
-
[7]
Z. Xu and Z. Chen, “Investigation of noise characteristics of electric vibrator utilizing acoustic camera and transfer path analysis,”Measure- ment Science and Technology, vol. 35, no. 11, p. 116012, 2024
work page 2024
-
[8]
Use of acoustic camera for noise sources localization and noise reduction in the industrial plant,
W. Fiebig and D. Dabrowski, “Use of acoustic camera for noise sources localization and noise reduction in the industrial plant,”Archives of Acoustics, vol. 45, no. 1, pp. 111–117, 2020
work page 2020
-
[10]
F. Miotello, F. Terminiello, M. Pezzoli, A. Bernardini, F. Antonacci, and A. Sarti, “A physics-informed neural network-based approach for the spatial upsampling of spherical microphone arrays,” in2024 18th International Workshop on Acoustic Signal Enhancement (IWAENC), 2024, pp. 215–219
work page 2024
-
[11]
T. F. Brooks and W. M. Humphreys, “A deconvolution approach for the mapping of acoustic sources (damas) determined from phased microphone arrays,”Journal of sound and vibration, vol. 294, no. 4- 5, pp. 856–879, 2006
work page 2006
-
[12]
Multiple emitter location and signal parameter estimation,
R. Schmidt, “Multiple emitter location and signal parameter estimation,” IEEE transactions on antennas and propagation, vol. 34, no. 3, pp. 276– 280, 1986
work page 1986
-
[13]
R. P. Dougherty, “Functional beamforming,” in5th Berlin beamforming conference. GFaI, eV Berlin, 2014, pp. 19–20
work page 2014
-
[14]
L. McCormack, A. Politis, and V . Pulkki, “Sharpening of angular spectra based on a directional re-assignment approach for ambisonic sound-field visualisation,” inICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2019, pp. 576–580
work page 2019
-
[15]
The application of compressive sampling to the analysis and synthesis of spatial sound fields,
N. Epain, C. Jin, and A. Van Schaik, “The application of compressive sampling to the analysis and synthesis of spatial sound fields,” inAudio Engineering Society Convention 127. Audio Engineering Society, 2009
work page 2009
-
[16]
C. D. Salvador, S. Sakamoto, J. Trevi ˜no, and Y . Suzuki, “Enhancing bin- aural reconstruction from rigid circular microphone array recordings by using virtual microphones,” inAudio Engineering Society Conference: 2018 AES International Conference on Audio for Virtual and Augmented Reality. Audio Engineering Society, 2018
work page 2018
-
[17]
Spatial upsampling of sparse spherical microphone array signals,
T. L ¨ubeck, J. M. Arend, and C. P ¨orschmann, “Spatial upsampling of sparse spherical microphone array signals,”IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 31, pp. 1163–1174, 2023
work page 2023
-
[18]
Upscaling ambisonic sound scenes using compressed sensing techniques,
A. Wabnitz, N. Epain, A. McEwan, and C. Jin, “Upscaling ambisonic sound scenes using compressed sensing techniques,” in2011 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA). IEEE, 2011
work page 2011
-
[19]
Deep-sound field analysis for upscaling ambisonic signals,
G. Routray, S. Basu, P. Baldev, and R. M. Hegde, “Deep-sound field analysis for upscaling ambisonic signals,” inEAA Spatial Audio Signal Processing Symposium, 2019
work page 2019
-
[20]
Higher-order ambisonics upscaling using gated recurrent units,
E. Chatzimoustafa and P. Jax, “Higher-order ambisonics upscaling using gated recurrent units,” in2025 33st European Signal Processing Conference (EUSIPCO), 2025
work page 2025
-
[21]
Robust doa estimation from deep acoustic imaging,
A. S. Roman, I. R. Roman, and J. P. Bello, “Robust doa estimation from deep acoustic imaging,” inICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2024, pp. 1321–1325
work page 2024
-
[22]
Deep back-projection networks for super-resolution,
M. Haris, G. Shakhnarovich, and N. Ukita, “Deep back-projection networks for super-resolution,” inIEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018
work page 2018
-
[23]
Deepwave: a recurrent neural-network for real-time acoustic imaging,
M. Simeoni, S. Kashani, P. Hurley, and M. Vetterli, “Deepwave: a recurrent neural-network for real-time acoustic imaging,”Advances In Neural Information Processing Systems, vol. 32, 2019
work page 2019
-
[24]
Latent acoustic mapping for direction of arrival estimation: A self-supervised approach,
A. S. Roman, I. R. Roman, and J. P. Bello, “Latent acoustic mapping for direction of arrival estimation: A self-supervised approach,” in2025 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), 2025
work page 2025
-
[25]
Neural ambisonics encoding for compact irregular microphone arrays,
M. Heikkinen, A. Politis, and T. Virtanen, “Neural ambisonics encoding for compact irregular microphone arrays,” inICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2024, pp. 701–705
work page 2024
-
[26]
K. Shimada, A. Politis, P. Sudarsanam, D. A. Krause, K. Uchida, S. Adavanne, A. Hakala, Y . Koyama, N. Takahashi, S. Takahashiet al., “STARSS23: An audio-visual dataset of spatial recordings of real scenes with spatiotemporal annotations of sound events,”Advances in Neural Information Processing Systems, vol. 36, pp. 72 931–72 957, 2023
work page 2023
-
[27]
Diaz-Guerra, A. Politis, K. S. Parthasaarathy Sudarsanam, D. A. Krause, K. Uchida, N. T. Yuichiro Koyama, S. Takahashi, T. Shibuya, Y . Mit- sufuji, and T. Virtanen, “Baseline models and evaluation of sound event localization and detection with distance estimation in DCASE2024 challenge,” inWorkshop on Detection and Classification of Acoustic Scenes and E...
work page 2024
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.