Perceptual Evaluation of Higher-Order Ambisonic Codecs on Both Synthetic Mixing and Native Recordings
Pith reviewed 2026-06-25 22:23 UTC · model grok-4.3
The pith
IVAS codec for higher-order ambisonics outperforms multi-mono encoding at the same bitrate by exploiting inter-channel correlations.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The IVAS codec achieves superior perceptual quality to multi-mono HOA coding at the same bitrate by exploiting inter-channel correlation, with the performance gap largest on signals composed of few plane waves.
What carries the argument
IVAS codec's use of inter-channel correlation to reduce bitrate while preserving perceptual quality in higher-order ambisonics.
If this is right
- IVAS supports lower bitrates for equivalent quality in correlated HOA signals.
- Multi-mono encoding wastes bitrate on highly correlated content such as few-plane-wave scenes.
- IVAS is especially suitable for communication use cases involving limited numbers of sound sources.
- Perceptual tests across synthetic and native material confirm the correlation benefit holds for multiple spatialization methods.
Where Pith is reading between the lines
- Real-time VR and AR systems could adopt IVAS to lower transmission costs without quality loss on typical scene content.
- Codec selection for spatial audio may need to account for expected inter-channel correlation rather than treating all HOA signals uniformly.
- Extending the comparison to higher orders or dynamic scenes would test whether the correlation advantage scales.
Load-bearing premise
The chosen contents, spatialization methods, and listening test conditions represent real-world HOA use in VR and AR applications.
What would settle it
A new listening test on a different set of contents or higher ambisonic orders in which IVAS shows no quality advantage over multi-mono at the same bitrate.
Figures
read the original abstract
Spatial audio is spreading in applications such as virtual and augmented reality and immersive games. The higher-order ambisonic (HOA) format is particularly useful in this context. Transmitting spatial information requires multiple channels, e.g., 16 channels for 3rd-order ambisonics, resulting in increased memory requirements for storage and higher bitrates for communication. Therefore, efficient compression algorithms are necessary for those contents. The recently standardized IVAS codec allows the coding of HOA content for communication use-cases. Here, we propose to evaluate it in comparison with a basic multi-mono approach across a variety of contents and spatialization methods. Results show that IVAS outperforms the multi-mono approach at the same bitrate. In particular, this codec exploits inter-channel correlation to reduce the bitrate. We point out that it is therefore especially robust for signals with a high interchannel correlation, such as those composed of a limited number of plane waves. Conversely, the multi-mono approach is unable to exploit this correlation and performs poorly on this type of signal.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents a perceptual listening test comparing the standardized IVAS codec against a basic multi-mono HOA coding approach at matched bitrates. Using both synthetic mixtures and native recordings, it reports that IVAS yields higher perceptual quality by exploiting inter-channel correlation, with the advantage being largest for content composed of a small number of plane waves.
Significance. If the listening-test outcomes prove robust, the work supplies concrete evidence that correlation-aware HOA codecs can deliver measurable perceptual gains over independent-channel coding at the same bitrate. This is directly relevant to bitrate-constrained spatial-audio delivery in VR/AR and immersive communication.
major comments (1)
- [Abstract] Abstract: the performance result is stated without details on listener count, statistical tests, content selection criteria, or error analysis, so the data-to-claim link cannot be verified.
Simulated Author's Rebuttal
We thank the referee for the constructive comment and for recognizing the relevance of our work to bitrate-constrained spatial audio delivery. We address the single major comment below.
read point-by-point responses
-
Referee: [Abstract] Abstract: the performance result is stated without details on listener count, statistical tests, content selection criteria, or error analysis, so the data-to-claim link cannot be verified.
Authors: We agree that the abstract would be strengthened by including key methodological details. The full manuscript reports a MUSHRA listening test with 12 expert listeners, statistical analysis via repeated-measures ANOVA with post-hoc pairwise comparisons (Bonferroni-corrected), content selected to span synthetic mixtures (controlled plane-wave counts from 1 to 8) and native HOA recordings, and results presented with 95% confidence intervals. We will revise the abstract to concisely incorporate listener count, mention of statistical testing, and a note on content diversity while preserving length constraints. revision: yes
Circularity Check
No significant circularity: purely empirical perceptual comparison
full rationale
The manuscript reports results from listening tests that directly compare perceptual quality of IVAS versus multi-mono HOA coding at matched bitrates across selected contents. No derivation, predictive model, fitted parameters, or mathematical claim is advanced whose output is asserted to follow from the inputs by construction. No equations, ansatzes, or uniqueness theorems appear; the central observation that IVAS exploits inter-channel correlation is presented as an empirical finding from the test data rather than a self-referential reduction. The work is therefore self-contained against external benchmarks with no load-bearing circular steps.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
Direct Comparison of the Impact of Head Tracking, Reverberation, and Individualized Head-Related Transfer Functions on the Spatial Perception of a Virtual Speech Source,
D. R. Begault, “Direct Comparison of the Impact of Head Tracking, Reverberation, and Individualized Head-Related Transfer Functions on the Spatial Perception of a Virtual Speech Source,”J. Audio Eng. Soc., vol. 49, no. 10, 2001
2001
-
[2]
Minimum BRIR grid resolution for dynamic binaural synthesis,
A. Lindau, H.-J. Maempel, and S. Weinzierl, “Minimum BRIR grid resolution for dynamic binaural synthesis,”J. Acous. Soc. America, vol. 123, May 2008
2008
-
[3]
Daniel,Repr ´esentation de champs acoustiques, appli- cation `a la transmission et `a la reproduction de sc `enes sonores complexes dans un contexte multim ´edia
J. Daniel,Repr ´esentation de champs acoustiques, appli- cation `a la transmission et `a la reproduction de sc `enes sonores complexes dans un contexte multim ´edia. PhD thesis, Univ. Paris 6, July 2001
2001
-
[4]
A 3D ambisonic based binaural sound reproduction system,
M. Noisternig, A. Sontacchi, T. Musil, and R. H¨oldrich, “A 3D ambisonic based binaural sound reproduction system,” in24th Int. Conf.: Multichannel Audio, The New Reality, Audio Eng. Soc., June 2003
2003
-
[5]
Binaural Rendering with Measured Room Responses: First-Order Ambisonic Microphone vs. Dummy Head,
M. Zaunschirm, M. Frank, and F. Zotter, “Binaural Rendering with Measured Room Responses: First-Order Ambisonic Microphone vs. Dummy Head,”Applied Sciences, vol. 10, Feb. 2020
2020
-
[6]
Binau- ral Rendering of Ambisonic Signals via Magnitude Least Squares,
C. Schorkhuber, M. Zaunschirm, and R. Holdrich, “Binau- ral Rendering of Ambisonic Signals via Magnitude Least Squares,” inProceedings of the DAGA, vol. 44, 2018
2018
-
[7]
Ambisonics Sound Source Localization With Varying Amount of Visual Information in Virtual Reality,
T. Huisman, A. Ahrens, and E. MacDonald, “Ambisonics Sound Source Localization With Varying Amount of Visual Information in Virtual Reality,”Frontiers in Virtual Reality, vol. 2, Oct. 2021
2021
-
[8]
Ambisonics in an Ogg Opus Container,
J. Skoglund and M. Graczyk, “Ambisonics in an Ogg Opus Container,” Tech. Rep. RFC 8486, Internet Engineering Task Force (IETF), Oct. 2018. https://www.rfc-editor.org/ rfc/rfc8486.txt
2018
-
[9]
Streaming VR for immersion: Quality aspects of compressed spatial audio,
M. Narbutt, S. O’Leary, A. Allen, J. Skoglund, and A. Hines, “Streaming VR for immersion: Quality aspects of compressed spatial audio,” in23rd Int. Conf. Virt. Sys. & Multimedia (VSMM), (Dublin), IEEE, Oct. 2017
2017
-
[10]
Rudzki,Improvements in the Perceived Quality of Streaming and Binaural Rendering of Ambisonics
T. Rudzki,Improvements in the Perceived Quality of Streaming and Binaural Rendering of Ambisonics. PhD thesis, Univ. of York, 2023
2023
-
[11]
MPEG-H 3D Audio—The New Standard for Coding of Immersive Spatial Audio,
J. Herre, J. Hilpert, A. Kuntz, and J. Plogsties, “MPEG-H 3D Audio—The New Standard for Coding of Immersive Spatial Audio,”IEEE J Sel. Top. in Sig. Proc., vol. 9, Aug. 2015
2015
-
[12]
RTP payload format and SDP parameter definitions (Release 18),” Nov
“3GPP TS 26.253 - Technical Specification Group Ser- vices and System Aspects; Codec for Immersive V oice and Audio Services; Detailed Algorithmic Description incl. RTP payload format and SDP parameter definitions (Release 18),” Nov. 2023. 3GPP TS 26.253
2023
-
[13]
Ambisonics Coding in IV AS: A Hybrid SPAR and DirAC System,
D. Weckbecker, S. Brown, J. Torres, M. Multrus, A. Tama- rapu, and G. Fuchs, “Ambisonics Coding in IV AS: A Hybrid SPAR and DirAC System,” inIEEE Int. Conf. Acous., Speech Sig. Proc. (ICASSP), (Hyderabad, India), Apr. 2025
2025
-
[14]
Immersive Audio Coding for Virtual Reality Using a Metadata-assisted Extension of the 3GPP EVS Codec,
D. McGrath, S. Bruhn, H. Purnhagen, M. Eckert, J. Torres, S. Brown, and D. Darcy, “Immersive Audio Coding for Virtual Reality Using a Metadata-assisted Extension of the 3GPP EVS Codec,” inIEEE Int. Conf. Acous., Speech Sig. Proc. (ICASSP), May 2019
2019
-
[15]
Reproducing applause-type signals with directional audio coding,
M.-V . Laitinen, F. Kuech, S. Disch, and V . Pulkki, “Reproducing applause-type signals with directional audio coding,”J. Audio Eng. Soc., vol. 59, 2011
2011
-
[16]
3GPP TR 26.997 - IV AS codec performance characteri- zation,
“3GPP TR 26.997 - IV AS codec performance characteri- zation,” tech. rep., July 2024
2024
-
[17]
Spatial redundancy in Higher Order Ambisonics and its use for lowdelay lossless compression,
E. Hellerud, A. Solvang, and U. P. Svensson, “Spatial redundancy in Higher Order Ambisonics and its use for lowdelay lossless compression,” inIEEE Int. Conf. Acous., Speech Sig. Proc., (Taipei, Taiwan), Apr. 2009
2009
-
[18]
Perceptually-motivated Spatial Audio Codec for Higher- Order Ambisonics Compression,
C. Hold, L. McCormack, A. Politis, and V . Pulkki, “Perceptually-motivated Spatial Audio Codec for Higher- Order Ambisonics Compression,” inIEEE Int. Conf. Acous., Speech Sig. Proc. (ICASSP), Jan. 2024
2024
-
[19]
Clarity Challenge - Task 3,
“Clarity Challenge - Task 3,” 2024. https:// claritychallenge.org/docs/cec3/task 3/cec3 task3 data
2024
-
[20]
ICASSP 2022 Deep Noise Suppression Challenge,
H. Dubey, V . Gopal, R. Cutler, A. Aazami, S. Matusevych, S. Braun, S. E. Eskimez, M. Thakker, T. Yoshioka, H. Gamper, and R. Aichner, “ICASSP 2022 Deep Noise Suppression Challenge,” Feb. 2022. arXiv:2202.13288 [eess]
-
[21]
ITU-R Rec. BS.1534-3: Method for the subjective assessment of intermediate quality level of audio systems,
“ITU-R Rec. BS.1534-3: Method for the subjective assessment of intermediate quality level of audio systems,” 2015
2015
-
[22]
EBU R-128 Loudness Normalisation and Permitted Maximum Level of Audio Signals,
“EBU R-128 Loudness Normalisation and Permitted Maximum Level of Audio Signals,” 2023
2023
-
[23]
Codec for immersive voice and audio services (IV AS); c code (floating-point),
“Codec for immersive voice and audio services (IV AS); c code (floating-point),” 2024. https://www.3gpp.org/ftp/ Specs/archive/26 series/26.258/26258-i20.zip
2024
-
[24]
Overview of the EVS codec architecture,
M. Dietz, M. Multrus, V . Eksler, V . Malenovsky, E. Norvell, H. Pobloth, L. Miao, Z. Wang, L. Laaksonen, A. Vasilache, Y . Kamamoto, K. Kikuiri, S. Ragot, J. Faure, H. Ehara, V . Rajendran, V . Atti, H. Sung, E. Oh, H. Yuan, and C. Zhu, “Overview of the EVS codec architecture,” inIEEE Int. Conf. Acous., Speech Sig. Proc. (ICASSP), (South Brisbane, QLD, A...
2015
-
[25]
Auditory Localization in Low-Bitrate Compressed Ambisonic Scenes,
T. Rudzki, I. Gomez-Lanzaco, J. Stubbs, J. Skoglund, D. T. Murphy, and G. Kearney, “Auditory Localization in Low-Bitrate Compressed Ambisonic Scenes,”Applied Sciences, vol. 9, June 2019
2019
-
[26]
Ambisonics Binaural Rendering via Masked Magnitude Least Squares,
O. Berebi, F. Brinkmann, S. Weinzierl, and B. Rafaely, “Ambisonics Binaural Rendering via Masked Magnitude Least Squares,” inIEEE Int. Conf. Acous., Speech Sig. Proc. (ICASSP), Apr. 2025
2025
-
[27]
IEM Plug-in Suite,
IEM, “IEM Plug-in Suite,” Nov. 2021. https://plugins. iem.at v1.13
2021
-
[28]
All-Round Ambisonic Panning and Decoding,
F. Zotter and M. Frank, “All-Round Ambisonic Panning and Decoding,”J. Audio Eng. Soc., vol. 60, no. 10, 2012
2012
-
[29]
ITU-R Rec. BS.1116-3 – Methods for the subjective assessment of small impairments in audio systems,
“ITU-R Rec. BS.1116-3 – Methods for the subjective assessment of small impairments in audio systems,” 2015
2015
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.