pith. machine review for the scientific record.

arxiv: 2502.05139 · v1 · pith:HSTTNVUOnew · submitted 2025-02-07 · 💻 cs.SD · cs.LG · eess.AS

Meta Audiobox Aesthetics: Unified Automatic Quality Assessment for Speech, Music, and Sound

Pith reviewed 2026-05-17 00:25 UTC · model grok-4.3

keywords: audio aesthetics · quality assessment · no-reference models · speech evaluation · music quality · automatic MOS prediction · annotation guidelines · generative audio

The pith

Decomposing audio aesthetics into four axes lets no-reference models predict quality for speech, music, and sound, matching or beating existing automatic methods when scored against human ratings.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes new annotation guidelines that split human listening perspectives into four distinct axes, and uses the resulting labels to train no-reference models for per-item audio quality assessment. This matters because traditional evaluation relies on human listeners, which is inconsistent and expensive, and generative audio models at scale need fast ways to filter data or assign pseudo-labels. The trained models are evaluated against human mean opinion scores alongside prior methods, and the paper reports performance comparable or superior to those methods. The approach covers speech, music, and sound in one framework rather than in separate tools per domain. Code, models, and datasets are released so that others can apply and extend the work.

Core claim

We introduce annotation guidelines that decompose human listening perspectives into four distinct axes and train no-reference per-item prediction models that assess audio aesthetic quality for speech, music, and sound, achieving performance comparable or superior to existing methods when measured against human mean opinion scores.

What carries the argument

The four-axis annotation guidelines that decompose subjective listening perspectives into separate components for training unified no-reference prediction models.

If this is right

  • Enables scalable filtering and curation of large audio datasets without repeated human listening.
  • Supports pseudo-labeling for training and improving generative audio models.
  • Provides a single evaluation approach that covers speech, music, and general sound in one system.
  • Allows consistent benchmarking of new generative models against a fixed automated scorer.
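
The filtering use case in the first bullet can be illustrated with a minimal sketch. Everything here is an assumption for illustration: the axis names are placeholders (the text above does not name the four axes), the scores are made up, and `filter_by_scores` is a hypothetical helper, not part of the released code.

```python
# Placeholder axis names; the four axes are not named in the text above.
AXES = ("axis_1", "axis_2", "axis_3", "axis_4")


def filter_by_scores(items, thresholds):
    """Keep items whose predicted score meets the minimum on every listed axis."""
    kept = []
    for item in items:
        scores = item["scores"]
        if all(scores[axis] >= minimum for axis, minimum in thresholds.items()):
            kept.append(item)
    return kept


# Made-up predictions standing in for the output of an automatic scorer.
items = [
    {"path": "a.wav", "scores": {"axis_1": 7.2, "axis_2": 3.1, "axis_3": 6.0, "axis_4": 6.5}},
    {"path": "b.wav", "scores": {"axis_1": 4.0, "axis_2": 5.0, "axis_3": 7.1, "axis_4": 6.9}},
    {"path": "c.wav", "scores": {"axis_1": 6.4, "axis_2": 2.2, "axis_3": 5.8, "axis_4": 4.1}},
]

# Hypothetical cut-offs: require a minimum on two of the four axes only.
kept = filter_by_scores(items, {"axis_1": 6.0, "axis_3": 5.5})
print([item["path"] for item in kept])
```

Thresholding each axis separately, rather than averaging the four scores, keeps the decomposition visible: a clip can be rejected on one axis while scoring well on the others.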

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The models could be adapted for real-time quality monitoring in audio streaming or editing software.
  • Cultural variation in aesthetics might require adding or weighting axes differently for global applications.
  • Combining these predictions with other signals like content safety could improve automated audio moderation pipelines.

Load-bearing premise

The four-axis annotation guidelines sufficiently capture the subjective and culturally influenced nature of audio aesthetics for the tested domains and generalize to new data.

What would settle it

A new test set of audio samples drawn from different cultural contexts where human mean opinion scores show low correlation with the four-axis model predictions would indicate the guidelines fail to generalize.
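
The test described above amounts to comparing rank correlation between human MOS and model predictions on familiar data versus data from a new context. A minimal sketch under stated assumptions: the scores are invented, the variable names are hypothetical, and Spearman correlation is only one reasonable choice of agreement measure.

```python
import math


def pearson(x, y):
    """Pearson correlation; assumes equal lengths and non-constant inputs."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Invented data: human MOS and model predictions for one axis.
human_seen = [2.0, 3.5, 4.0, 6.5, 8.0]
predicted_seen = [2.4, 3.1, 4.6, 6.0, 7.7]
human_new = [2.0, 3.5, 4.0, 6.5, 8.0]
predicted_new = [5.0, 5.2, 4.9, 5.1, 5.0]

rho_seen = spearman(human_seen, predicted_seen)
rho_new = spearman(human_new, predicted_new)
print(f"seen context: {rho_seen:.3f}, new context: {rho_new:.3f}")
```

A large drop from the first correlation to the second, repeated per axis and on a realistically sized sample, is the kind of evidence that would indicate the guidelines do not generalize.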

Original abstract

The quantification of audio aesthetics remains a complex challenge in audio processing, primarily due to its subjective nature, which is influenced by human perception and cultural context. Traditional methods often depend on human listeners for evaluation, leading to inconsistencies and high resource demands. This paper addresses the growing need for automated systems capable of predicting audio aesthetics without human intervention. Such systems are crucial for applications like data filtering, pseudo-labeling large datasets, and evaluating generative audio models, especially as these models become more sophisticated. In this work, we introduce a novel approach to audio aesthetic evaluation by proposing new annotation guidelines that decompose human listening perspectives into four distinct axes. We develop and train no-reference, per-item prediction models that offer a more nuanced assessment of audio quality. Our models are evaluated against human mean opinion scores (MOS) and existing methods, demonstrating comparable or superior performance. This research not only advances the field of audio aesthetics but also provides open-source models and datasets to facilitate future work and benchmarking. We release our code and pre-trained model at: https://github.com/facebookresearch/audiobox-aesthetics

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces four-axis annotation guidelines that decompose subjective audio aesthetics for speech, music, and sound into distinct perspectives. It trains no-reference per-item prediction models on these axes and reports that, when evaluated against human mean opinion scores (MOS), the resulting models perform comparably to or better than prior methods. The work emphasizes applications in data filtering, pseudo-labeling, and generative model evaluation, and releases code, models, and datasets.

Significance. If the four-axis labels prove reliable and the performance gains hold under proper validation, the framework offers a practical, unified no-reference tool for audio quality assessment that could reduce reliance on costly human listening tests. The open-source release of models and data is a clear strength for reproducibility and community benchmarking.

major comments (2)
  1. [Annotation guidelines and data collection] The central performance claim rests on the four-axis guidelines producing consistent training targets. The manuscript does not report inter-annotator reliability statistics (e.g., ICC, Krippendorff’s alpha, or per-axis pairwise agreement) for the collected annotations. Without these metrics, it is impossible to assess label noise or stability across annotator pools, which directly affects whether the reported MOS correlations reflect genuine generalization or annotation artifacts.
  2. [Experiments and results] The evaluation section compares models to human MOS and baselines but provides no details on train/test splits, cross-domain testing, or statistical significance testing of the claimed improvements. This makes it difficult to determine whether the “comparable or superior” result is robust or sensitive to particular data partitions.
minor comments (1)
  1. [Abstract] The abstract refers to “four distinct axes” without naming them; explicitly listing the axes (e.g., in the introduction or guidelines section) would improve immediate clarity for readers.
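
The reliability statistics requested in major comment 1 can be computed per axis from an items-by-raters matrix. The sketch below uses the one-way random-effects intraclass correlation, which is only one of several ICC forms; the report does not say which form is intended, and the ratings here are invented.

```python
def icc_one_way(ratings):
    """One-way random-effects ICC for a list of items, each rated by k raters.

    Returns (single-rater ICC, ICC of the k-rater mean).
    Assumes every item has the same number of ratings and items differ in mean.
    """
    n = len(ratings)
    k = len(ratings[0])
    grand_mean = sum(sum(row) for row in ratings) / (n * k)
    item_means = [sum(row) / k for row in ratings]

    # Between-item and within-item mean squares from a one-way ANOVA.
    ms_between = k * sum((m - grand_mean) ** 2 for m in item_means) / (n - 1)
    ms_within = sum(
        (x - m) ** 2 for row, m in zip(ratings, item_means) for x in row
    ) / (n * (k - 1))

    single = (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)
    average = (ms_between - ms_within) / ms_between
    return single, average


# Invented ratings for one axis: 4 audio items, 3 annotators each.
ratings = [
    [7, 8, 7],
    [3, 2, 3],
    [5, 5, 6],
    [9, 9, 8],
]

icc_single, icc_average = icc_one_way(ratings)
print(f"single-rater ICC: {icc_single:.3f}, mean-of-raters ICC: {icc_average:.3f}")
```

The single-rater value describes how noisy one annotation is, while the mean-of-raters value describes the averaged label actually used as a training target, so reporting both per axis would address the concern about label noise most directly.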

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below and have revised the paper to incorporate the requested details on annotation reliability and experimental protocols.

Point-by-point responses
  1. Referee: [Annotation guidelines and data collection] The central performance claim rests on the four-axis guidelines producing consistent training targets. The manuscript does not report inter-annotator reliability statistics (e.g., ICC, Krippendorff’s alpha, or per-axis pairwise agreement) for the collected annotations. Without these metrics, it is impossible to assess label noise or stability across annotator pools, which directly affects whether the reported MOS correlations reflect genuine generalization or annotation artifacts.

    Authors: We agree that inter-annotator reliability metrics are necessary to substantiate the quality of the four-axis annotations. These statistics (including ICC and Krippendorff’s alpha per axis) were computed as part of our internal validation but omitted from the initial submission. The revised manuscript will include a new subsection under Data Collection that reports these values along with per-axis pairwise agreement rates, allowing readers to evaluate label consistency directly.

    Revision: yes

  2. Referee: [Experiments and results] The evaluation section compares models to human MOS and baselines but provides no details on train/test splits, cross-domain testing, or statistical significance testing of the claimed improvements. This makes it difficult to determine whether the “comparable or superior” result is robust or sensitive to particular data partitions.

    Authors: We acknowledge the need for greater transparency in the experimental setup. The revised version will expand the Experiments section to explicitly describe the train/test split strategy (including proportions and any stratification by domain), detail the cross-domain testing protocol across speech, music, and general sound, and present statistical significance results (e.g., paired t-tests or Wilcoxon tests with p-values) for all reported improvements over baselines, with human MOS as the reference. These additions will be supported by updated tables and text.

    Revision: yes
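
One way to run the significance testing discussed in the second response is a paired test on per-item errors against human MOS. The sketch below uses a paired sign-flip permutation test, swapped in for the paired t-test or Wilcoxon test named in the response because it needs only the standard library. The data, the function name, and the resample count are all assumptions for illustration.

```python
import random


def paired_permutation_pvalue(errors_a, errors_b, n_resamples=10000, seed=0):
    """Two-sided p-value for a zero mean paired difference, via random sign flips."""
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(errors_a, errors_b)]
    observed = abs(sum(diffs) / len(diffs))
    hits = 0
    for _ in range(n_resamples):
        # Under the null hypothesis each paired difference is equally likely
        # to have either sign, so flip signs at random and recompute the mean.
        flipped = sum(d if rng.random() < 0.5 else -d for d in diffs)
        if abs(flipped / len(diffs)) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (n_resamples + 1)


# Invented example: human MOS and predictions from two hypothetical models.
human_mos = [6.0, 4.5, 7.0, 3.0, 8.0, 5.5, 6.5, 4.0]
baseline_pred = [5.0, 5.5, 6.0, 4.2, 6.9, 6.4, 5.6, 5.0]
proposed_pred = [5.8, 4.8, 6.7, 3.3, 7.6, 5.7, 6.3, 4.3]

baseline_err = [abs(p - h) for p, h in zip(baseline_pred, human_mos)]
proposed_err = [abs(p - h) for p, h in zip(proposed_pred, human_mos)]

p_value = paired_permutation_pvalue(baseline_err, proposed_err)
print(f"p-value for the difference in mean absolute error: {p_value:.4f}")
```

Pairing by item matters because both models are scored on the same clips; an unpaired test would ignore that shared difficulty and lose power.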

Circularity Check

0 steps flagged

No circularity: performance claims grounded in external human MOS comparisons

Full rationale

The paper introduces new four-axis annotation guidelines and trains no-reference models to predict aesthetic scores along those axes. Central claims rest on direct comparison of model outputs to human mean opinion scores (MOS) collected under the guidelines plus benchmarks against prior methods. This is standard external validation against human judgments rather than any self-definitional loop, fitted parameter renamed as prediction, or load-bearing self-citation. No equations, derivations, or citations in the provided text reduce the reported performance to the inputs by construction. The evaluation protocol remains falsifiable against held-out human data and existing baselines.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The approach rests on the assumption that human aesthetic judgments can be decomposed into four stable axes that are learnable from audio features alone. No explicit free parameters or invented entities are named in the abstract; the models are trained on human MOS collected under the new guidelines.

axioms (1)
  • Domain assumption: Human listening perspectives on audio quality can be decomposed into four distinct, consistent axes that generalize across speech, music, and sound.
    This premise underpins the new annotation guidelines and the claim that the resulting models provide nuanced assessment.

pith-pipeline@v0.9.0 · 5537 in / 1191 out tokens · 25764 ms · 2026-05-17T00:25:32.689178+00:00 · methodology


Forward citations

Cited by 18 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. OmniNFT: Modality-wise Omni Diffusion Reinforcement for Joint Audio-Video Generation

    cs.CV 2026-05 unverdicted novelty 7.0

    OmniNFT introduces modality-wise advantage routing, layer-wise gradient surgery, and region-wise loss reweighting in an online diffusion RL framework to improve audio-video quality, alignment, and synchronization.

  2. TMD-Bench: A Multi-Level Evaluation Paradigm for Music-Dance Co-Generation

    cs.SD 2026-05 unverdicted novelty 7.0

    TMD-Bench is a multi-level benchmark that measures music-dance co-generation quality including beat-level rhythmic synchronization, supported by a new dataset and Music Captioner, and shows commercial models lag in rh...

  3. Hallo-Live: Real-Time Streaming Joint Audio-Video Avatar Generation with Asynchronous Dual-Stream and Human-Centric Preference Distillation

    cs.CV 2026-04 unverdicted novelty 7.0

    Hallo-Live achieves 20.38 FPS real-time text-to-audio-video avatar generation with 0.94s latency using asynchronous dual-stream diffusion and HP-DMD preference distillation, matching teacher model quality at 16x highe...

  4. VidAudio-Bench: Benchmarking V2A and VT2A Generation across Four Audio Categories

    cs.SD 2026-04 unverdicted novelty 7.0

    VidAudio-Bench benchmarks V2A and VT2A models across four audio categories, revealing poor speech/singing performance and a tension between visual alignment and text instruction following.

  5. MIDI-Informed Singing Accompaniment Generation in a Compositional Song Pipeline

    cs.SD 2026-02 unverdicted novelty 7.0

    MIDI-SAG generates consistent long-form singing accompaniments by feeding symbolic MIDI timing, chords, and structure labels into a compositional pipeline built from pre-trained modules.

  6. Omni2Sound: Towards Unified Video-Text-to-Audio Generation

    cs.SD 2026-01 unverdicted novelty 7.0

    A single DiT-based diffusion model unifies video-to-audio, text-to-audio, and joint video-text-to-audio generation, supported by a new 470k-pair dataset and three-stage progressive training that resolves task competition.

  7. AVI-Edit: Audio-sync Video Instance Editing with Granularity-Aware Mask Refiner

    cs.CV 2025-12 unverdicted novelty 7.0

    AVI-Edit enables precise audio-synchronized instance-level video editing via a granularity-aware mask refiner, a self-feedback audio agent, and a new large-scale annotated dataset.

  8. VABench: A Comprehensive Benchmark for Audio-Video Generation

    cs.CV 2025-12 unverdicted novelty 7.0

    VABench is a new multi-dimensional benchmark for evaluating synchronous audio-video generation across text-to-AV, image-to-AV, and stereo tasks.

  9. FSD50K-Solo: Automated Curation of Single-Source Sound Events

    eess.AS 2026-05 conditional novelty 6.0

    The authors present a scalable curation method that combines diffusion-based mixture synthesis with a discriminative classifier to automatically extract single-source sound events from FSD50K and release the cleaned F...

  10. AuDirector: A Self-Reflective Closed-Loop Framework for Immersive Audio Storytelling

    cs.SD 2026-05 unverdicted novelty 6.0

    AuDirector is a self-reflective closed-loop multi-agent framework that generates immersive audio narratives with improved structural coherence, emotional expressiveness, and acoustic fidelity via identity-aware voice ...

  11. VocalParse: Towards Unified and Scalable Singing Voice Transcription with Large Audio Language Models

    cs.SD 2026-05 unverdicted novelty 6.0

    VocalParse applies interleaved and Chain-of-Thought prompting to a Large Audio Language Model to jointly transcribe lyrics, melody and word-note alignments, achieving state-of-the-art results on multiple singing datasets.

  12. JASTIN: Aligning LLMs for Zero-Shot Audio and Speech Evaluation via Natural Language Instructions

    eess.AS 2026-05 unverdicted novelty 6.0

    JASTIN is an instruction-driven audio evaluation system that achieves state-of-the-art correlation with human ratings on speech, sound, music, and out-of-domain tasks without task-specific retraining.

  13. APEX: Large-scale Multi-task Aesthetic-Informed Popularity Prediction for AI-Generated Music

    cs.SD 2026-05 unverdicted novelty 6.0

    APEX jointly predicts engagement-based popularity and five aesthetic quality dimensions for AI-generated music, improving human preference prediction on out-of-distribution generative systems.

  14. OmniHuman: A Large-scale Dataset and Benchmark for Human-Centric Video Generation

    cs.CV 2026-04 unverdicted novelty 6.0

    OmniHuman is a new large-scale multi-scene dataset with video-, frame-, and individual-level annotations for human-centric video generation, accompanied by the OHBench benchmark that adds metrics aligned with human pe...

  15. SongBench: A Fine-Grained Multi-Aspect Benchmark for Song Quality Assessment

    eess.AS 2026-04 unverdicted novelty 6.0

    SongBench is a new fine-grained benchmark for song quality assessment with seven dimensions and an expert-annotated dataset of 11,717 samples showing high correlation with professional ratings.

  16. Bridging What the Model Thinks and How It Speaks: Self-Aware Speech Language Models for Expressive Speech Generation

    cs.CL 2026-04 unverdicted novelty 6.0

    SA-SLM uses variational information bottleneck for intent-aware bridging and self-criticism for realization-aware alignment to close the semantic-acoustic gap, outperforming open-source models and nearing GPT-4o-Audio...

  17. The Algorithmic Gaze of Image Quality Assessment: An Audit and Trace Ethnography of the LAION-Aesthetics Predictor

    cs.HC 2026-01 conditional novelty 6.0

    LAION-Aesthetics Predictor reinforces Western and male biases by preferentially selecting images associated with women and realistic Western/Japanese art while excluding men, LGBTQ+ references, and other styles.

  18. Scaling Properties of Continuous Diffusion Spoken Language Models

    cs.CL 2026-04 unverdicted novelty 5.0

    Continuous diffusion spoken language models follow scaling laws for loss and phoneme divergence and generate emotive multi-speaker speech at 16B scale, though long-form coherence stays difficult.

Reference graph

Works this paper leans on

72 extracted references · 72 canonical work pages · cited by 18 Pith papers · 9 internal anchors

  1. [1]

    Davis and Paul Mermelstein , Journal =

    Steven B. Davis and Paul Mermelstein , Journal =. Comparison of Parametric Representations for Monosyllabic Word Recognition in Continuously Spoken Sentences , Volume =

  2. [2]

    Rabiner , Journal =

    Lawrence R. Rabiner , Journal =. A Tutorial on Hidden

  3. [3]

    The Elements of Statistical Learning -- Data Mining, Inference, and Prediction , Year =

    Trevor Hastie and Robert Tibshirani and Jerome Friedman , Publisher =. The Elements of Statistical Learning -- Data Mining, Inference, and Prediction , Year =

  4. [4]

    A really good paper about

    Jane Smith and Firstname2 Lastname2 and Firstname3 Lastname3 , Pages =. A really good paper about. Proc

  5. [5]

    An excellent paper introducing the

    Robert Jones and Firstname2 Lastname2 and Firstname3 Lastname3 , Crossref =. An excellent paper introducing the

  6. [6]

    Moore and Lucy Skidmore , title=

    Roger K. Moore and Lucy Skidmore , title=. Proc

  7. [7]

    IEEE Journal of Selected Topics in Signal Processing , volume=

    Wavlm: Large-scale self-supervised pre-training for full stack speech processing , author=. IEEE Journal of Selected Topics in Signal Processing , volume=. 2022 , publisher=

  8. [11]

    Attention is All you Need , url =

    Vaswani, Ashish and Shazeer, Noam and Parmar, Niki and Uszkoreit, Jakob and Jones, Llion and Gomez, Aidan N and Kaiser, ukasz and Polosukhin, Illia , booktitle =. Attention is All you Need , url =

  9. [13]

    ICASSP 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) , pages=

    Dnsmos: A non-intrusive perceptual objective speech quality metric to evaluate noise suppressors , author=. ICASSP 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) , pages=. 2021 , organization=

  10. [14]

    The T05 System for The

    Kaito, Baba and Wataru, Nakata and Yuki, Saito and Hiroshi, Saruwatari , booktitle =. The T05 System for The

  11. [15]

    ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) , pages=

    Torchaudio-squim: Reference-less speech quality and intelligibility measures in torchaudio , author=. ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) , pages=. 2023 , organization=

  12. [17]

    11th ISCA Speech Synthesis Workshop (SSW 11) , year=

    How do Voices from Past Speech Synthesis Challenges Compare Today? , author=. 11th ISCA Speech Synthesis Workshop (SSW 11) , year=

  13. [18]

    Interspeech 2022 , year=

    The VoiceMOS Challenge 2022 , author=. Interspeech 2022 , year=

  14. [19]

    ICASSP 2020-2020 IEEE international conference on acoustics, speech and signal processing (ICASSP) , pages=

    ESPnet-TTS: Unified, reproducible, and integratable open source end-to-end text-to-speech toolkit , author=. ICASSP 2020-2020 IEEE international conference on acoustics, speech and signal processing (ICASSP) , pages=. 2020 , organization=

  15. [20]

    The blizzard challenge 2019 , author=. Proc. Blizzard Challenge Workshop , volume=

  16. [21]

    2017 IEEE international conference on acoustics, speech and signal processing (ICASSP) , pages=

    Audio set: An ontology and human-labeled dataset for audio events , author=. 2017 IEEE international conference on acoustics, speech and signal processing (ICASSP) , pages=. 2017 , organization=

  17. [24]

    Richter, Julius and Wu, Yi-Chiao and Krenn, Steven and Welker, Simon and Lay, Bunlong and Watanabe, Shinjii and Richard, Alexander and Gerkmann, Timo , booktitle=

  18. [25]

    2023 , booktitle =

    Expresso: A Benchmark and Analysis of Discrete Expressive Speech Resynthesis , author =. 2023 , booktitle =. doi:10.21437/Interspeech.2023-1905 , issn =

  19. [27]

    and Branson, M

    Ardila, R. and Branson, M. and Davis, K. and Henretty, M. and Kohler, M. and Meyer, J. and Morais, R. and Saunders, L. and Tyers, F. M. and Weber, G. , title =. Proceedings of the 12th Conference on Language Resources and Evaluation (LREC 2020) , pages =

  20. [28]

    and Beerends, J.G

    Rix, A.W. and Beerends, J.G. and Hollier, M.P. and Hekstra, A.P. , booktitle=. Perceptual evaluation of speech quality (PESQ)-a new method for speech quality assessment of telephone networks and codecs , year=

  21. [29]

    Journal of the Audio Engineering Society , volume=

    Perceptual objective listening quality assessment (POLQA), the third generation itu-t standard for end-to-end speech quality measurement part i—temporal alignment , author=. Journal of the Audio Engineering Society , volume=. 2013 , publisher=

  22. [32]

    International conference on machine learning , pages=

    Robust speech recognition via large-scale weak supervision , author=. International conference on machine learning , pages=. 2023 , organization=

  23. [33]

    IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP , year =

    Large-scale Contrastive Language-Audio Pretraining with Feature Fusion and Keyword-to-Caption Augmentation , author =. IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP , year =

  24. [36]

    2020 twelfth international conference on quality of multimedia experience (QoMEX) , pages=

    ViSQOL v3: An open source production ready objective speech and audio metric , author=. 2020 twelfth international conference on quality of multimedia experience (QoMEX) , pages=. 2020 , organization=

  25. [38]

    International Telecommunications Union—Radiocommunication (ITU-T) , year=

    Recommendation p.1401: Methods, metrics and procedures for statistical evaluation, qualification and comparison of objective quality prediction models , author=. International Telecommunications Union—Radiocommunication (ITU-T) , year=

  26. [42]

    Interspeech , year=

    MOSNet: Deep Learning-Based Objective Assessment for Voice Conversion , author=. Interspeech , year=

  27. [43]

    ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) , pages=

    Generalization ability of MOS prediction networks , author=. ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) , pages=. 2022 , organization=

  28. [45]

    2023 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU) , pages=

    LE-SSL-MOS: Self-Supervised Learning MOS Prediction with Listener Enhancement , author=. 2023 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU) , pages=. 2023 , organization=

  29. [46]

    LAION-AESTHETICS , author=

  30. [47]

    , author=

    Quality Degradation Diagnosis for Voice Networks-Estimating the Perceived Noisiness, Coloration, and Discontinuity of Transmitted Speech. , author=. INTERSPEECH , pages=

  31. [48]

    Interspeech 2021 , year=

    NISQA: A Deep CNN-Self-Attention Model for Multidimensional Speech Quality Prediction with Crowdsourced Datasets , author=. Interspeech 2021 , year=

  32. [49]

    International Telecommunications Union—Radiocommunication (ITU-T), 2001

    Recommendation p.1401: Methods, metrics and procedures for statistical evaluation, qualification and comparison of objective quality prediction models. International Telecommunications Union—Radiocommunication (ITU-T), 2001

  33. [50]

    MusicLM: Generating Music From Text

    Andrea Agostinelli, Timo I Denk, Zal \'a n Borsos, Jesse Engel, Mauro Verzetti, Antoine Caillon, Qingqing Huang, Aren Jansen, Adam Roberts, Marco Tagliasacchi, et al. Musiclm: Generating music from text. arXiv preprint arXiv:2301.11325, 2023

  34. [51]

    Ardila, M

    R. Ardila, M. Branson, K. Davis, M. Henretty, M. Kohler, J. Meyer, R. Morais, L. Saunders, F. M. Tyers, and G. Weber. Common voice: A massively-multilingual speech corpus. In Proceedings of the 12th Conference on Language Resources and Evaluation (LREC 2020), pages 4211--4215, 2020

  35. [52]

    Layer Normalization

    Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. Layer normalization. arxiv 2016. arXiv preprint arXiv:1607.06450, 1, 2016

  36. [53]

    Perceptual objective listening quality assessment (polqa), the third generation itu-t standard for end-to-end speech quality measurement part i—temporal alignment

    John G Beerends, Christian Schmidmer, Jens Berger, Matthias Obermann, Raphael Ullmann, Joachim Pomy, and Michael Keyhl. Perceptual objective listening quality assessment (polqa), the third generation itu-t standard for end-to-end speech quality measurement part i—temporal alignment. Journal of the Audio Engineering Society, 61 0 (6): 0 366--384, 2013

  37. [54]

    Wavlm: Large-scale self-supervised pre-training for full stack speech processing

    Sanyuan Chen, Chengyi Wang, Zhengyang Chen, Yu Wu, Shujie Liu, Zhuo Chen, Jinyu Li, Naoyuki Kanda, Takuya Yoshioka, Xiong Xiao, et al. Wavlm: Large-scale self-supervised pre-training for full stack speech processing. IEEE Journal of Selected Topics in Signal Processing, 16 0 (6): 0 1505--1518, 2022

  38. [55]

    Visqol v3: An open source production ready objective speech and audio metric

    Michael Chinen, Felicia SC Lim, Jan Skoglund, Nikita Gureev, Feargus O'Gorman, and Andrew Hines. Visqol v3: An open source production ready objective speech and audio metric. In 2020 twelfth international conference on quality of multimedia experience (QoMEX), pages 1--6. IEEE, 2020

  39. [56]

    How do voices from past speech synthesis challenges compare today? In 11th ISCA Speech Synthesis Workshop (SSW 11)

    Erica Cooper and Junichi Yamagishi. How do voices from past speech synthesis challenges compare today? In 11th ISCA Speech Synthesis Workshop (SSW 11). ISCA, 2021

  40. [57]

    Investigating range-equalizing bias in mean opinion score ratings of synthesized speech

    Erica Cooper and Junichi Yamagishi. Investigating range-equalizing bias in mean opinion score ratings of synthesized speech. arXiv preprint arXiv:2305.10608, 2023

  41. [58]

    Generalization ability of mos prediction networks

    Erica Cooper, Wen-Chin Huang, Tomoki Toda, and Junichi Yamagishi. Generalization ability of mos prediction networks. In ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 8442--8446. IEEE, 2022

  42. [59]

    Pam: Prompting audio-language models for audio quality assessment

    Soham Deshmukh, Dareen Alharthi, Benjamin Elizalde, Hannes Gamper, Mahmoud Al Ismail, Rita Singh, Bhiksha Raj, and Huaming Wang. Pam: Prompting audio-language models for audio quality assessment. arXiv preprint arXiv:2402.00282, 2023

  43. [60]

    High Fidelity Neural Audio Compression

    Alexandre Défossez, Jade Copet, Gabriel Synnaeve, and Yossi Adi. High fidelity neural audio compression. arXiv preprint arXiv:2210.13438, 2022

  44. [61]

    Quality-Net: An End-to-End Non-intrusive Speech Quality Assessment Model based on BLSTM

    Szu-Wei Fu, Yu Tsao, Hsin-Te Hwang, and Hsin-Min Wang. Quality-net: An end-to-end non-intrusive speech quality assessment model based on blstm. arXiv preprint arXiv:1808.05344, 2018

  45. [62]

    Audio set: An ontology and human-labeled dataset for audio events

    Jort F Gemmeke, Daniel PW Ellis, Dylan Freedman, Aren Jansen, Wade Lawrence, R Channing Moore, Manoj Plakal, and Marvin Ritter. Audio set: An ontology and human-labeled dataset for audio events. In 2017 IEEE international conference on acoustics, speech and signal processing (ICASSP), pages 776--780. IEEE, 2017

  46. [63]

    Espnet-tts: Unified, reproducible, and integratable open source end-to-end text-to-speech toolkit

    Tomoki Hayashi, Ryuichi Yamamoto, Katsuki Inoue, Takenori Yoshimura, Shinji Watanabe, Tomoki Toda, Kazuya Takeda, Yu Zhang, and Xu Tan. Espnet-tts: Unified, reproducible, and integratable open source end-to-end text-to-speech toolkit. In ICASSP 2020-2020 IEEE international conference on acoustics, speech and signal processing (ICASSP), pages 7654--7658. I...

  47. [64]

    Gaussian Error Linear Units (GELUs)

    Dan Hendrycks and Kevin Gimpel. Gaussian error linear units (gelus). arXiv preprint arXiv:1606.08415, 2016

  48. [65]

    The voicemos challenge 2022

    Wen Chin Huang, Erica Cooper, Yu Tsao, Hsin-Min Wang, Tomoki Toda, and Junichi Yamagishi. The voicemos challenge 2022. Interspeech 2022, 2022

  49. [66]

    MOS-Bench: Benchmarking Generalization Abilities of Subjective Speech Quality Assessment Models

    Wen-Chin Huang, Erica Cooper, and Tomoki Toda. Mos-bench: Benchmarking generalization abilities of subjective speech quality assessment models. arXiv preprint arXiv:2411.03715, 2024 a

  50. [67]

    The voicemos challenge 2024: Beyond speech quality prediction

    Wen-Chin Huang, Szu-Wei Fu, Erica Cooper, Ryandhimas E Zezario, Tomoki Toda, Hsin-Min Wang, Junichi Yamagishi, and Yu Tsao. The voicemos challenge 2024: Beyond speech quality prediction. arXiv preprint arXiv:2409.07001, 2024 b

  51. [68]

    The t05 system for the V oice MOS C hallenge 2024: Transfer learning from deep image classifier to naturalness MOS prediction of high-quality synthetic speech

    Baba Kaito, Nakata Wataru, Saito Yuki, and Saruwatari Hiroshi. The t05 system for the V oice MOS C hallenge 2024: Transfer learning from deep image classifier to naturalness MOS prediction of high-quality synthetic speech. In IEEE Spoken Language Technology Workshop (SLT), 2024

  52. [69]

    Fr\'echet Audio Distance: A Metric for Evaluating Music Enhancement Algorithms

    Kevin Kilgour, Mauricio Zuluaga, Dominik Roblek, and Matthew Sharifi. Frechet audio distance: A metric for evaluating music enhancement algorithms. arXiv preprint arXiv:1812.08466, 2018

  53. [70]

    A udio C aps: Generating captions for audios in the wild

    Chris Dongjoo Kim, Byeongchang Kim, Hyunmin Lee, and Gunhee Kim. A udio C aps: Generating captions for audios in the wild. In Jill Burstein, Christy Doran, and Thamar Solorio, editors, Proceedings of the 2019 Conference of the North A merican Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Pa...

  54. [71]

    Torchaudio-squim: Reference-less speech quality and intelligibility measures in torchaudio

    Anurag Kumar, Ke Tan, Zhaoheng Ni, Pranay Manocha, Xiaohui Zhang, Ethan Henderson, and Buye Xu. Torchaudio-squim: Reference-less speech quality and intelligibility measures in torchaudio. In ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1--5. IEEE, 2023

  55. [72]

    Laion-aesthetics

    LAION. Laion-aesthetics. https://laion.ai/blog/laion-aesthetics/. Accessed: 2024-12-06

  56. [73]

    The Llama 3 Herd of Models

    AI at Meta Llama Team. The llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024

  57. [74]

    Mosnet: Deep learning-based objective assessment for voice conversion

    Chen-Chou Lo, Szu-Wei Fu, Wen-Chin Huang, Xin Wang, Junichi Yamagishi, Yu Tsao, and Hsin-Min Wang. Mosnet: Deep learning-based objective assessment for voice conversion. Interspeech, 2019

  58. [75]

    Quality degradation diagnosis for voice networks-estimating the perceived noisiness, coloration, and discontinuity of transmitted speech

    Gabriel Mittag and Sebastian Möller. Quality degradation diagnosis for voice networks-estimating the perceived noisiness, coloration, and discontinuity of transmitted speech. In INTERSPEECH, pages 3426--3430, 2019

  59. [76]

    Nisqa: A deep cnn-self-attention model for multidimensional speech quality prediction with crowdsourced datasets

    Gabriel Mittag, Babak Naderi, Assmaa Chehadi, and Sebastian Möller. Nisqa: A deep cnn-self-attention model for multidimensional speech quality prediction with crowdsourced datasets. Interspeech 2021, 2021

  60. [77]

    AutoMOS: Learning a non-intrusive assessor of naturalness-of-speech

    Brian Patton, Yannis Agiomyrgiannakis, Michael Terry, Kevin Wilson, Rif A Saurous, and D Sculley. Automos: Learning a non-intrusive assessor of naturalness-of-speech. arXiv preprint arXiv:1611.09207, 2016

  61. [78]

    Le-ssl-mos: Self-supervised learning mos prediction with listener enhancement

    Zili Qi, Xinhui Hu, Wangjin Zhou, Sheng Li, Hao Wu, Jian Lu, and Xinkang Xu. Le-ssl-mos: Self-supervised learning mos prediction with listener enhancement. In 2023 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), pages 1--6. IEEE, 2023

  62. [79]

    Robust speech recognition via large-scale weak supervision

    Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, and Ilya Sutskever. Robust speech recognition via large-scale weak supervision. In International conference on machine learning, pages 28492--28518. PMLR, 2023

  63. [80]

    MUSDB18-HQ - an uncompressed version of musdb18

    Zafar Rafii, Antoine Liutkus, Fabian-Robert Stöter, Stylianos Ioannis Mimilakis, and Rachel Bittner. Musdb18-hq - an uncompressed version of musdb18, August 2019. https://doi.org/10.5281/zenodo.3338373

  64. [81]

    Dnsmos: A non-intrusive perceptual objective speech quality metric to evaluate noise suppressors

    Chandan KA Reddy, Vishak Gopal, and Ross Cutler. Dnsmos: A non-intrusive perceptual objective speech quality metric to evaluate noise suppressors. In ICASSP 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 6493--6497. IEEE, 2021

  65. [82]

    EARS: An anechoic fullband speech dataset benchmarked for speech enhancement and dereverberation

    Julius Richter, Yi-Chiao Wu, Steven Krenn, Simon Welker, Bunlong Lay, Shinji Watanabe, Alexander Richard, and Timo Gerkmann. EARS: An anechoic fullband speech dataset benchmarked for speech enhancement and dereverberation. In Interspeech, 2024

  66. [83]

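    Perceptual evaluation of speech quality (PESQ)-a new method for speech quality assessment of telephone networks and codecs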

    A.W. Rix, J.G. Beerends, M.P. Hollier, and A.P. Hekstra. Perceptual evaluation of speech quality (pesq)-a new method for speech quality assessment of telephone networks and codecs. In IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP, 2001. doi:10.1109/ICASSP.2001.941023

  67. [84]

    Utmos: Utokyo-sarulab system for voicemos challenge 2022

    Takaaki Saeki, Detai Xin, Wataru Nakata, Tomoki Koriyama, Shinnosuke Takamichi, and Hiroshi Saruwatari. Utmos: Utokyo-sarulab system for voicemos challenge 2022. arXiv preprint arXiv:2204.02152, 2022

  68. [85]

    Attention is all you need

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In I. Guyon, U. Von Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017. https://...

  69. [86]

    Audiobox: Unified audio generation with natural language prompts

    Apoorv Vyas, Bowen Shi, Matthew Le, Andros Tjandra, Yi-Chiao Wu, Baishan Guo, Jiemin Zhang, Xinyue Zhang, Robert Adkins, William Ngan, et al. Audiobox: Unified audio generation with natural language prompts. arXiv preprint arXiv:2312.15821, 2023

  70. [87]

    Large-scale contrastive language-audio pretraining with feature fusion and keyword-to-caption augmentation

    Yusong Wu, Ke Chen, Tianyu Zhang, Yuchen Hui, Taylor Berg-Kirkpatrick, and Shlomo Dubnov. Large-scale contrastive language-audio pretraining with feature fusion and keyword-to-caption augmentation. In IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP, 2023

  71. [88]

    The blizzard challenge 2019

    Zhizheng Wu, Zhihang Xie, and Simon King. The blizzard challenge 2019. In Proc. Blizzard Challenge Workshop, volume 2019, 2019

  72. [89]

    Libritts: A corpus derived from librispeech for text-to-speech

    Heiga Zen, Viet Dang, Rob Clark, Yu Zhang, Ron J. Weiss, Ye Jia, Zhifeng Chen, and Yonghui Wu. Libritts: A corpus derived from librispeech for text-to-speech. In Interspeech 2019, pages 1526--1530, 2019. doi:10.21437/Interspeech.2019-2441