arxiv: 2502.05139 · v1 · pith:HSTTNVUOnew · submitted 2025-02-07 · 💻 cs.SD · cs.LG · eess.AS

Meta Audiobox Aesthetics: Unified Automatic Quality Assessment for Speech, Music, and Sound

Andros Tjandra, Yi-Chiao Wu, Baishan Guo, John Hoffman, Brian Ellis, Apoorv Vyas, Bowen Shi, Sanyuan Chen, Matt Le, Nick Zacharov, Carleigh Wood, Ann Lee, Wei-Ning Hsu

Pith reviewed 2026-05-17 00:25 UTC · model grok-4.3

classification 💻 cs.SD · cs.LG · eess.AS

keywords audio aesthetics · quality assessment · no-reference models · speech evaluation · music quality · automatic MOS prediction · annotation guidelines · generative audio

The pith

Decomposing audio aesthetics into four axes lets no-reference models predict quality for speech, music, and sound at levels comparable or superior to existing methods, as measured against human mean opinion scores.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes new annotation guidelines that split human listening perspectives into four distinct axes to support training of no-reference models for per-item audio quality assessment. This matters because traditional evaluation relies on human listeners, which is inconsistent and costly, especially as generative audio models scale and need fast ways to filter data or assign pseudo-labels. The trained models are evaluated against human mean opinion scores (MOS) and compared with prior methods, and the paper reports performance comparable or superior to those methods across speech, music, and sound. The approach unifies evaluation in one framework rather than using separate tools per domain. Code, models, and datasets are released so that others can apply and extend the work.

Core claim

We introduce annotation guidelines that decompose human listening perspectives into four distinct axes and train no-reference per-item prediction models that assess audio aesthetic quality for speech, music, and sound, achieving performance comparable or superior to existing methods when measured against human mean opinion scores.

What carries the argument

The four-axis annotation guidelines that decompose subjective listening perspectives into separate components for training unified no-reference prediction models.

If this is right

Enables scalable filtering and curation of large audio datasets without repeated human listening.
Supports pseudo-labeling for training and improving generative audio models.
Provides a single evaluation approach that covers speech, music, and general sound in one system.
Allows consistent benchmarking of new generative models against a fixed automated scorer.
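
The filtering use case in the first point can be sketched as a threshold over per-axis scores. This is a minimal illustration, not the released tool's interface: the axis keys (`axis_1`, `axis_2`), the score values, and the thresholds are placeholders assumed here, since this page does not name the axes or their scale.

```python
from typing import Dict, List

# Hypothetical record layout: a file path plus one predicted score per axis.
# Axis names and score values are placeholders, not taken from the paper.
Scores = Dict[str, float]


def filter_by_axes(items: List[dict], min_scores: Scores) -> List[dict]:
    """Keep items whose predicted score meets every per-axis minimum."""
    kept = []
    for item in items:
        scores = item["scores"]
        if all(scores[axis] >= floor for axis, floor in min_scores.items()):
            kept.append(item)
    return kept


items = [
    {"path": "a.wav", "scores": {"axis_1": 7.2, "axis_2": 6.1}},
    {"path": "b.wav", "scores": {"axis_1": 4.0, "axis_2": 8.3}},
    {"path": "c.wav", "scores": {"axis_1": 6.5, "axis_2": 5.9}},
]
kept = filter_by_axes(items, {"axis_1": 6.0, "axis_2": 6.0})
kept_paths = [item["path"] for item in kept]
print(kept_paths)
```

Requiring every axis to clear its own floor is one possible policy; a weighted sum over axes would be an equally plausible alternative, and the right choice depends on the curation goal.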

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The models could be adapted for real-time quality monitoring in audio streaming or editing software.
Cultural variation in aesthetics might require adding or weighting axes differently for global applications.
Combining these predictions with other signals like content safety could improve automated audio moderation pipelines.

Load-bearing premise

The four-axis annotation guidelines sufficiently capture the subjective and culturally influenced nature of audio aesthetics for the tested domains and generalize to new data.

What would settle it

A new test set of audio samples drawn from different cultural contexts where human mean opinion scores show low correlation with the four-axis model predictions would indicate the guidelines fail to generalize.
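
The check described above comes down to correlating model predictions with human MOS on the new test set, one axis at a time. The sketch below shows that computation with the standard library only; the sample values are invented for illustration, and which correlation measures the paper itself reports is not stated on this page.

```python
from math import sqrt
from typing import List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson linear correlation between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Invented example values for a single axis on a held-out set.
human_mos = [3.1, 4.2, 2.0, 4.8, 3.6]
predicted = [3.0, 4.0, 2.5, 4.9, 3.3]
r_linear = pearson(human_mos, predicted)
r_rank = spearman(human_mos, predicted)
print(round(r_linear, 3), round(r_rank, 3))
```

Running the same computation separately for each cultural group, rather than on the pooled set, is what would expose a generalization failure that pooling could hide.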

Original abstract

The quantification of audio aesthetics remains a complex challenge in audio processing, primarily due to its subjective nature, which is influenced by human perception and cultural context. Traditional methods often depend on human listeners for evaluation, leading to inconsistencies and high resource demands. This paper addresses the growing need for automated systems capable of predicting audio aesthetics without human intervention. Such systems are crucial for applications like data filtering, pseudo-labeling large datasets, and evaluating generative audio models, especially as these models become more sophisticated. In this work, we introduce a novel approach to audio aesthetic evaluation by proposing new annotation guidelines that decompose human listening perspectives into four distinct axes. We develop and train no-reference, per-item prediction models that offer a more nuanced assessment of audio quality. Our models are evaluated against human mean opinion scores (MOS) and existing methods, demonstrating comparable or superior performance. This research not only advances the field of audio aesthetics but also provides open-source models and datasets to facilitate future work and benchmarking. We release our code and pre-trained model at: https://github.com/facebookresearch/audiobox-aesthetics

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

The paper delivers a practical four-axis no-reference predictor for audio quality across domains with open code, but it does not report how reliable the underlying annotations are.

The main thing to know is that the authors created four-axis annotation guidelines for audio aesthetics and trained unified no-reference models that they report perform at or above existing methods when checked against their collected human MOS. They also released the code and models, which makes the work immediately usable for filtering datasets or scoring generative outputs in speech, music, and sound. That combination is the core value here rather than any deep theoretical shift.

What they did well is move beyond single-score MOS predictors by breaking listening into multiple axes and applying the same model across three domains. This is a reasonable engineering extension of prior single-domain work, and the open release at the GitHub link lets others test it directly on their own data without rebuilding from scratch. The abstract's claim of comparable or superior performance suggests the models are competitive under the conditions they tested.

The soft spot is exactly where the stress-test note flags it: no reported inter-annotator agreement, ICC scores, or checks for stability across annotator pools or cultural groups. Without those numbers, the training targets could carry substantial label noise, which would make the performance numbers look stronger on the collected data than they would on fresh material. The abstract also skips details on data splits, per-domain error analysis, or how much the unified model relies on one domain dominating the training set. Those gaps leave open whether the four axes generalize or mainly reflect the annotation process used here.

This paper is for audio ML practitioners who need fast automated quality checks for curation or model evaluation rather than for theorists looking for new formal results. A reader building generative systems or managing large audio collections would get direct value from the released artifacts and could adapt the axes for their own needs.

It shows clear practical thinking and honest engagement with the evaluation problem, so it deserves a serious referee rather than a desk reject. I would recommend sending it to peer review, with the specific request that the authors add the missing reliability metrics and cross-domain breakdown in the revision.

Referee Report

2 major / 1 minor

Summary. The paper introduces four-axis annotation guidelines to decompose subjective audio aesthetics for speech, music, and sound into distinct perspectives. It trains no-reference per-item prediction models on these axes and reports that, when evaluated against human mean opinion scores (MOS), the resulting models perform comparably to or better than prior methods. The work emphasizes applications in data filtering, pseudo-labeling, and generative model evaluation, and releases code, models, and datasets.

Significance. If the four-axis labels prove reliable and the performance gains hold under proper validation, the framework offers a practical, unified no-reference tool for audio quality assessment that could reduce reliance on costly human listening tests. The open-source release of models and data is a clear strength for reproducibility and community benchmarking.

major comments (2)

[Annotation guidelines and data collection] The central performance claim rests on the four-axis guidelines producing consistent training targets. The manuscript does not report inter-annotator reliability statistics (e.g., ICC, Krippendorff’s alpha, or per-axis pairwise agreement) for the collected annotations. Without these metrics, it is impossible to assess label noise or stability across annotator pools, which directly affects whether the reported MOS correlations reflect genuine generalization or annotation artifacts.
[Experiments and results] The evaluation section compares models to human MOS and baselines but provides no details on train/test splits, cross-domain testing, or statistical significance testing of the claimed improvements. This makes it difficult to determine whether the “comparable or superior” result is robust or sensitive to particular data partitions.
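
To make the first major comment concrete, one of the reliability statistics it names can be computed per axis from a table of raw ratings. The sketch below implements ICC(1,1), the one-way random-effects, single-rater form. Which ICC variant (if any) the authors computed is not known from this page, and the ratings matrix is invented for illustration.

```python
from typing import Sequence


def icc_one_way(ratings: Sequence[Sequence[float]]) -> float:
    """ICC(1,1): one-way random-effects, single-rater reliability.

    `ratings` has one row per audio item and one column per rating;
    every row must hold the same number of ratings (k >= 2).
    """
    n = len(ratings)
    k = len(ratings[0])
    if n < 2 or k < 2 or any(len(row) != k for row in ratings):
        raise ValueError("need >= 2 items, each with the same k >= 2 ratings")

    row_means = [sum(row) / k for row in ratings]
    grand_mean = sum(row_means) / n

    # Between-item and within-item mean squares from a one-way ANOVA.
    ms_between = k * sum((m - grand_mean) ** 2 for m in row_means) / (n - 1)
    ms_within = sum(
        (x - m) ** 2 for row, m in zip(ratings, row_means) for x in row
    ) / (n * (k - 1))

    return (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)


# Invented ratings for one axis: 4 items, 3 ratings each.
axis_ratings = [[7, 8, 7], [3, 4, 2], [9, 9, 10], [5, 6, 5]]
icc = icc_one_way(axis_ratings)
print(round(icc, 3))
```

Values near 1 indicate that most rating variance comes from differences between items rather than disagreement among raters; reporting this per axis would address the label-noise concern directly.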

minor comments (1)

[Abstract] The abstract refers to “four distinct axes” without naming them; explicitly listing the axes (e.g., in the introduction or guidelines section) would improve immediate clarity for readers.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below and will revise the paper to incorporate the requested details on annotation reliability and experimental protocols.

Referee: [Annotation guidelines and data collection] The central performance claim rests on the four-axis guidelines producing consistent training targets. The manuscript does not report inter-annotator reliability statistics (e.g., ICC, Krippendorff’s alpha, or per-axis pairwise agreement) for the collected annotations. Without these metrics, it is impossible to assess label noise or stability across annotator pools, which directly affects whether the reported MOS correlations reflect genuine generalization or annotation artifacts.

Authors: We agree that inter-annotator reliability metrics are necessary to substantiate the quality of the four-axis annotations. These statistics (including ICC and Krippendorff’s alpha per axis) were computed as part of our internal validation but omitted from the initial submission. The revised manuscript will include a new subsection under Data Collection that reports these values along with per-axis pairwise agreement rates, allowing readers to evaluate label consistency directly. revision: yes
Referee: [Experiments and results] The evaluation section compares models to human MOS and baselines but provides no details on train/test splits, cross-domain testing, or statistical significance testing of the claimed improvements. This makes it difficult to determine whether the “comparable or superior” result is robust or sensitive to particular data partitions.

Authors: We acknowledge the need for greater transparency in the experimental setup. The revised version will expand the Experiments section to explicitly describe the train/test split strategy (including proportions and any stratification by domain), detail the cross-domain testing protocol across speech, music, and general sound, and present statistical significance results (e.g., paired t-tests or Wilcoxon tests with p-values) for all reported improvements over baselines in agreement with human MOS. These additions will be supported by updated tables and text. revision: yes
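
The significance testing discussed in the second response can be illustrated with a paired test on per-item errors. The sketch below uses a sign-flip permutation test, which is a different paired test from the t-test and Wilcoxon test named in the response, chosen here only because it needs nothing beyond the standard library. The error values are invented, and the function name is hypothetical.

```python
import random
from typing import Sequence


def paired_sign_flip_test(
    err_a: Sequence[float],
    err_b: Sequence[float],
    n_resamples: int = 10_000,
    seed: int = 0,
) -> float:
    """Two-sided p-value for the null that mean(err_a - err_b) is zero.

    Each paired difference has its sign flipped at random; the p-value is
    the share of resamples whose |mean| is at least the observed |mean|.
    """
    diffs = [a - b for a, b in zip(err_a, err_b)]
    observed = abs(sum(diffs) / len(diffs))
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_resamples):
        flipped = sum(d if rng.random() < 0.5 else -d for d in diffs)
        if abs(flipped / len(diffs)) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (n_resamples + 1)


# Invented per-item absolute errors against human MOS on the same test items.
err_model = [0.2, 0.1, 0.3, 0.2, 0.1, 0.2, 0.3, 0.1]
err_baseline = [0.5, 0.4, 0.6, 0.3, 0.4, 0.5, 0.7, 0.3]
p_value = paired_sign_flip_test(err_model, err_baseline)
print(p_value)
```

Pairing by item matters because both systems are scored on the same clips, so an unpaired test would discard the shared per-item difficulty and lose power.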

Circularity Check

0 steps flagged

No circularity: performance claims grounded in external human MOS comparisons

The paper introduces new four-axis annotation guidelines and trains no-reference models to predict aesthetic scores along those axes. Central claims rest on direct comparison of model outputs to human mean opinion scores (MOS) collected under the guidelines plus benchmarks against prior methods. This is standard external validation against human judgments rather than any self-definitional loop, fitted parameter renamed as prediction, or load-bearing self-citation. No equations, derivations, or citations in the provided text reduce the reported performance to the inputs by construction. The evaluation protocol remains falsifiable against held-out human data and existing baselines.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The approach rests on the assumption that human aesthetic judgments can be decomposed into four stable axes that are learnable from audio features alone. No explicit free parameters or invented entities are named in the abstract; the models are trained on human MOS collected under the new guidelines.

axioms (1)

Domain assumption: Human listening perspectives on audio quality can be decomposed into four distinct, consistent axes that generalize across speech, music, and sound.
This premise underpins the new annotation guidelines and the claim that the resulting models provide nuanced assessment.

Forward citations

Cited by 18 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

OmniNFT: Modality-wise Omni Diffusion Reinforcement for Joint Audio-Video Generation
cs.CV 2026-05 unverdicted novelty 7.0

OmniNFT introduces modality-wise advantage routing, layer-wise gradient surgery, and region-wise loss reweighting in an online diffusion RL framework to improve audio-video quality, alignment, and synchronization.
TMD-Bench: A Multi-Level Evaluation Paradigm for Music-Dance Co-Generation
cs.SD 2026-05 unverdicted novelty 7.0

TMD-Bench is a multi-level benchmark that measures music-dance co-generation quality including beat-level rhythmic synchronization, supported by a new dataset and Music Captioner, and shows commercial models lag in rh...
Hallo-Live: Real-Time Streaming Joint Audio-Video Avatar Generation with Asynchronous Dual-Stream and Human-Centric Preference Distillation
cs.CV 2026-04 unverdicted novelty 7.0

Hallo-Live achieves 20.38 FPS real-time text-to-audio-video avatar generation with 0.94s latency using asynchronous dual-stream diffusion and HP-DMD preference distillation, matching teacher model quality at 16x highe...
VidAudio-Bench: Benchmarking V2A and VT2A Generation across Four Audio Categories
cs.SD 2026-04 unverdicted novelty 7.0

VidAudio-Bench benchmarks V2A and VT2A models across four audio categories, revealing poor speech/singing performance and a tension between visual alignment and text instruction following.
MIDI-Informed Singing Accompaniment Generation in a Compositional Song Pipeline
cs.SD 2026-02 unverdicted novelty 7.0

MIDI-SAG generates consistent long-form singing accompaniments by feeding symbolic MIDI timing, chords, and structure labels into a compositional pipeline built from pre-trained modules.
Omni2Sound: Towards Unified Video-Text-to-Audio Generation
cs.SD 2026-01 unverdicted novelty 7.0

A single DiT-based diffusion model unifies video-to-audio, text-to-audio, and joint video-text-to-audio generation, supported by a new 470k-pair dataset and three-stage progressive training that resolves task competition.
AVI-Edit: Audio-sync Video Instance Editing with Granularity-Aware Mask Refiner
cs.CV 2025-12 unverdicted novelty 7.0

AVI-Edit enables precise audio-synchronized instance-level video editing via a granularity-aware mask refiner, a self-feedback audio agent, and a new large-scale annotated dataset.
VABench: A Comprehensive Benchmark for Audio-Video Generation
cs.CV 2025-12 unverdicted novelty 7.0

VABench is a new multi-dimensional benchmark for evaluating synchronous audio-video generation across text-to-AV, image-to-AV, and stereo tasks.
FSD50K-Solo: Automated Curation of Single-Source Sound Events
eess.AS 2026-05 conditional novelty 6.0

The authors present a scalable curation method that combines diffusion-based mixture synthesis with a discriminative classifier to automatically extract single-source sound events from FSD50K and release the cleaned F...
AuDirector: A Self-Reflective Closed-Loop Framework for Immersive Audio Storytelling
cs.SD 2026-05 unverdicted novelty 6.0

AuDirector is a self-reflective closed-loop multi-agent framework that generates immersive audio narratives with improved structural coherence, emotional expressiveness, and acoustic fidelity via identity-aware voice ...
VocalParse: Towards Unified and Scalable Singing Voice Transcription with Large Audio Language Models
cs.SD 2026-05 unverdicted novelty 6.0

VocalParse applies interleaved and Chain-of-Thought prompting to a Large Audio Language Model to jointly transcribe lyrics, melody and word-note alignments, achieving state-of-the-art results on multiple singing datasets.
JASTIN: Aligning LLMs for Zero-Shot Audio and Speech Evaluation via Natural Language Instructions
eess.AS 2026-05 unverdicted novelty 6.0

JASTIN is an instruction-driven audio evaluation system that achieves state-of-the-art correlation with human ratings on speech, sound, music, and out-of-domain tasks without task-specific retraining.
APEX: Large-scale Multi-task Aesthetic-Informed Popularity Prediction for AI-Generated Music
cs.SD 2026-05 unverdicted novelty 6.0

APEX jointly predicts engagement-based popularity and five aesthetic quality dimensions for AI-generated music, improving human preference prediction on out-of-distribution generative systems.
OmniHuman: A Large-scale Dataset and Benchmark for Human-Centric Video Generation
cs.CV 2026-04 unverdicted novelty 6.0

OmniHuman is a new large-scale multi-scene dataset with video-, frame-, and individual-level annotations for human-centric video generation, accompanied by the OHBench benchmark that adds metrics aligned with human pe...
SongBench: A Fine-Grained Multi-Aspect Benchmark for Song Quality Assessment
eess.AS 2026-04 unverdicted novelty 6.0

SongBench is a new fine-grained benchmark for song quality assessment with seven dimensions and an expert-annotated dataset of 11,717 samples showing high correlation with professional ratings.
Bridging What the Model Thinks and How It Speaks: Self-Aware Speech Language Models for Expressive Speech Generation
cs.CL 2026-04 unverdicted novelty 6.0

SA-SLM uses variational information bottleneck for intent-aware bridging and self-criticism for realization-aware alignment to close the semantic-acoustic gap, outperforming open-source models and nearing GPT-4o-Audio...
The Algorithmic Gaze of Image Quality Assessment: An Audit and Trace Ethnography of the LAION-Aesthetics Predictor
cs.HC 2026-01 conditional novelty 6.0

LAION-Aesthetics Predictor reinforces Western and male biases by preferentially selecting images associated with women and realistic Western/Japanese art while excluding men, LGBTQ+ references, and other styles.
Scaling Properties of Continuous Diffusion Spoken Language Models
cs.CL 2026-04 unverdicted novelty 5.0

Continuous diffusion spoken language models follow scaling laws for loss and phoneme divergence and generate emotive multi-speaker speech at 16B scale, though long-form coherence stays difficult.

Reference graph

Works this paper leans on

72 extracted references · 72 canonical work pages · cited by 18 Pith papers · 9 internal anchors

[1]

Davis and Paul Mermelstein , Journal =

Steven B. Davis and Paul Mermelstein , Journal =. Comparison of Parametric Representations for Monosyllabic Word Recognition in Continuously Spoken Sentences , Volume =

work page
[2]

Rabiner , Journal =

Lawrence R. Rabiner , Journal =. A Tutorial on Hidden

work page
[3]

The Elements of Statistical Learning -- Data Mining, Inference, and Prediction , Year =

Trevor Hastie and Robert Tibshirani and Jerome Friedman , Publisher =. The Elements of Statistical Learning -- Data Mining, Inference, and Prediction , Year =

work page
[4]

A really good paper about

Jane Smith and Firstname2 Lastname2 and Firstname3 Lastname3 , Pages =. A really good paper about. Proc

work page
[5]

An excellent paper introducing the

Robert Jones and Firstname2 Lastname2 and Firstname3 Lastname3 , Crossref =. An excellent paper introducing the

work page
[6]

Moore and Lucy Skidmore , title=

Roger K. Moore and Lucy Skidmore , title=. Proc

work page
[7]

IEEE Journal of Selected Topics in Signal Processing , volume=

Wavlm: Large-scale self-supervised pre-training for full stack speech processing , author=. IEEE Journal of Selected Topics in Signal Processing , volume=. 2022 , publisher=

work page 2022
[11]

Attention is All you Need , url =

Vaswani, Ashish and Shazeer, Noam and Parmar, Niki and Uszkoreit, Jakob and Jones, Llion and Gomez, Aidan N and Kaiser, ukasz and Polosukhin, Illia , booktitle =. Attention is All you Need , url =

work page
[13]

ICASSP 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) , pages=

Dnsmos: A non-intrusive perceptual objective speech quality metric to evaluate noise suppressors , author=. ICASSP 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) , pages=. 2021 , organization=

work page 2021
[14]

The T05 System for The

Kaito, Baba and Wataru, Nakata and Yuki, Saito and Hiroshi, Saruwatari , booktitle =. The T05 System for The

work page
[15]

ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) , pages=

Torchaudio-squim: Reference-less speech quality and intelligibility measures in torchaudio , author=. ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) , pages=. 2023 , organization=

work page 2023
[17]

11th ISCA Speech Synthesis Workshop (SSW 11) , year=

How do Voices from Past Speech Synthesis Challenges Compare Today? , author=. 11th ISCA Speech Synthesis Workshop (SSW 11) , year=

work page
[18]

Interspeech 2022 , year=

The VoiceMOS Challenge 2022 , author=. Interspeech 2022 , year=

work page 2022
[19]

ICASSP 2020-2020 IEEE international conference on acoustics, speech and signal processing (ICASSP) , pages=

ESPnet-TTS: Unified, reproducible, and integratable open source end-to-end text-to-speech toolkit , author=. ICASSP 2020-2020 IEEE international conference on acoustics, speech and signal processing (ICASSP) , pages=. 2020 , organization=

work page 2020
[20]

The blizzard challenge 2019 , author=. Proc. Blizzard Challenge Workshop , volume=

work page 2019
[21]

2017 IEEE international conference on acoustics, speech and signal processing (ICASSP) , pages=

Audio set: An ontology and human-labeled dataset for audio events , author=. 2017 IEEE international conference on acoustics, speech and signal processing (ICASSP) , pages=. 2017 , organization=

work page 2017
[24]

Richter, Julius and Wu, Yi-Chiao and Krenn, Steven and Welker, Simon and Lay, Bunlong and Watanabe, Shinjii and Richard, Alexander and Gerkmann, Timo , booktitle=

work page
[25]

2023 , booktitle =

Expresso: A Benchmark and Analysis of Discrete Expressive Speech Resynthesis , author =. 2023 , booktitle =. doi:10.21437/Interspeech.2023-1905 , issn =

work page doi:10.21437/interspeech.2023-1905 2023
[27]

and Branson, M

Ardila, R. and Branson, M. and Davis, K. and Henretty, M. and Kohler, M. and Meyer, J. and Morais, R. and Saunders, L. and Tyers, F. M. and Weber, G. , title =. Proceedings of the 12th Conference on Language Resources and Evaluation (LREC 2020) , pages =

work page 2020
[28]

and Beerends, J.G

Rix, A.W. and Beerends, J.G. and Hollier, M.P. and Hekstra, A.P. , booktitle=. Perceptual evaluation of speech quality (PESQ)-a new method for speech quality assessment of telephone networks and codecs , year=

work page
[29]

Journal of the Audio Engineering Society , volume=

Perceptual objective listening quality assessment (POLQA), the third generation itu-t standard for end-to-end speech quality measurement part i—temporal alignment , author=. Journal of the Audio Engineering Society , volume=. 2013 , publisher=

work page 2013
[32]

International conference on machine learning , pages=

Robust speech recognition via large-scale weak supervision , author=. International conference on machine learning , pages=. 2023 , organization=

work page 2023
[33]

IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP , year =

Large-scale Contrastive Language-Audio Pretraining with Feature Fusion and Keyword-to-Caption Augmentation , author =. IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP , year =

work page
[36]

2020 twelfth international conference on quality of multimedia experience (QoMEX) , pages=

ViSQOL v3: An open source production ready objective speech and audio metric , author=. 2020 twelfth international conference on quality of multimedia experience (QoMEX) , pages=. 2020 , organization=

work page 2020
[38]

International Telecommunications Union—Radiocommunication (ITU-T) , year=

Recommendation p.1401: Methods, metrics and procedures for statistical evaluation, qualification and comparison of objective quality prediction models , author=. International Telecommunications Union—Radiocommunication (ITU-T) , year=

work page
[42]

Interspeech , year=

MOSNet: Deep Learning-Based Objective Assessment for Voice Conversion , author=. Interspeech , year=

work page
[43]

ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) , pages=

Generalization ability of MOS prediction networks , author=. ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) , pages=. 2022 , organization=

work page 2022
[45]

2023 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU) , pages=

LE-SSL-MOS: Self-Supervised Learning MOS Prediction with Listener Enhancement , author=. 2023 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU) , pages=. 2023 , organization=

work page 2023
[46]

LAION-AESTHETICS , author=

work page
[47]

, author=

Quality Degradation Diagnosis for Voice Networks-Estimating the Perceived Noisiness, Coloration, and Discontinuity of Transmitted Speech. , author=. INTERSPEECH , pages=

work page
[48]

Interspeech 2021 , year=

NISQA: A Deep CNN-Self-Attention Model for Multidimensional Speech Quality Prediction with Crowdsourced Datasets , author=. Interspeech 2021 , year=

work page 2021
[49]

International Telecommunications Union—Radiocommunication (ITU-T), 2001

Recommendation p.1401: Methods, metrics and procedures for statistical evaluation, qualification and comparison of objective quality prediction models. International Telecommunications Union—Radiocommunication (ITU-T), 2001

work page 2001
[50]

MusicLM: Generating Music From Text

Andrea Agostinelli, Timo I Denk, Zal \'a n Borsos, Jesse Engel, Mauro Verzetti, Antoine Caillon, Qingqing Huang, Aren Jansen, Adam Roberts, Marco Tagliasacchi, et al. Musiclm: Generating music from text. arXiv preprint arXiv:2301.11325, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[51]

Ardila, M

R. Ardila, M. Branson, K. Davis, M. Henretty, M. Kohler, J. Meyer, R. Morais, L. Saunders, F. M. Tyers, and G. Weber. Common voice: A massively-multilingual speech corpus. In Proceedings of the 12th Conference on Language Resources and Evaluation (LREC 2020), pages 4211--4215, 2020

work page 2020
[52]

Layer Normalization

Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. Layer normalization. arxiv 2016. arXiv preprint arXiv:1607.06450, 1, 2016

work page internal anchor Pith review Pith/arXiv arXiv 2016
[53]

Perceptual objective listening quality assessment (polqa), the third generation itu-t standard for end-to-end speech quality measurement part i—temporal alignment

John G Beerends, Christian Schmidmer, Jens Berger, Matthias Obermann, Raphael Ullmann, Joachim Pomy, and Michael Keyhl. Perceptual objective listening quality assessment (polqa), the third generation itu-t standard for end-to-end speech quality measurement part i—temporal alignment. Journal of the Audio Engineering Society, 61 0 (6): 0 366--384, 2013

work page 2013
[54]

Wavlm: Large-scale self-supervised pre-training for full stack speech processing

Sanyuan Chen, Chengyi Wang, Zhengyang Chen, Yu Wu, Shujie Liu, Zhuo Chen, Jinyu Li, Naoyuki Kanda, Takuya Yoshioka, Xiong Xiao, et al. Wavlm: Large-scale self-supervised pre-training for full stack speech processing. IEEE Journal of Selected Topics in Signal Processing, 16 0 (6): 0 1505--1518, 2022

work page 2022
[55]

Visqol v3: An open source production ready objective speech and audio metric

Michael Chinen, Felicia SC Lim, Jan Skoglund, Nikita Gureev, Feargus O'Gorman, and Andrew Hines. Visqol v3: An open source production ready objective speech and audio metric. In 2020 twelfth international conference on quality of multimedia experience (QoMEX), pages 1--6. IEEE, 2020

work page 2020
[56]

How do voices from past speech synthesis challenges compare today? In 11th ISCA Speech Synthesis Workshop (SSW 11)

Erica Cooper and Junichi Yamagishi. How do voices from past speech synthesis challenges compare today? In 11th ISCA Speech Synthesis Workshop (SSW 11). ISCA, 2021

work page 2021
[57]

Investigating range-equalizing bias in mean opinion score ratings of synthesized speech

Erica Cooper and Junichi Yamagishi. Investigating range-equalizing bias in mean opinion score ratings of synthesized speech. arXiv preprint arXiv:2305.10608, 2023

work page arXiv 2023
[58]

Generalization ability of mos prediction networks

Erica Cooper, Wen-Chin Huang, Tomoki Toda, and Junichi Yamagishi. Generalization ability of mos prediction networks. In ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 8442--8446. IEEE, 2022

work page 2022
[59]

Pam: Prompting audio-language models for audio quality assessment

Soham Deshmukh, Dareen Alharthi, Benjamin Elizalde, Hannes Gamper, Mahmoud Al Ismail, Rita Singh, Bhiksha Raj, and Huaming Wang. Pam: Prompting audio-language models for audio quality assessment. arXiv preprint arXiv:2402.00282, 2023

work page arXiv 2023
[60]

High Fidelity Neural Audio Compression

Alexandre Défossez, Jade Copet, Gabriel Synnaeve, and Yossi Adi. High fidelity neural audio compression. arXiv preprint arXiv:2210.13438, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[61]

Quality-Net: An End-to-End Non-intrusive Speech Quality Assessment Model based on BLSTM

Szu-Wei Fu, Yu Tsao, Hsin-Te Hwang, and Hsin-Min Wang. Quality-net: An end-to-end non-intrusive speech quality assessment model based on blstm. arXiv preprint arXiv:1808.05344, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018
[62]

Audio set: An ontology and human-labeled dataset for audio events

Jort F Gemmeke, Daniel PW Ellis, Dylan Freedman, Aren Jansen, Wade Lawrence, R Channing Moore, Manoj Plakal, and Marvin Ritter. Audio set: An ontology and human-labeled dataset for audio events. In 2017 IEEE international conference on acoustics, speech and signal processing (ICASSP), pages 776--780. IEEE, 2017

work page 2017
[63]

Espnet-tts: Unified, reproducible, and integratable open source end-to-end text-to-speech toolkit

Tomoki Hayashi, Ryuichi Yamamoto, Katsuki Inoue, Takenori Yoshimura, Shinji Watanabe, Tomoki Toda, Kazuya Takeda, Yu Zhang, and Xu Tan. Espnet-tts: Unified, reproducible, and integratable open source end-to-end text-to-speech toolkit. In ICASSP 2020-2020 IEEE international conference on acoustics, speech and signal processing (ICASSP), pages 7654--7658. I...

work page 2020
[64]

Gaussian Error Linear Units (GELUs)

Dan Hendrycks and Kevin Gimpel. Gaussian error linear units (gelus). arXiv preprint arXiv:1606.08415, 2016

work page internal anchor Pith review Pith/arXiv arXiv 2016
[65]

The voicemos challenge 2022

Wen-Chin Huang, Erica Cooper, Yu Tsao, Hsin-Min Wang, Tomoki Toda, and Junichi Yamagishi. The voicemos challenge 2022. Interspeech 2022, 2022

work page 2022
[66]

MOS-Bench: Benchmarking Generalization Abilities of Subjective Speech Quality Assessment Models

Wen-Chin Huang, Erica Cooper, and Tomoki Toda. Mos-bench: Benchmarking generalization abilities of subjective speech quality assessment models. arXiv preprint arXiv:2411.03715, 2024a

work page internal anchor Pith review Pith/arXiv arXiv 2024
[67]

The voicemos challenge 2024: Beyond speech quality prediction

Wen-Chin Huang, Szu-Wei Fu, Erica Cooper, Ryandhimas E Zezario, Tomoki Toda, Hsin-Min Wang, Junichi Yamagishi, and Yu Tsao. The voicemos challenge 2024: Beyond speech quality prediction. arXiv preprint arXiv:2409.07001, 2024b

work page arXiv 2024
[68]

The T05 system for the VoiceMOS Challenge 2024: Transfer learning from deep image classifier to naturalness MOS prediction of high-quality synthetic speech

Kaito Baba, Wataru Nakata, Yuki Saito, and Hiroshi Saruwatari. The T05 system for the VoiceMOS Challenge 2024: Transfer learning from deep image classifier to naturalness MOS prediction of high-quality synthetic speech. In IEEE Spoken Language Technology Workshop (SLT), 2024

work page 2024
[69]

Fréchet Audio Distance: A Metric for Evaluating Music Enhancement Algorithms

Kevin Kilgour, Mauricio Zuluaga, Dominik Roblek, and Matthew Sharifi. Fréchet audio distance: A metric for evaluating music enhancement algorithms. arXiv preprint arXiv:1812.08466, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018
[70]

AudioCaps: Generating captions for audios in the wild

Chris Dongjoo Kim, Byeongchang Kim, Hyunmin Lee, and Gunhee Kim. AudioCaps: Generating captions for audios in the wild. In Jill Burstein, Christy Doran, and Thamar Solorio, editors, Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Pa...

work page doi:10.18653/v1/n19-1011 2019
[71]

Torchaudio-squim: Reference-less speech quality and intelligibility measures in torchaudio

Anurag Kumar, Ke Tan, Zhaoheng Ni, Pranay Manocha, Xiaohui Zhang, Ethan Henderson, and Buye Xu. Torchaudio-squim: Reference-less speech quality and intelligibility measures in torchaudio. In ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1--5. IEEE, 2023

work page 2023
[72]

Laion-aesthetics

LAION. Laion-aesthetics. https://laion.ai/blog/laion-aesthetics/. Accessed: 2024-12-06

work page 2024
[73]

The Llama 3 Herd of Models

AI at Meta Llama Team. The llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[74]

Mosnet: Deep learning-based objective assessment for voice conversion

Chen-Chou Lo, Szu-Wei Fu, Wen-Chin Huang, Xin Wang, Junichi Yamagishi, Yu Tsao, and Hsin-Min Wang. Mosnet: Deep learning-based objective assessment for voice conversion. Interspeech, 2019

work page 2019
[75]

Quality degradation diagnosis for voice networks-estimating the perceived noisiness, coloration, and discontinuity of transmitted speech

Gabriel Mittag and Sebastian Möller. Quality degradation diagnosis for voice networks-estimating the perceived noisiness, coloration, and discontinuity of transmitted speech. In INTERSPEECH, pages 3426--3430, 2019

work page 2019
[76]

Nisqa: A deep cnn-self-attention model for multidimensional speech quality prediction with crowdsourced datasets

Gabriel Mittag, Babak Naderi, Assmaa Chehadi, and Sebastian Möller. Nisqa: A deep cnn-self-attention model for multidimensional speech quality prediction with crowdsourced datasets. Interspeech 2021, 2021

work page 2021
[77]

AutoMOS: Learning a non-intrusive assessor of naturalness-of-speech

Brian Patton, Yannis Agiomyrgiannakis, Michael Terry, Kevin Wilson, Rif A Saurous, and D Sculley. Automos: Learning a non-intrusive assessor of naturalness-of-speech. arXiv preprint arXiv:1611.09207, 2016

work page internal anchor Pith review Pith/arXiv arXiv 2016
[78]

Le-ssl-mos: Self-supervised learning mos prediction with listener enhancement

Zili Qi, Xinhui Hu, Wangjin Zhou, Sheng Li, Hao Wu, Jian Lu, and Xinkang Xu. Le-ssl-mos: Self-supervised learning mos prediction with listener enhancement. In 2023 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), pages 1--6. IEEE, 2023

work page 2023
[79]

Robust speech recognition via large-scale weak supervision

Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, and Ilya Sutskever. Robust speech recognition via large-scale weak supervision. In International conference on machine learning, pages 28492--28518. PMLR, 2023

work page 2023
[80]

MUSDB18-HQ - an uncompressed version of musdb18

Zafar Rafii, Antoine Liutkus, Fabian-Robert Stöter, Stylianos Ioannis Mimilakis, and Rachel Bittner. Musdb18-hq - an uncompressed version of musdb18, August 2019. https://doi.org/10.5281/zenodo.3338373

work page doi:10.5281/zenodo.3338373 2019
[81]

Dnsmos: A non-intrusive perceptual objective speech quality metric to evaluate noise suppressors

Chandan KA Reddy, Vishak Gopal, and Ross Cutler. Dnsmos: A non-intrusive perceptual objective speech quality metric to evaluate noise suppressors. In ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 6493--6497. IEEE, 2021

work page 2021
[82]

EARS: An anechoic fullband speech dataset benchmarked for speech enhancement and dereverberation

Julius Richter, Yi-Chiao Wu, Steven Krenn, Simon Welker, Bunlong Lay, Shinji Watanabe, Alexander Richard, and Timo Gerkmann. EARS: An anechoic fullband speech dataset benchmarked for speech enhancement and dereverberation. In Interspeech, 2024

work page 2024
[83]

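Perceptual evaluation of speech quality (PESQ) - a new method for speech quality assessment of telephone networks and codecs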

A.W. Rix, J.G. Beerends, M.P. Hollier, and A.P. Hekstra. Perceptual evaluation of speech quality (pesq)-a new method for speech quality assessment of telephone networks and codecs. In IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP, 2001. doi:10.1109/ICASSP.2001.941023

work page doi:10.1109/icassp.2001.941023 2001
[84]

Utmos: Utokyo-sarulab system for voicemos challenge 2022

Takaaki Saeki, Detai Xin, Wataru Nakata, Tomoki Koriyama, Shinnosuke Takamichi, and Hiroshi Saruwatari. Utmos: Utokyo-sarulab system for voicemos challenge 2022. arXiv preprint arXiv:2204.02152, 2022

work page arXiv 2022
[85]

Attention is all you need

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In I. Guyon, U. Von Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017. https://...

work page 2017
[86]

Audiobox: Unified audio generation with natural language prompts

Apoorv Vyas, Bowen Shi, Matthew Le, Andros Tjandra, Yi-Chiao Wu, Baishan Guo, Jiemin Zhang, Xinyue Zhang, Robert Adkins, William Ngan, et al. Audiobox: Unified audio generation with natural language prompts. arXiv preprint arXiv:2312.15821, 2023

work page arXiv 2023
[87]

Large-scale contrastive language-audio pretraining with feature fusion and keyword-to-caption augmentation

Yusong Wu, Ke Chen, Tianyu Zhang, Yuchen Hui, Taylor Berg-Kirkpatrick, and Shlomo Dubnov. Large-scale contrastive language-audio pretraining with feature fusion and keyword-to-caption augmentation. In IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP, 2023

work page 2023
[88]

The blizzard challenge 2019

Zhizheng Wu, Zhihang Xie, and Simon King. The blizzard challenge 2019. In Proc. Blizzard Challenge Workshop, volume 2019, 2019

work page 2019
[89]

LibriTTS: A corpus derived from LibriSpeech for text-to-speech

Heiga Zen, Viet Dang, Rob Clark, Yu Zhang, Ron J. Weiss, Ye Jia, Zhifeng Chen, and Yonghui Wu. Libritts: A corpus derived from librispeech for text-to-speech. In Interspeech 2019, pages 1526--1530, 2019. doi:10.21437/Interspeech.2019-2441

work page doi:10.21437/interspeech.2019-2441 2019