NISQA: A Deep CNN-Self-Attention Model for Multidimensional Speech Quality Prediction with Crowdsourced Datasets

Assmaa Chehadi; Babak Naderi; Gabriel Mittag; Sebastian Möller

arxiv: 2104.09494 · v1 · pith:GVTB2DHFnew · submitted 2021-04-19 · eess.AS · cs.AI · cs.LG · cs.SD

classification eess.AS cs.AI cs.LG cs.SD

keywords model speech quality datasets nisqa overall prediction trained

In this paper, we present an update to the NISQA speech quality prediction model that is focused on distortions that occur in communication networks. In contrast to the previous version, the model is trained end-to-end, and the time-dependency modelling and time-pooling are achieved through a Self-Attention mechanism. Besides overall speech quality, the model also predicts the four speech quality dimensions Noisiness, Coloration, Discontinuity, and Loudness, and in this way gives more insight into the cause of a quality degradation. Furthermore, new datasets with over 13,000 speech files were created for training and validation of the model. The model was finally tested on a new, live-talking test dataset that contains recordings of real telephone calls. Overall, NISQA was trained and evaluated on 81 datasets from different sources and was shown to provide reliable predictions even for unknown speech samples. The code, model weights, and datasets are open-sourced.
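
As a rough illustration of the time-pooling idea mentioned in the abstract, the sketch below collapses a variable-length sequence of per-frame features into one vector with softmax attention weights, then maps that vector to the overall score and the four dimension scores. It is not the NISQA implementation: the function names, the single scoring vector `score_weights`, and the linear `heads` are hypothetical stand-ins, and the CNN and self-attention layers that would produce the frame features are omitted.

```python
import math

# Output names taken from the abstract; everything else here is illustrative.
DIMENSIONS = ["overall", "noisiness", "coloration", "discontinuity", "loudness"]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(frames, score_weights):
    """Pool a sequence of frame vectors into one vector.

    Each frame gets a scalar score (dot product with score_weights);
    the softmax of those scores weights the average over time.
    """
    scores = [sum(w * x for w, x in zip(score_weights, frame)) for frame in frames]
    weights = softmax(scores)
    dim = len(frames[0])
    pooled = [
        sum(wt * frame[d] for wt, frame in zip(weights, frames))
        for d in range(dim)
    ]
    return pooled, weights


def predict(frames, score_weights, heads):
    """Apply one hypothetical linear head per quality dimension."""
    pooled, _ = attention_pool(frames, score_weights)
    return {
        name: sum(w * x for w, x in zip(heads[name], pooled))
        for name in DIMENSIONS
    }
```

Because the weights are normalized over time, the same function handles speech files of any length; frames with a higher score contribute more to the pooled vector.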

Forward citations

Cited by 8 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Beyond Waveform Robustness: Robust Feature-Vocoder Adversarial Attacks on Automatic Speech Recognition
cs.SD 2026-06 unverdicted novelty 7.0

Introduces a feature-vocoder adversarial attack on ASR using SSL representations that reports +26.6 WER black-box transfer and +36.2 WER defense resistance over baselines.

VABench: A Comprehensive Benchmark for Audio-Video Generation
cs.CV 2025-12 unverdicted novelty 7.0

VABench is a new multi-dimensional benchmark for evaluating synchronous audio-video generation across text-to-AV, image-to-AV, and stereo tasks.

Emo-LiPO: Listwise Preference Optimization for Fine-Grained Emotion Intensity Control in LLM-based Text-to-Speech
cs.SD 2026-06 unverdicted novelty 6.0

Emo-LiPO applies listwise preference optimization to model global emotion intensity ordering in LLM TTS, yielding better accuracy and controllability than supervised or DPO baselines on a new multi-speaker dataset.

JASTIN: Aligning LLMs for Zero-Shot Audio and Speech Evaluation via Natural Language Instructions
eess.AS 2026-05 unverdicted novelty 6.0

JASTIN is an instruction-driven audio evaluation system that achieves state-of-the-art correlation with human ratings on speech, sound, music, and out-of-domain tasks without task-specific retraining.

Discrete Token Modeling for Multi-Stem Music Source Separation with Language Models
eess.AS 2026-04 unverdicted novelty 6.0

A Conformer-conditioned decoder-only language model generates discrete tokens via a neural audio codec to separate four music stems, reaching near state-of-the-art perceptual quality and top NISQA on vocals in MUSDB18...

Feature-Aligned Speech Watermarking for Robustness to Reconstruction Distortions
cs.SD 2026-06 unverdicted novelty 5.0

Feature-aligned watermarking embeds a codec-generated pseudo-speech signal into the spectrogram to raise robustness against reconstruction models while keeping imperceptibility comparable to prior methods.

Voice Mapping of Text-to-Speech Systems: A Metric-Based Approach for Voice Quality Assessment
eess.AS 2026-04 unverdicted novelty 3.0

Voice range indicates TTS model capability with VITS highest, Glow-TTS best at soft phonation, and CPPs of 7-8 dB marking natural quality while values over 10 dB sound robotic.

A Survey of Advancing Audio Super-Resolution and Bandwidth Extension from Discriminative to Generative Models
eess.AS 2026-05 unverdicted novelty 2.0

A structured survey of audio bandwidth extension that organizes the transition from deterministic discriminative DNNs to generative approaches including GANs, diffusion models, and flow-based methods.