Rickford and Dan Jurafsky and Sharad Goel , title =
9 Pith papers cite this work. Polarity classification is still indexing.
Citation roles: background (1)
Citation polarities: background (1)
citing papers explorer
- Ouvia: A User-centered Framework for Measuring Usability of Speech Translation in Real-World Communication Scenarios
  Ouvia is a user-centered evaluation framework for speech translation usability in real-world scenarios, showing limited usability rates and the superiority of QA-based metrics.
- Toward Calibrated, Fair, and accurate Deepfake Detection
  Face-Feature Tuning is a label-free logit remapping method that reduces FPR/TPR gaps across groups in deepfake detection while preserving overall accuracy.
- "This Wasn't Made for Me": Recentering User Experience and Emotional Impact in the Evaluation of ASR Bias
  ASR bias causes users from underrepresented dialects to internalize failures as personal inadequacy and perform extensive emotional and linguistic labor, revealing harms missed by accuracy-only evaluations.
- Layer-wise Probing of wav2vec 2.0 and Whisper for Consonant Cluster Reduction in African American English
  Layer-wise probing of wav2vec2-base and Whisper-small shows both models distinguish reduced vs. canonical consonant clusters in AAE with high accuracy and retain cues to underlying stops, encoding CCR as gradient variation.
- Ethical and social risks of harm from Language Models
  The authors provide a detailed taxonomy of 21 risks associated with language models, covering discrimination, information leaks, misinformation, malicious applications, interaction harms, and societal impacts like job loss and environmental costs.
- SamaVaani: Auditing and Debiasing Multilingual Clinical ASR for Indian Languages
  Audit of multilingual clinical ASR reveals demographic biases; SamaVaani debiasing technique is proposed to jointly boost performance and fairness in Indian languages.
- Few-Shot Synthetic Accented Speech for ASR Fine-Tuning: What Helps and When?
  Random phoneme substitutions recover most ASR gains from synthetic accented speech, with targeted edits and ground-truth prosody providing only marginal additional benefits.
- Afrispeech Semantics: Evaluating Audio Semantic Reasoning in Spoken Language Models Across Domains and Accents
  Audio language models are benchmarked on five semantic and paralinguistic reasoning tasks to reveal limitations in handling spoken audio evidence, accent variation, and domain shifts.
- Benchmarked Yet Not Measured -- Generative AI Should be Evaluated Against Real-World Utility
  Generative AI evaluation must shift from static benchmark scores to measuring sustained improvements in human capabilities within specific deployment contexts.