Beyond Content: A Comprehensive Speech Toxicity Dataset and Detection Framework Incorporating Paralinguistic Cues

Liang Yi; Li Lu; Peng Cheng; Qingcao Li; Qinglong Wang; Zhongjie Ba

arXiv: 2605.15984 · v1 · pith:UN5FDI2Dnew · submitted 2026-05-15 · 💻 cs.SD · cs.AI · cs.CR

Pith reviewed 2026-05-19 18:31 UTC · model grok-4.3

classification 💻 cs.SD · cs.AI · cs.CR

keywords: toxic speech detection · paralinguistic cues · audio dataset · dual-head neural network · ToxiAlert-Bench · speech toxicity · multi-stage training

The pith

A dual-head model that separates paralinguistic from textual toxicity sources raises Macro-F1 by 21.1 percent relative to the strongest baseline in speech toxicity detection.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper creates ToxiAlert-Bench, an audio dataset of over 30,000 clips that labels both the type of toxicity and whether it originates in the words or in features such as tone, emotion, and pace. It introduces a neural network with two heads, one that identifies the source of sensitivity and one that names the specific toxic category. The heads are first trained separately and then fine-tuned together while using balanced sampling and weighted losses to address uneven class sizes. This design yields consistent gains over baselines that ignore spoken delivery.

Core claim

The authors show that a dual-head neural network trained in multiple stages on a dataset that explicitly annotates whether toxicity stems from textual content or paralinguistic cues produces higher detection performance than prior single-task models, with a 21.1 percent relative Macro-F1 gain and a 13.0 percent relative accuracy gain over the strongest baseline.
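The two headline numbers are relative gains on Macro-F1 and accuracy. As a minimal sketch of what those quantities mean, the snippet below computes Macro-F1 as the unweighted mean of per-class F1 and a relative gain as (new - baseline) / baseline. The function names and the toy labels are illustrative assumptions, and the scores in the usage lines are made-up values chosen only to show the arithmetic; they are not figures from the paper.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over every class seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    f1_scores = []
    for c in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never appears.
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)


def relative_gain(new_score, baseline_score):
    """Relative improvement of new_score over baseline_score."""
    return (new_score - baseline_score) / baseline_score


# Toy labels (hypothetical): each class counts equally, however rare it is.
toy_true = ["a", "a", "b", "b"]
toy_pred = ["a", "b", "b", "b"]
toy_macro_f1 = macro_f1(toy_true, toy_pred)

# Made-up scores: a baseline at 0.5000 and a model at 0.6055 differ by
# 0.1055 absolute, which is a 21.1 percent relative gain.
example_gain = relative_gain(0.6055, 0.5)
```

Because Macro-F1 weights every class equally, a relative gain on it says more about rare toxic categories than the same gain on accuracy would.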

What carries the argument

Dual-head neural network with one head for identifying toxicity source (textual or paralinguistic) and a second head for toxic category classification, trained independently before joint fine-tuning.
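The independent-then-joint schedule can be written down as a small table of which parameter groups are updated and which losses are active in each stage. This is a sketch under stated assumptions, not the authors' training code: the group names (`encoder`, `source_head`, `category_head`), the choice to keep the shared encoder frozen while each head trains alone, and the equal default loss weights are all hypothetical, since the abstract only states that the heads are trained independently and then fine-tuned jointly.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str
    trainable: frozenset  # parameter groups updated during this stage
    losses: tuple         # loss terms that contribute to the objective


# Hypothetical schedule: two single-head stages, then joint fine-tuning.
# Freezing the encoder in the first two stages is an assumption.
SCHEDULE = (
    Stage("source_head_only", frozenset({"source_head"}), ("source",)),
    Stage("category_head_only", frozenset({"category_head"}), ("category",)),
    Stage(
        "joint_fine_tuning",
        frozenset({"encoder", "source_head", "category_head"}),
        ("source", "category"),
    ),
)


def stage_objective(stage, loss_values, loss_weights=None):
    """Weighted sum of the loss terms that are active in the given stage."""
    loss_weights = loss_weights or {}
    return sum(loss_weights.get(k, 1.0) * loss_values[k] for k in stage.losses)


# Usage with made-up per-batch loss values.
batch_losses = {"source": 0.7, "category": 1.2}
objectives = [stage_objective(s, batch_losses) for s in SCHEDULE]
```

Keeping each head's loss out of the other head's stage is what limits task interference early on; only the final stage lets both losses shape the shared parameters.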

If this is right

Detection systems gain the ability to flag cases where neutral words become toxic only because of delivery.
Staged training reduces conflict between source detection and type classification tasks.
Class-balanced sampling and weighted losses improve reliability on infrequent toxic categories.
The dataset supplies a benchmark for evaluating any future paralinguistic-aware toxicity model.
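Class-balanced sampling and weighted losses are named but not specified in the abstract. One common way to realize both is sketched below, assuming inverse-frequency weights of the form w_c = N / (K * n_c) and a sampler that first picks a class uniformly and then an example from that class. The formula, the function names, and the toy label names `common` and `rare` are assumptions for illustration; the paper may use a different scheme.

```python
import random
from collections import Counter, defaultdict


def inverse_frequency_weights(labels):
    """Loss weight per class: w_c = N / (K * n_c), so rare classes weigh more."""
    counts = Counter(labels)
    n_examples, n_classes = len(labels), len(counts)
    return {c: n_examples / (n_classes * n_c) for c, n_c in counts.items()}


def class_balanced_batch(labels, batch_size, rng):
    """Draw indices so that every class is equally likely at each draw."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    classes = sorted(by_class)
    return [rng.choice(by_class[rng.choice(classes)]) for _ in range(batch_size)]


# Toy imbalanced label list (hypothetical class names).
toy_labels = ["common"] * 8 + ["rare"] * 2
toy_weights = inverse_frequency_weights(toy_labels)
toy_batch = class_balanced_batch(toy_labels, 16, random.Random(0))
```

With eight `common` and two `rare` examples, the weights come out to 0.625 and 2.5, so a mistake on the rare class costs four times as much. Using both the sampler and the weights at once can over-correct, which is one reason the exact recipe matters for reproducing the reported gains.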

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Platforms could apply different moderation thresholds depending on whether toxicity is word-driven or tone-driven.
The source distinction might transfer to related audio tasks such as sarcasm or intent detection.
Real-time voice interfaces could incorporate the same two-head structure for live safety filtering.

Load-bearing premise

Human annotators can reliably and consistently distinguish whether a toxic speech clip derives its harm from the words themselves or from paralinguistic delivery features.

What would settle it

A new set of independent annotators re-labels a held-out portion of the clips for toxicity source and produces low agreement with the original labels.
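Agreement between the original and the new source labels could be quantified with Cohen's kappa, which corrects raw agreement for the agreement expected by chance. The sketch below is a plain implementation for two label lists; the function name and the toy labels `text` and `para` are illustrative assumptions, and no threshold for "low agreement" is implied by the paper.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label.
    p_observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of the two annotators' marginal label rates.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    p_chance = sum(
        counts_a[c] * counts_b[c] for c in set(counts_a) | set(counts_b)
    ) / (n * n)
    if p_chance == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (p_observed - p_chance) / (1 - p_chance)


# Toy re-annotation of four clips (hypothetical labels).
original = ["text", "text", "para", "para"]
relabel = ["text", "para", "para", "para"]
toy_kappa = cohens_kappa(original, relabel)  # 0.5
```

In the toy case the annotators agree on three of four clips (0.75 raw agreement), but half of that is expected by chance, which leaves a kappa of 0.5.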

Figures

Figures reproduced from arXiv: 2605.15984 by Liang Yi, Li Lu, Peng Cheng, Qingcao Li, Qinglong Wang, Zhongjie Ba.

**Figure 1.** Overview of the ToxiAlert-Bench dataset construction framework. Pipeline1 (left) illustrates the collection and anno…

**Figure 2.** Overview of the ToxiAlert training framework. Multi-Stage Training Strategy: Stage 1 trains the source head to detect…

**Figure 3.** Performance comparison on source-specific tox…

**Figure 4.** Fine-grained comparison of ToxiAlert and Gemini-2.5-Flash on ToxiAlert-Bench. We report per-category accuracy…

**Figure 5.** An example annotation from ToxiAlert-Bench in…

**Figure 6.** Overview of ToxiAlert-Bench taxonomy and examples. The wheel illustrates the 7 coarse-grained toxic categories…

**Figure 7.** t-SNE visualization of KMeans clustering results…

**Figure 8.** Illustration of the unified multimodal prompt used…

**Figure 10.** Category-specific prompt examples used for gen…

**Figure 12.** Prompt and Gemini-2.5-Flash response for fine…

**Figure 13.** Prompt and responses from Qwen2, GPT-4o, and…

**Figure 14.** Example interaction from the generalization eval…

Original abstract

Toxic speech detection has become a crucial challenge in maintaining safe online communication environments. However, existing approaches to toxic speech detection often neglect the contribution of paralinguistic cues, such as emotion, intonation, and speech rate, which are key to detecting speech toxicity. Moreover, current toxic speech datasets are predominantly text-based, limiting the development of models that can capture paralinguistic cues. To address these challenges, we present ToxiAlert-Bench, a large-scale audio dataset comprising over 30,000 audio clips annotated with seven major toxic categories and twenty fine-grained toxic labels. Uniquely, our dataset annotates toxicity sources -- distinguishing between textual content and paralinguistic origins -- for comprehensive toxic speech analysis. Furthermore, we propose a dual-head neural network with a multi-stage training strategy tailored for toxic speech detection. This architecture features two task-specific classification headers: one for identifying the source of sensitivity (textual or paralinguistic), and the other for categorizing the specific toxic type. The training process involves independent head training followed by joint fine-tuning to reduce task interference. To mitigate data class imbalance, we incorporate class-balanced sampling and weighted loss functions. Our experimental results show that leveraging paralinguistic features significantly improves detection performance. Our method consistently outperforms existing baselines across multiple evaluation metrics, with a 21.1% relative improvement in Macro-F1 score and a 13.0% relative gain in accuracy over the strongest baseline, highlighting its enhanced effectiveness and practical applicability.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

New dataset with source-of-toxicity labels and dual-head model, but annotation reliability for the key distinction looks like the main open question.

The one thing to know is that this paper brings a new audio dataset for toxic speech that explicitly labels whether the toxicity is in the words or in the delivery, along with a model designed to use both. They collected over 30,000 clips and annotated them for seven main toxic types plus twenty finer ones, and added this source distinction. The model uses two heads, one to pick the source and one for the type, with separate training first and then joint fine-tuning. They also use class balancing and weighted loss. The results claim a 21 percent relative lift in Macro-F1 over the best baseline.

The dataset itself is the clearest addition. Most prior work on toxic speech stays in text, so having audio with paralinguistic cues marked is a step forward for anyone building detectors that listen to how something is said. The staged training is a practical choice to keep the tasks from stepping on each other.

The main concern is whether the source labels are solid. Everything rides on annotators being able to consistently separate textual content toxicity from paralinguistic origins without much noise. The abstract gives no numbers on agreement or any check on those labels, and the stress test note flags this correctly. If that split is noisy, then the source head learns junk and you cannot really credit the paralinguistic modeling for the gains. I would want to see the full methods section on how they collected and validated those annotations before accepting the performance story.

This paper is aimed at researchers and engineers working on audio-based content moderation and online safety systems. Someone building tools for platforms would get value from the dataset if the labels check out, and the model gives a concrete starting point. It is worth sending to peer review because the dataset is new and the approach is grounded in a real gap, even though the evaluation details need close checking on the annotation quality.

Send it for review but ask referees to focus on the label reliability and any controls for bias in the source annotations.

Referee Report

2 major / 1 minor

Summary. The paper introduces ToxiAlert-Bench, a dataset of over 30,000 audio clips annotated for seven major toxic categories, twenty fine-grained labels, and toxicity sources (textual content vs. paralinguistic origins). It proposes a dual-head neural network with multi-stage training (independent head training followed by joint fine-tuning), class-balanced sampling, and weighted loss to detect both the toxicity source and specific toxic type, claiming that incorporating paralinguistic cues yields a 21.1% relative Macro-F1 improvement and a 13.0% relative accuracy gain over the strongest baseline.

Significance. If the central claims hold after verification, the work would be significant for speech toxicity detection by addressing the neglect of paralinguistic cues (emotion, intonation, speech rate) in existing text-centric approaches. The large-scale audio dataset with source annotations could serve as a useful benchmark, and the dual-head architecture with staged training offers a practical way to handle multi-task interference. The reported gains, if reproducible with proper controls, would demonstrate the value of audio-specific modeling in this domain.

major comments (2)

[Abstract and Dataset Construction] Abstract and Dataset section: The headline performance claims (21.1% relative Macro-F1 lift, 13% accuracy gain) rest on the assumption that the 30k-clip annotations cleanly separate textual from paralinguistic toxicity sources, yet no inter-annotator agreement, confusion matrix, or validation subset for the source labels is referenced. Without this, the source-classification head may learn noise, undermining attribution of gains to paralinguistic modeling.
[Abstract and Experimental Results] Abstract and Experimental Results: The abstract reports relative gains over baselines but supplies no experimental details, baseline descriptions, statistical tests, or controls for confounds such as dataset construction biases or label noise. This prevents verification of the central claim that paralinguistic features drive the improvement.

minor comments (1)

[Method] The multi-stage training strategy is described at a high level; adding pseudocode or a diagram of the independent-then-joint schedule would improve clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thoughtful and constructive comments on our manuscript. We address each major comment point by point below, indicating the specific revisions we will make to improve clarity, verifiability, and robustness of the presented claims.

Referee: [Abstract and Dataset Construction] Abstract and Dataset section: The headline performance claims (21.1% relative Macro-F1 lift, 13% accuracy gain) rest on the assumption that the 30k-clip annotations cleanly separate textual from paralinguistic toxicity sources, yet no inter-annotator agreement, confusion matrix, or validation subset for the source labels is referenced. Without this, the source-classification head may learn noise, undermining attribution of gains to paralinguistic modeling.

Authors: We acknowledge that the manuscript does not currently report inter-annotator agreement metrics, a confusion matrix, or a dedicated validation subset analysis specifically for the toxicity source labels (textual vs. paralinguistic). In the revised version, we will expand the Dataset Construction section to describe the annotation protocol in greater detail, report agreement statistics (e.g., Cohen's or Fleiss' kappa) for the source annotations, include a confusion matrix for source labels, and present performance on a held-out validation subset. These additions will directly address concerns about label reliability and strengthen the link between paralinguistic modeling and observed gains.

revision: yes

Referee: [Abstract and Experimental Results] Abstract and Experimental Results: The abstract reports relative gains over baselines but supplies no experimental details, baseline descriptions, statistical tests, or controls for confounds such as dataset construction biases or label noise. This prevents verification of the central claim that paralinguistic features drive the improvement.

Authors: We agree that the abstract is too concise to convey experimental details. We will revise the abstract to briefly describe the baseline models (text-only and audio-based), note the use of class-balanced sampling and weighted loss as controls for imbalance and noise, and reference statistical significance testing for the reported improvements. The Experimental Results section will be expanded to explicitly discuss potential confounds such as dataset construction biases and label noise, along with the mitigation strategies employed and any statistical tests (e.g., McNemar's test or paired t-tests) used to validate the 21.1% Macro-F1 and 13% accuracy gains.

revision: yes
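McNemar's test, one of the options named in the response, compares two classifiers on the same test clips by looking only at the clips where exactly one of them is correct. The sketch below implements the exact (binomial) two-sided version; the function name and the toy predictions are assumptions for illustration, and nothing here reflects results from the paper.

```python
from math import comb


def mcnemar_exact(y_true, pred_a, pred_b):
    """Exact two-sided McNemar test; returns (a_only, b_only, p_value)."""
    # Discordant pairs: clips that exactly one of the two models gets right.
    a_only = sum(1 for t, a, b in zip(y_true, pred_a, pred_b) if a == t and b != t)
    b_only = sum(1 for t, a, b in zip(y_true, pred_a, pred_b) if a != t and b == t)
    n = a_only + b_only
    if n == 0:
        return a_only, b_only, 1.0  # the models never disagree in correctness
    # Under the null, each discordant clip favors either model with probability 1/2.
    k = min(a_only, b_only)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return a_only, b_only, min(1.0, 2 * tail)


# Toy case: model A is right on all ten clips, model B misses the first six.
toy_truth = [1] * 10
toy_pred_a = [1] * 10
toy_pred_b = [0] * 6 + [1] * 4
toy_result = mcnemar_exact(toy_truth, toy_pred_a, toy_pred_b)  # (6, 0, 0.03125)
```

The test applies to per-clip correctness, so it suits the accuracy comparison directly; a Macro-F1 difference would more naturally be checked with a paired bootstrap over test clips.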

Circularity Check

0 steps flagged

Empirical ML paper with no definitional or self-referential derivations

The paper introduces a new audio dataset with human annotations distinguishing textual vs. paralinguistic toxicity sources and describes a dual-head neural network trained via multi-stage fine-tuning with class-balanced sampling. Performance gains (e.g., 21.1% relative Macro-F1) are reported from standard experimental comparisons against baselines on held-out data. No equations, uniqueness theorems, ansatzes, or predictions appear that reduce by construction to fitted parameters or self-citations; the central claims rest on empirical results rather than any load-bearing derivation chain that collapses to its inputs.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

Abstract provides limited technical detail; main contributions are empirical dataset and model design rather than new theoretical constructs. Standard neural-network assumptions are implicit.

free parameters (1)

class weights in weighted loss
Introduced to address class imbalance; specific values or fitting procedure not stated in abstract.

axioms (1)

Domain assumption: Paralinguistic cues in audio can be reliably distinguished from textual content by human annotators and learned by neural networks.
Central to both the dataset labeling and the dual-head model design.

pith-pipeline@v0.9.0 · 5821 in / 1427 out tokens · 65110 ms · 2026-05-19T18:31:37.787721+00:00 · methodology
