Speak Your Mind: The Speech Continuation Task as a Probe of Voice-Based Model Bias

\'Eva Sz\'ekely; Gustav Eje Henter; Harm Lameris; Olivier Perrotin; Shree Harsha Bokkahalli Satish

arxiv: 2509.22061 · v2 · pith:2HNPKPGRnew · submitted 2025-09-26 · 📡 eess.AS · cs.CL· cs.SD

Speak Your Mind: The Speech Continuation Task as a Probe of Voice-Based Model Bias

Shree Harsha Bokkahalli Satish , Harm Lameris , Olivier Perrotin , Gustav Eje Henter , \'Eva Sz\'ekely This is my paper

Pith reviewed 2026-05-21 22:32 UTC · model grok-4.3

classification 📡 eess.AS cs.CLcs.SD

keywords speech continuationvoice biasgender biasphonation typespeech foundation modelsspeaker similaritytext-based metricsmodel evaluation

0 comments

The pith

The speech continuation task reveals systematic voice-quality biases in speech foundation models, with stronger reversion to modal phonation for female prompts.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes the speech continuation task as a method to probe biases by having models generate extensions of spoken prompts that preserve context and identity. It tests three models on how gender and phonation types in prompts influence outputs, measuring speaker similarity, voice quality, and text features like agency and polarity. Significant interactions appear between models and gender once coherence reaches a sufficient level, along with a tendency for models to normalize voice quality more aggressively for female inputs. This matters because it provides a direct way to check if models are carrying over social assumptions about voices into their generations.

Core claim

What carries the argument

The speech continuation task, which requires generating a coherent extension of a spoken prompt while preserving semantic context and speaker identity, used to evaluate bias through speaker similarity, voice quality preservation, and text-based metrics.

Load-bearing premise

That the chosen text-based bias metrics and voice quality preservation measures, combined with prompts varying only in gender and phonation type, are sufficient to isolate and reveal socially relevant representational biases without being dominated by model coherence limitations.

What would settle it

A demonstration that higher-coherence models produce no gender differences in agency, sentence polarity, or differential rates of reversion to modal phonation would show that the reported biases are artifacts of current model limitations rather than stable representational patterns.

read the original abstract

Speech Continuation (SC) is the task of generating a coherent extension of a spoken prompt while preserving both semantic context and speaker identity. Because SC is constrained to a single audio stream, it offers a more direct setting for probing biases in speech foundation models than dialogue does. In this work we present the first systematic evaluation of bias in SC, investigating how gender and phonation type (breathy, creaky, end-creak) affect continuation behaviour. We evaluate three recent models: SpiritLM (base and expressive), VAE-GSLM, and SpeechGPT across speaker similarity, voice quality preservation, and text-based bias metrics. Results show that while both speaker similarity and coherence remain a challenge, textual evaluations reveal significant model and gender interactions: once coherence is sufficiently high (for VAE-GSLM), gender effects emerge on text-metrics such as agency and sentence polarity. In addition, continuations revert toward modal phonation more strongly for female prompts than for male ones, revealing a systematic voice-quality bias. These findings highlight SC as a controlled probe of socially relevant representational biases in speech foundation models, and suggest that it will become an increasingly informative diagnostic as continuation quality improves.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper sets up speech continuation as a bias probe and reports gender and phonation effects, but the text metrics rest on an unverified ASR step that could explain the patterns.

read the letter

The main takeaway is that this work treats speech continuation as a controlled single-stream task for spotting representational bias, and it finds model-by-gender interactions on agency and polarity plus stronger modal-phonation reversion for female prompts once coherence is high enough in VAE-GSLM. That framing is new for this constrained setting compared with prior dialogue-focused checks. They evaluate SpiritLM base and expressive, VAE-GSLM, and SpeechGPT on speaker similarity, voice-quality preservation, and text metrics, which gives a practical starting point for people who want to test newer models the same way. The phonation bias observation is concrete and worth testing further if the numbers hold. The soft spot is the transcription pipeline. All text bias scores come after ASR on the generated audio, and the abstract already flags coherence problems. If word error rates or gender/entity tagging accuracy vary with phonation or speaker gender—as they commonly do for breathy, creaky, or female speech—then the reported interactions could be evaluation artifacts rather than model-internal bias. The paper does not appear to include condition-specific ASR error analysis or an oracle-text control, so the central claims rest on an assumption that needs direct checking. This is useful for researchers working on fairness diagnostics in speech generation. A reader who wants new task ideas or early evidence on voice-quality reversion would get value from it, even with the current limitations. It deserves a serious referee because the task framing is sound and the empirical setup is reproducible enough to improve with targeted revisions on the evaluation stack.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces the Speech Continuation (SC) task as a probe for voice-based biases in speech foundation models. It evaluates three models (SpiritLM base/expressive, VAE-GSLM, SpeechGPT) on continuations of prompts that vary only in speaker gender and phonation type (breathy, creaky, end-creak), measuring speaker similarity, voice-quality preservation, and text-based bias metrics such as agency and sentence polarity. The central claims are that model-by-gender interactions appear on text metrics once coherence is high enough (specifically for VAE-GSLM) and that continuations revert toward modal phonation more strongly for female prompts than for male prompts, indicating a systematic voice-quality bias.

Significance. If the reported interactions and voice-quality bias are shown to be robust to evaluation confounds, the work supplies a controlled, single-stream diagnostic that is more direct than dialogue-based probes for socially relevant representational biases in speech models. The empirical focus on existing models and the identification of a phonation-reversion asymmetry are potentially actionable for fairness improvements as continuation quality advances.

major comments (2)

[Abstract and §4 (Evaluation)] Abstract and §4 (Evaluation): The text-based bias metrics (agency, sentence polarity) are computed after ASR transcription of the generated continuations. Because ASR word-error rates and entity/gender tagging accuracy are known to vary systematically with phonation type and speaker gender, the reported model×gender interactions could be artifacts of the transcription step rather than model-internal biases. No stratified ASR error analysis by condition, no oracle-text ablation, and no manual transcription check are described.
[Abstract] Abstract: The statement that gender effects on text metrics 'emerge once coherence is sufficiently high (for VAE-GSLM)' is presented without a pre-specified coherence threshold, without the exact coherence metric, and without a statistical test for the interaction conditional on that threshold. This raises the possibility that the reported gender effects are driven by post-hoc subset selection rather than a general property of the model.

minor comments (2)

[Abstract] Abstract: The phrase 'significant model and gender interactions' is used without reporting the statistical test, degrees of freedom, or effect size; this should be stated explicitly even in the abstract.
[Methods] The manuscript should clarify whether the prompt set controls for lexical content across gender and phonation conditions or whether lexical differences could contribute to the observed polarity and agency differences.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. The comments highlight important methodological considerations for our evaluation of biases in speech continuation tasks. We address each major comment point by point below, indicating the revisions we will incorporate to strengthen the work.

read point-by-point responses

Referee: Abstract and §4 (Evaluation): The text-based bias metrics (agency, sentence polarity) are computed after ASR transcription of the generated continuations. Because ASR word-error rates and entity/gender tagging accuracy are known to vary systematically with phonation type and speaker gender, the reported model×gender interactions could be artifacts of the transcription step rather than model-internal biases. No stratified ASR error analysis by condition, no oracle-text ablation, and no manual transcription check are described.

Authors: We agree that variation in ASR performance across phonation types and speaker genders represents a potential confound that could affect the interpretation of text-based bias metrics. To address this directly, the revised manuscript will include a new stratified analysis of ASR word error rates and entity/gender tagging accuracy broken down by all experimental conditions (phonation type × speaker gender). We will also add an oracle-text ablation on a held-out subset of prompts using ground-truth transcripts to confirm that the reported model×gender interactions on agency and polarity persist independently of transcription errors. Finally, we performed a manual transcription audit on a random sample of 200 continuations and will report the observed error patterns and their distribution across conditions. These additions will allow readers to assess the robustness of our findings. revision: yes
Referee: Abstract: The statement that gender effects on text metrics 'emerge once coherence is sufficiently high (for VAE-GSLM)' is presented without a pre-specified coherence threshold, without the exact coherence metric, and without a statistical test for the interaction conditional on that threshold. This raises the possibility that the reported gender effects are driven by post-hoc subset selection rather than a general property of the model.

Authors: We appreciate the concern about post-hoc subset selection. The coherence metric is the cosine similarity between prompt and continuation sentence embeddings computed with a fixed pre-trained model (all-MiniLM-L6-v2), as defined in §4.1. In the revision we will explicitly pre-specify the threshold as the 75th percentile of VAE-GSLM coherence scores (a value we will state numerically) and move the subset analysis to the main results section with a pre-registered rationale. We will also report a formal test of the gender × model interaction restricted to the high-coherence subset, including the relevant F-statistic, p-value, and effect size. This will demonstrate that the gender effects are not an artifact of data-driven thresholding. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical evaluation of existing models on new prompts

full rationale

The paper conducts an empirical evaluation of three speech foundation models on the Speech Continuation task, applying standard metrics for speaker similarity, voice quality, and text-based bias measures to generated outputs from gender- and phonation-varied prompts. No derivations, first-principles predictions, fitted parameters, or self-citation chains are present in the provided abstract or described methodology; results are reported directly from experimental runs without any quantity being redefined or forced by construction from the inputs. The analysis remains self-contained against external benchmarks and model outputs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Empirical evaluation paper; no mathematical derivations or new theoretical constructs are described in the abstract.

pith-pipeline@v0.9.0 · 5768 in / 1213 out tokens · 52428 ms · 2026-05-21T22:32:16.051444+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

33 extracted references · 33 canonical work pages · 3 internal anchors

[1]

Speak Your Mind: The Speech Continuation Task as a Probe of Voice-Based Model Bias

INTRODUCTION Recent advances in large language model (LLM)-based speech generation have introduced the Speech Continua- tion (SC) task as a new model capability. In this task, the system is provided with a short audio prompt of a speaker and is required to generate a continuation that preserves speaker identity, prosody, and linguistic content [1]. The Th...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[2]

METHOD We develop a methodology to probe paralinguistic gender bias via the SC task, where a spoken prompt is extended by the model. We next describe the test data, evaluation dimen- sions, and experimental protocol, guided by the hypotheses that (i) gender and (ii) voice quality can systematically shape continuation outputs once sufficient coherence is a...

work page
[3]

Models We evaluated three models with public checkpoints that sup- port voice-conditioned SC: SpiritLM [4] (in two variants), Table 1

EXPERIMENTS 3.1. Models We evaluated three models with public checkpoints that sup- port voice-conditioned SC: SpiritLM [4] (in two variants), Table 1. Text evaluation dimensions of the SCs. Eval. Dimension Description & Scale Anchors (1–5) Semantic Coherence Coherence of continuation with the given prompt: 1 = Off-topic or incoherent; Additionally reads ...

work page
[4]

regularisation

RESULTS AND DISCUSSION 4.1. Evaluation of Continuation Speech Continuation:The first criterion is the ability of models to perform continuation, i.e., producing a speech signal as output. We obtained success scores of100 %for SpeechGPT,100 %for SpiritLM Base and Expr., and53 % for V AE-GSLM. As a result, further evaluations are per- formed on utterances w...

work page
[5]

Once semantic co- herence is high enough (for V AE-GSLM), significant gen- der differences begin to appear, specifically in theSentence PolarityandAgency & Competencemetrics

CONCLUSIONS Our evaluations reveal that current SC models vary widely in continuation quality and robustness. Once semantic co- herence is high enough (for V AE-GSLM), significant gen- der differences begin to appear, specifically in theSentence PolarityandAgency & Competencemetrics. We find that models disproportionately suppress non-modal phonation in f...

work page
[6]

Speechgen: Unlocking the generative power of speech language models with prompts,

H. Wu, K.-W. Chang, Y .-K. Wu, and H.-y. Lee, “Speechgen: Unlocking the generative power of speech language models with prompts,”arXiv preprint arXiv:2306.02207, 2023

work page arXiv 2023
[7]

AudioLM: a language modeling approach to audio generation,

Z. Borsos, R. Marinier, D. Vincentet al., “AudioLM: a language modeling approach to audio generation,” IEEE/ACM Trans. Audio, Speech, Lang. Process., 2023

work page 2023
[8]

Speechgpt-gen: Scaling chain-of-information speech generation,

D. Zhang, X. Zhang, J. Zhan, S. Li, Y . Zhou, and X. Qiu, “Speechgpt-gen: Scaling chain-of-information speech generation,”arXiv preprint arXiv:2401.13527, 2024

work page arXiv 2024
[9]

Spirit-LM: In- terleaved spoken and written language model,

T. A. Nguyen, B. Muller, B. Yuet al., “Spirit-LM: In- terleaved spoken and written language model,”Trans. Assoc. Comput. Linguist., vol. 13, pp. 30–52, 2025

work page 2025
[10]

A Variational Framework for Im- proving Naturalness in Generative Spoken Language Models,

L.-W. Chen, T. Higuchi, Z. Aldeneh, A. H. Abdelaziz, and A. Rudnicky, “A Variational Framework for Im- proving Naturalness in Generative Spoken Language Models,”arXiv preprint arXiv:2506.14767, 2025

work page arXiv 2025
[11]

Listen and speak fairly: a study on semantic gender bias in speech integrated large language models,

Y .-C. Lin, T.-Q. Lin, C.-K. Yanget al., “Listen and speak fairly: a study on semantic gender bias in speech integrated large language models,” inProc. SLT, 2024, pp. 439–446

work page 2024
[12]

Gender bias in instruction- guided speech synthesis models,

C.-Y . Kuan and H.-Y . Lee, “Gender bias in instruction- guided speech synthesis models,” inProc. NAACL, 2025, pp. 5387–5413

work page 2025
[13]

Who Gets the Mic? Investigating Gender Bias in the Speaker Assignment of a Speech-LLM,

D. Puhach, A. H. Payberah, and ´E. Sz´ekely, “Who Gets the Mic? Investigating Gender Bias in the Speaker Assignment of a Speech-LLM,” inProc. Interspeech, 2025, pp. 2058–2062

work page 2025
[14]

Quantifying bias in automatic speech recogni- tion,

S. Feng, O. Kudina, B. M. Halpern, and O. Scharen- borg, “Quantifying bias in automatic speech recogni- tion,”arXiv preprint arXiv:2103.15122, 2021

work page arXiv 2021
[15]

Exploring sources of racial bias in automatic speech recognition through the lens of rhythmic variation,

L.-F. Lai and N. Holliday, “Exploring sources of racial bias in automatic speech recognition through the lens of rhythmic variation,” inProc. Interspeech, 2023, pp. 1284–1288

work page 2023
[16]

Po- sition is power: System prompts as a mechanism of bias in large language models (llms),

A. Neumann, E. Kirsten, M. B. Zafar, and J. Singh, “Po- sition is power: System prompts as a mechanism of bias in large language models (llms),” inProc. F AccT, 2025, pp. 573–598

work page 2025
[17]

Spoken stere- oset: on evaluating social bias toward speaker in speech large language models,

Y .-C. Lin, W.-C. Chen, and H.-y. Lee, “Spoken stere- oset: on evaluating social bias toward speaker in speech large language models,” inProc. SLT, 2024, pp. 871– 878

work page 2024
[18]

Stereoset: Mea- suring stereotypical bias in pretrained language mod- els,

M. Nadeem, A. Bethke, and S. Reddy, “Stereoset: Mea- suring stereotypical bias in pretrained language mod- els,” inProc. ACL, 2021, pp. 5356–5371

work page 2021
[19]

V oiceQuali- tyVC: A V oice Conversion System for Studying the Per- ceptual Effects of V oice Quality in Speech,

H. Lameris, J. Gustafsson, and ´E. Sz´ekely, “V oiceQuali- tyVC: A V oice Conversion System for Studying the Per- ceptual Effects of V oice Quality in Speech,” inProc. In- terspeech, 2025, pp. 2295–2299

work page 2025
[20]

Two pragmatic functions of breathy voice in american english conversation,

N. Ward, A. Kirkland, M. Wlodarczak, and ´E. Sz´ekely, “Two pragmatic functions of breathy voice in american english conversation,” inProc. Speech Prosody, 2022, pp. 82–86

work page 2022
[21]

The role of creaky voice in turn taking and the perception of speaker stance: Experiments using controllable TTS,

H. Lameris, ´E. Sz´ekely, and J. Gustafson, “The role of creaky voice in turn taking and the perception of speaker stance: Experiments using controllable TTS,” inProc. LREC-COLING, 2024, pp. 16 058–16 065

work page 2024
[22]

Multimodal breathiness in interaction: From breathy voice quality to global breathy “body behavior quality

L. Tsvetanova, V . Auberg ´e, and Y . Sasa, “Multimodal breathiness in interaction: From breathy voice quality to global breathy “body behavior quality”,” inProc. VI- HAR, 2017

work page 2017
[23]

V ocal fry may undermine the suc- cess of young women in the labor market,

R. C. Anderson, C. A. Klofstad, W. J. Mayew, and M. Venkatachalam, “V ocal fry may undermine the suc- cess of young women in the labor market,”PloS one, vol. 9, no. 5, p. e97506, 2014

work page 2014
[24]

Creak prevalence and prosodic context in australian en- glish,

H. White, J. Penney, A. Gibson, A. Szakay, and F. Cox, “Creak prevalence and prosodic context in australian en- glish,” inProc. Interspeech, 2023, pp. 112–116

work page 2023
[25]

Effects of four voice qualities and formant dispersion on perception of a female voice,

A. Levitt and M. Lucas, “Effects of four voice qualities and formant dispersion on perception of a female voice,” Psychology of Language and Communication, vol. 22, no. 1, pp. 394–416, 2018

work page 2018
[26]

ECAPA-TDNN: Emphasized channel attention, prop- agation and aggregation in tdnn based speaker verifica- tion,

B. Desplanques, J. Thienpondt, and K. Demuynck, “ECAPA-TDNN: Emphasized channel attention, prop- agation and aggregation in tdnn based speaker verifica- tion,” inProc. Interspeech, 2020, pp. 3830–3834

work page 2020
[27]

Judging LLM-as-a-judge with MT-bench and Chatbot Arena,

L. Zheng, W.-L. Chiang, Y . Sheng, and et al., “Judging LLM-as-a-judge with MT-bench and Chatbot Arena,” in Proc. NeurIPS, 2023, pp. 46 595–46 623

work page 2023
[28]

Gender Bias in Coreference Resolution: Evaluation and Debiasing Methods

J. Zhao, T. Wang, M. Yatskar, V . Ordonez, and K.- W. Chang, “Gender bias in coreference resolution: Evaluation and debiasing methods,”arXiv preprint arXiv:1804.06876, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018
[29]

Man is to computer programmer as woman is to homemaker? Debiasing word embeddings,

T. Bolukbasi, K.-W. Chang, J. Y . Zou, V . Saligrama, and A. T. Kalai, “Man is to computer programmer as woman is to homemaker? Debiasing word embeddings,” in Proc. NeurIPS, 2016

work page 2016
[30]

Warmth and competence as universal dimensions of social percep- tion: The stereotype content model and the bias map,

A. J. Cuddy, S. T. Fiske, and P. Glick, “Warmth and competence as universal dimensions of social percep- tion: The stereotype content model and the bias map,” AESP, vol. 40, pp. 61–149, 2008

work page 2008
[31]

The ambivalent sexism inven- tory: Differentiating hostile and benevolent sexism,

P. Glick and S. T. Fiske, “The ambivalent sexism inven- tory: Differentiating hostile and benevolent sexism,” in Social cognition. Routledge, 2018, pp. 116–160

work page 2018
[32]

Unsupervised Discovery of Gendered Language through Latent-Variable Modeling

A. Hoyle, H. Wallach, I. Augenstein, R. Cotterell et al., “Unsupervised discovery of gendered lan- guage through latent-variable modeling,”arXiv preprint arXiv:1906.04760, 2019

work page internal anchor Pith review Pith/arXiv arXiv 1906
[33]

Gender and emotion expression: A de- velopmental contextual perspective,

T. M. Chaplin, “Gender and emotion expression: A de- velopmental contextual perspective,”Emotion Review, vol. 7, no. 1, pp. 14–21, 2015

work page 2015

[1] [1]

Speak Your Mind: The Speech Continuation Task as a Probe of Voice-Based Model Bias

INTRODUCTION Recent advances in large language model (LLM)-based speech generation have introduced the Speech Continua- tion (SC) task as a new model capability. In this task, the system is provided with a short audio prompt of a speaker and is required to generate a continuation that preserves speaker identity, prosody, and linguistic content [1]. The Th...

work page internal anchor Pith review Pith/arXiv arXiv 2025

[2] [2]

METHOD We develop a methodology to probe paralinguistic gender bias via the SC task, where a spoken prompt is extended by the model. We next describe the test data, evaluation dimen- sions, and experimental protocol, guided by the hypotheses that (i) gender and (ii) voice quality can systematically shape continuation outputs once sufficient coherence is a...

work page

[3] [3]

Models We evaluated three models with public checkpoints that sup- port voice-conditioned SC: SpiritLM [4] (in two variants), Table 1

EXPERIMENTS 3.1. Models We evaluated three models with public checkpoints that sup- port voice-conditioned SC: SpiritLM [4] (in two variants), Table 1. Text evaluation dimensions of the SCs. Eval. Dimension Description & Scale Anchors (1–5) Semantic Coherence Coherence of continuation with the given prompt: 1 = Off-topic or incoherent; Additionally reads ...

work page

[4] [4]

regularisation

RESULTS AND DISCUSSION 4.1. Evaluation of Continuation Speech Continuation:The first criterion is the ability of models to perform continuation, i.e., producing a speech signal as output. We obtained success scores of100 %for SpeechGPT,100 %for SpiritLM Base and Expr., and53 % for V AE-GSLM. As a result, further evaluations are per- formed on utterances w...

work page

[5] [5]

Once semantic co- herence is high enough (for V AE-GSLM), significant gen- der differences begin to appear, specifically in theSentence PolarityandAgency & Competencemetrics

CONCLUSIONS Our evaluations reveal that current SC models vary widely in continuation quality and robustness. Once semantic co- herence is high enough (for V AE-GSLM), significant gen- der differences begin to appear, specifically in theSentence PolarityandAgency & Competencemetrics. We find that models disproportionately suppress non-modal phonation in f...

work page

[6] [6]

Speechgen: Unlocking the generative power of speech language models with prompts,

H. Wu, K.-W. Chang, Y .-K. Wu, and H.-y. Lee, “Speechgen: Unlocking the generative power of speech language models with prompts,”arXiv preprint arXiv:2306.02207, 2023

work page arXiv 2023

[7] [7]

AudioLM: a language modeling approach to audio generation,

Z. Borsos, R. Marinier, D. Vincentet al., “AudioLM: a language modeling approach to audio generation,” IEEE/ACM Trans. Audio, Speech, Lang. Process., 2023

work page 2023

[8] [8]

Speechgpt-gen: Scaling chain-of-information speech generation,

D. Zhang, X. Zhang, J. Zhan, S. Li, Y . Zhou, and X. Qiu, “Speechgpt-gen: Scaling chain-of-information speech generation,”arXiv preprint arXiv:2401.13527, 2024

work page arXiv 2024

[9] [9]

Spirit-LM: In- terleaved spoken and written language model,

T. A. Nguyen, B. Muller, B. Yuet al., “Spirit-LM: In- terleaved spoken and written language model,”Trans. Assoc. Comput. Linguist., vol. 13, pp. 30–52, 2025

work page 2025

[10] [10]

A Variational Framework for Im- proving Naturalness in Generative Spoken Language Models,

L.-W. Chen, T. Higuchi, Z. Aldeneh, A. H. Abdelaziz, and A. Rudnicky, “A Variational Framework for Im- proving Naturalness in Generative Spoken Language Models,”arXiv preprint arXiv:2506.14767, 2025

work page arXiv 2025

[11] [11]

Listen and speak fairly: a study on semantic gender bias in speech integrated large language models,

Y .-C. Lin, T.-Q. Lin, C.-K. Yanget al., “Listen and speak fairly: a study on semantic gender bias in speech integrated large language models,” inProc. SLT, 2024, pp. 439–446

work page 2024

[12] [12]

Gender bias in instruction- guided speech synthesis models,

C.-Y . Kuan and H.-Y . Lee, “Gender bias in instruction- guided speech synthesis models,” inProc. NAACL, 2025, pp. 5387–5413

work page 2025

[13] [13]

Who Gets the Mic? Investigating Gender Bias in the Speaker Assignment of a Speech-LLM,

D. Puhach, A. H. Payberah, and ´E. Sz´ekely, “Who Gets the Mic? Investigating Gender Bias in the Speaker Assignment of a Speech-LLM,” inProc. Interspeech, 2025, pp. 2058–2062

work page 2025

[14] [14]

Quantifying bias in automatic speech recogni- tion,

S. Feng, O. Kudina, B. M. Halpern, and O. Scharen- borg, “Quantifying bias in automatic speech recogni- tion,”arXiv preprint arXiv:2103.15122, 2021

work page arXiv 2021

[15] [15]

Exploring sources of racial bias in automatic speech recognition through the lens of rhythmic variation,

L.-F. Lai and N. Holliday, “Exploring sources of racial bias in automatic speech recognition through the lens of rhythmic variation,” inProc. Interspeech, 2023, pp. 1284–1288

work page 2023

[16] [16]

Po- sition is power: System prompts as a mechanism of bias in large language models (llms),

A. Neumann, E. Kirsten, M. B. Zafar, and J. Singh, “Po- sition is power: System prompts as a mechanism of bias in large language models (llms),” inProc. F AccT, 2025, pp. 573–598

work page 2025

[17] [17]

Spoken stere- oset: on evaluating social bias toward speaker in speech large language models,

Y .-C. Lin, W.-C. Chen, and H.-y. Lee, “Spoken stere- oset: on evaluating social bias toward speaker in speech large language models,” inProc. SLT, 2024, pp. 871– 878

work page 2024

[18] [18]

Stereoset: Mea- suring stereotypical bias in pretrained language mod- els,

M. Nadeem, A. Bethke, and S. Reddy, “Stereoset: Mea- suring stereotypical bias in pretrained language mod- els,” inProc. ACL, 2021, pp. 5356–5371

work page 2021

[19] [19]

V oiceQuali- tyVC: A V oice Conversion System for Studying the Per- ceptual Effects of V oice Quality in Speech,

H. Lameris, J. Gustafsson, and ´E. Sz´ekely, “V oiceQuali- tyVC: A V oice Conversion System for Studying the Per- ceptual Effects of V oice Quality in Speech,” inProc. In- terspeech, 2025, pp. 2295–2299

work page 2025

[20] [20]

Two pragmatic functions of breathy voice in american english conversation,

N. Ward, A. Kirkland, M. Wlodarczak, and ´E. Sz´ekely, “Two pragmatic functions of breathy voice in american english conversation,” inProc. Speech Prosody, 2022, pp. 82–86

work page 2022

[21] [21]

The role of creaky voice in turn taking and the perception of speaker stance: Experiments using controllable TTS,

H. Lameris, ´E. Sz´ekely, and J. Gustafson, “The role of creaky voice in turn taking and the perception of speaker stance: Experiments using controllable TTS,” inProc. LREC-COLING, 2024, pp. 16 058–16 065

work page 2024

[22] [22]

Multimodal breathiness in interaction: From breathy voice quality to global breathy “body behavior quality

L. Tsvetanova, V . Auberg ´e, and Y . Sasa, “Multimodal breathiness in interaction: From breathy voice quality to global breathy “body behavior quality”,” inProc. VI- HAR, 2017

work page 2017

[23] [23]

V ocal fry may undermine the suc- cess of young women in the labor market,

R. C. Anderson, C. A. Klofstad, W. J. Mayew, and M. Venkatachalam, “V ocal fry may undermine the suc- cess of young women in the labor market,”PloS one, vol. 9, no. 5, p. e97506, 2014

work page 2014

[24] [24]

Creak prevalence and prosodic context in australian en- glish,

H. White, J. Penney, A. Gibson, A. Szakay, and F. Cox, “Creak prevalence and prosodic context in australian en- glish,” inProc. Interspeech, 2023, pp. 112–116

work page 2023

[25] [25]

Effects of four voice qualities and formant dispersion on perception of a female voice,

A. Levitt and M. Lucas, “Effects of four voice qualities and formant dispersion on perception of a female voice,” Psychology of Language and Communication, vol. 22, no. 1, pp. 394–416, 2018

work page 2018

[26] [26]

ECAPA-TDNN: Emphasized channel attention, prop- agation and aggregation in tdnn based speaker verifica- tion,

B. Desplanques, J. Thienpondt, and K. Demuynck, “ECAPA-TDNN: Emphasized channel attention, prop- agation and aggregation in tdnn based speaker verifica- tion,” inProc. Interspeech, 2020, pp. 3830–3834

work page 2020

[27] [27]

Judging LLM-as-a-judge with MT-bench and Chatbot Arena,

L. Zheng, W.-L. Chiang, Y . Sheng, and et al., “Judging LLM-as-a-judge with MT-bench and Chatbot Arena,” in Proc. NeurIPS, 2023, pp. 46 595–46 623

work page 2023

[28] [28]

Gender Bias in Coreference Resolution: Evaluation and Debiasing Methods

J. Zhao, T. Wang, M. Yatskar, V . Ordonez, and K.- W. Chang, “Gender bias in coreference resolution: Evaluation and debiasing methods,”arXiv preprint arXiv:1804.06876, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018

[29] [29]

Man is to computer programmer as woman is to homemaker? Debiasing word embeddings,

T. Bolukbasi, K.-W. Chang, J. Y . Zou, V . Saligrama, and A. T. Kalai, “Man is to computer programmer as woman is to homemaker? Debiasing word embeddings,” in Proc. NeurIPS, 2016

work page 2016

[30] [30]

Warmth and competence as universal dimensions of social percep- tion: The stereotype content model and the bias map,

A. J. Cuddy, S. T. Fiske, and P. Glick, “Warmth and competence as universal dimensions of social percep- tion: The stereotype content model and the bias map,” AESP, vol. 40, pp. 61–149, 2008

work page 2008

[31] [31]

The ambivalent sexism inven- tory: Differentiating hostile and benevolent sexism,

P. Glick and S. T. Fiske, “The ambivalent sexism inven- tory: Differentiating hostile and benevolent sexism,” in Social cognition. Routledge, 2018, pp. 116–160

work page 2018

[32] [32]

Unsupervised Discovery of Gendered Language through Latent-Variable Modeling

A. Hoyle, H. Wallach, I. Augenstein, R. Cotterell et al., “Unsupervised discovery of gendered lan- guage through latent-variable modeling,”arXiv preprint arXiv:1906.04760, 2019

work page internal anchor Pith review Pith/arXiv arXiv 1906

[33] [33]

Gender and emotion expression: A de- velopmental contextual perspective,

T. M. Chaplin, “Gender and emotion expression: A de- velopmental contextual perspective,”Emotion Review, vol. 7, no. 1, pp. 14–21, 2015

work page 2015