SPARCLE: SPeaker-aware Aligned Representations via Contrastive Language Embeddings

2026 · cs.CL · arXiv 2607.01238

1 Pith paper cites this work. Polarity classification is still indexing.

abstract

Recent advances in speech synthesis have shifted from phoneme representations to direct grapheme modeling. While phonemes address the one-to-many mapping between text and acoustics, they rely on grapheme-to-phoneme (G2P) systems that fail to capture speaker-specific acoustic variation. Prior work demonstrates that grapheme-based models outperform phoneme-based systems at scale, but not in low-resource settings. In this paper, we propose SPARCLE, a speaker-aware grapheme representation model that enriches characters with their precise acoustic realizations. SPARCLE is trained with a contrastive objective that aligns graphemes with their corresponding Wav2Vec2 acoustic representations, conditioned on speaker identity. The resulting model serves as a replacement for G2P systems in downstream text-to-speech (TTS) tasks. We demonstrate that SPARCLE improves generation quality, halving word error rates in extreme low-resource settings compared to standard grapheme-based models.
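
The abstract does not spell out the contrastive objective, so the following is only a minimal sketch of one common choice, an InfoNCE-style loss, under stated assumptions. The additive speaker conditioning, the temperature value, the function names, and the toy vectors standing in for Wav2Vec2 features are all hypothetical and are not taken from the paper.

```python
import math

def l2_normalize(v):
    # Scale a vector to unit length (guarding against a zero vector).
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def condition_on_speaker(grapheme_vec, speaker_vec):
    # ASSUMPTION: conditioning is modeled as simple addition of a speaker
    # embedding; the paper may use a different mechanism.
    return [g + s for g, s in zip(grapheme_vec, speaker_vec)]

def info_nce(text_vecs, audio_vecs, temperature=0.1):
    # Mean InfoNCE loss in the text-to-audio direction.
    # Row i of text_vecs and row i of audio_vecs form the positive pair;
    # every other audio row in the batch acts as a negative.
    text = [l2_normalize(v) for v in text_vecs]
    audio = [l2_normalize(v) for v in audio_vecs]
    total = 0.0
    for i, t in enumerate(text):
        logits = [dot(t, a) / temperature for a in audio]
        m = max(logits)  # subtract the max for a stable log-sum-exp
        log_denom = m + math.log(sum(math.exp(z - m) for z in logits))
        total += log_denom - logits[i]
    return total / len(text)

# Toy data: hypothetical 3-dimensional stand-ins for grapheme embeddings
# and for Wav2Vec2 acoustic representations of the same two characters.
speaker = [0.1, -0.1, 0.0]
graphemes = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
acoustic = [[0.9, 0.1, 0.0], [0.1, 0.8, 0.1]]

conditioned = [condition_on_speaker(g, speaker) for g in graphemes]
aligned_loss = info_nce(conditioned, acoustic)
shuffled_loss = info_nce(conditioned, acoustic[::-1])
```

In this sketch the loss is lower when each speaker-conditioned grapheme vector is paired with its own acoustic vector than when the pairing is reversed, which is the behavior a contrastive alignment objective is meant to reward.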

representative citing papers

SPARCLE: SPeaker-aware Aligned Representations via Contrastive Language Embeddings

cs.CL · 2026-05-01 · unverdicted · novelty 4.0

SPARCLE builds speaker-aware grapheme representations by contrastively aligning characters with Wav2Vec2 acoustic embeddings conditioned on speaker identity, replacing G2P for TTS and halving WER in low-resource cases.

citing papers explorer

SPARCLE: SPeaker-aware Aligned Representations via Contrastive Language Embeddings cs.CL · 2026-05-01 · unverdicted · none · ref 2 · internal anchor
