SwanVoice is a zero-shot TTS system for 1-4 speakers that reports higher richness and hierarchy scores than open-source baselines on monologue and dialogue tasks via mixed training and DiffusionNFT post-training.
Title resolution pending
7 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
roles
background 1polarities
background 1representative citing papers
Shared-score quaternion self-attention reduces score multiplications by 75% and softmax operations from four to one while proving equivalence to component-wise attention under quaternion linear projections.
SyncBreaker jointly attacks image and audio streams with Multi-Interval Sampling and Cross-Attention Fooling to degrade speech-driven talking head generation more than single-modality baselines.
Unified no-reference models assess audio aesthetics across speech, music, and sound via four perceptual axes and achieve performance comparable or superior to human mean opinion scores.
EntangleCodec unifies semantic and acoustic audio tokenization via caption alignment and flow-matching decoding, reporting competitive reconstruction, +7.4% gains on MMAR understanding, and 0.6B-parameter ALMs surpassing 13B-parameter continuous baselines.
Audio language models are benchmarked on five semantic and paralinguistic reasoning tasks to reveal limitations in handling spoken audio evidence, accent variation, and domain shifts.
citing papers explorer
-
SwanVoice: Expressive Long-Form Zero-Shot Speech Synthesis for Both Monologue and Dialogue
SwanVoice is a zero-shot TTS system for 1-4 speakers that reports higher richness and hierarchy scores than open-source baselines on monologue and dialogue tasks via mixed training and DiffusionNFT post-training.
-
Quaternion Self-Attention with Shared Scores
Shared-score quaternion self-attention reduces score multiplications by 75% and softmax operations from four to one while proving equivalence to component-wise attention under quaternion linear projections.
-
SyncBreaker:Stage-Aware Multimodal Adversarial Attacks on Audio-Driven Talking Head Generation
SyncBreaker jointly attacks image and audio streams with Multi-Interval Sampling and Cross-Attention Fooling to degrade speech-driven talking head generation more than single-modality baselines.
-
Meta Audiobox Aesthetics: Unified Automatic Quality Assessment for Speech, Music, and Sound
Unified no-reference models assess audio aesthetics across speech, music, and sound via four perceptual axes and achieve performance comparable or superior to human mean opinion scores.
-
EntangleCodec: A Unified Discrete Audio Tokenizer via Semantic-Acoustic Entanglement
EntangleCodec unifies semantic and acoustic audio tokenization via caption alignment and flow-matching decoding, reporting competitive reconstruction, +7.4% gains on MMAR understanding, and 0.6B-parameter ALMs surpassing 13B-parameter continuous baselines.
-
Afrispeech Semantics: Evaluating Audio Semantic Reasoning in Spoken Language Models Across Domains and Accents
Audio language models are benchmarked on five semantic and paralinguistic reasoning tasks to reveal limitations in handling spoken audio evidence, accent variation, and domain shifts.
- Balalaika: Data-Centric, Prosody-Aware Annotation Pipeline for Russian Speech