WavLM: Large-scale self-supervised pre-training for full stack speech processing. IEEE Journal of Selected Topics in Signal Processing, 16(6):1505–1518
4 Pith papers cite this work.
citing papers explorer
- AffectCodec: Emotion-Preserving Neural Speech Codec with Block-Diagonal Residual FSQ
AffectCodec applies block-diagonal projections in residual FSQ to explicitly allocate bits to emotion and acoustic subspaces, combined with emotion conditioning, yielding better emotion preservation at low bitrates with competitive acoustic quality.
- AST: Adaptive, Seamless, and Training-Free Precise Speech Editing
AST enables seamless speech editing by latent recomposition on pre-trained TTS models plus adaptive weak fact guidance, plus a new dataset and WDTW metric, claiming 70% WER reduction and better temporal consistency without training.
- 3DXTalker: Unifying Identity, Lip Sync, Emotion, and Spatial Dynamics in Expressive 3D Talking Avatars
3DXTalker unifies identity modeling, lip synchronization, emotional expression, and head-pose dynamics in audio-driven 3D avatars via 2D-to-3D curation, amplitude/emotion audio cues, and a flow-matching transformer with prompt control.
- emg2speech: Synthesizing speech from electromyography using self-supervised speech models
EMG signals from orofacial muscles are mapped via linear transformation into self-supervised speech representation space to enable direct audio synthesis, shown on an ALS patient during silent articulation.